================================================================================ LECTURE 001 ================================================================================ Stanford CS229M - Lecture 1: Overview, supervised learning, empirical risk minimization Source: https://www.youtube.com/watch?v=I-tmjGFaaBg --- Transcript

[00:00:05] Okay, so let's get started. So, the formulation: most of this course will be about supervised learning. In some parts we're going to talk about other kinds of learning, but I think maybe 18 of the lectures will be about supervised learning. So this is about supervised learning.

[00:00:33] Let me give some definitions. There's an input space X — this is the data that you want to classify, or do regression on. There's a label space, called Y. And there's a joint probability distribution P over the space X × Y.

[00:01:22] And we're going to have some training data points (x_1, y_1), ..., (x_n, y_n); each data point is a pair of input and output. We're going to use n for the number of examples — n is reserved for the number of examples for this course. Each of these data points (x_i, y_i) is assumed to be drawn i.i.d. from this distribution P. So P is the distribution we're interested in, and we have some examples from it.

[00:02:04] And we have some loss function ℓ. This loss function takes in two labels, and it is a number that characterizes how different these two labels are. I think the typical convention is that the first one is the predicted label, and the second is the true label, or the observed label. And you assume that the loss is always nonnegative — I think in some cases the loss can be negative, but in most cases the loss is nonnegative.

[00:02:56] Now, you can also have a predictor, because this is what you're interested in — you want to find one. Sometimes it's called a model, sometimes a hypothesis; we're going to use all of these interchangeably. They're used in somewhat different contexts, but they all mean the same thing: the function you want to look for to predict your label. So this is a function, call it h; it's a mapping from X to Y. And you can define the loss of the predictor on an example (x, y): the loss will be ℓ(h(x), y) — you first plug in h(x), which is your prediction, and then you have y.

[00:03:58] And after you define all of this, you can define the so-called expected — or population — risk, or loss. This is kind of the interesting thing about machine learning: everything has at least two names. I think two is a lower bound; sometimes it's three. And also my brain uses different names in different kinds of situations — when I learned this part of things, that literature used one name, and if you learn something else, those kinds of papers use a different name. So these names are spread into different parts of my brain, and I might use slightly inconsistent terminology. But all of these are the same: "expected" just means population, and "risk" just means loss. Of course, I will try to be consistent as much as possible.

[00:05:06] So this expected risk, or population risk, is defined to be the expectation of the loss:

L(h) = E_{(x, y) ~ P} [ℓ(h(x), y)]

Here the random variables are x and y, and they are drawn from this population distribution P — that's why it's called the population risk. And this is your final goal: your final goal is basically to find an h that minimizes the population risk. At least this is the goal for the first, say, 15 lectures. This is the goal for supervised learning: you just want to predict y as well as possible.
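To make the population risk concrete, here is a minimal numerical sketch. Everything in it is an assumption for illustration — the toy distribution P (a 1-D Gaussian input with y = 2x plus noise), the fixed predictor h, and the square loss are my choices, not the lecture's — but it shows L(h) being approximated by averaging the loss over a large i.i.d. sample from P:

```python
import numpy as np

# Hypothetical toy setup (not from the lecture): P is a joint
# distribution over (x, y) with y = 2x + small Gaussian noise.
rng = np.random.default_rng(0)

def sample_P(n):
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return x, y

def loss(y_hat, y):            # square loss, l(y_hat, y) = 0.5 (y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2

def h(x):                      # some fixed predictor h : X -> Y
    return 1.5 * x

# L(h) = E_{(x, y) ~ P}[ l(h(x), y) ], approximated by a large i.i.d. sample.
x, y = sample_P(1_000_000)
L_h = loss(h(x), y).mean()
print(L_h)                     # ≈ 0.5 * (2 - 1.5)^2 * Var(x) + 0.5 * 0.01 ≈ 0.13
```

With a closed-form P like this one we can check the Monte Carlo estimate against the exact expectation; in real problems P is unknown, which is the whole point of the course.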
[00:06:05] Okay, so to achieve this goal you also have to introduce more concepts. One concept is this so-called hypothesis class, sometimes called a hypothesis family — you can also call it predictor class, predictor family, model class, model family. So let's call it capital H, and this is a set of functions from X to Y.

[00:06:40] And you can define the so-called excess risk, because at the end of the day you're going to search over a set of functions — and maybe this set of functions is very bad; for example, maybe it only contains one function. So that's why people define this so-called excess risk, which tries to define your error relative to the power of this hypothesis class, this set of functions. The excess risk with respect to capital H is defined to be your population loss, or population risk, minus the best you can find in this family:

E(h) = L(h) − min_{g ∈ H} L(g)

[00:07:50] [Student question about min versus inf.] Ah — the inf, yes. This is a good question. For this course, let's say they are exactly the same. Of course they are not exactly the same, just because sometimes you don't have a unique minimizer — you can have a sequence of functions approaching the infimum. Maybe I'll have a post to explain the subtle differences between these two, but for exactly this entire course you can just assume inf is the same as min.

[00:08:31] Cool. So, and this is at least zero, because this term is the minimum, and there's no way you can get better than the minimum — that's why it's at least zero. So in some sense, thinking about the excess risk is one way to think only within the family H: if you get zero excess risk, that means you cannot do anything better within this family. Of course, if you change your family, maybe you can get something else.
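Here is a tiny worked instance of the excess risk. The setup is my own construction, not the lecture's: a finite hypothesis class of constant predictors h_c(x) = c under the square loss, with a population y ~ N(1, 1) (x ignored) chosen so that L(h_c) has a closed form, L(h_c) = 0.5((c − 1)² + 1):

```python
# Toy illustration (assumed setup, not the lecture's): a tiny finite
# hypothesis class of constant predictors h_c(x) = c, square loss, and a
# population y ~ N(1, 1) so the population risk integrates in closed form.

def L(c):
    # L(h_c) = E[0.5 * (c - y)^2] = 0.5 * ((c - 1)^2 + 1) for y ~ N(1, 1)
    return 0.5 * ((c - 1.0) ** 2 + 1.0)

H = [0.0, 0.5, 2.0]                      # the hypothesis "class"
best = min(L(c) for c in H)              # best achievable risk within H

# Excess risk of each h relative to the class: L(h) - min_{g in H} L(g).
excess = {c: L(c) - best for c in H}
print(excess)                            # all >= 0; zero for the best member (c = 0.5)
```

Note the within-family point from the lecture: c = 0.5 gets excess risk zero even though its absolute risk is 0.625 — the family just doesn't contain anything better (here it doesn't even contain the global optimum c = 1).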
[00:09:01] But at least within this family there's no way you can do better. Okay, so this is the basic language we are going to work in for this entire course. Any questions so far? In any case, feel free to interrupt me at any point — you don't have to wait until I pause — either in the Zoom meeting or here.

[00:09:25] So, some quick examples to make it less abstract, because I assume this was relatively abstract. One type of question is the regression problem, where your label set Y is the real numbers — so these are continuous labels. And oftentimes for a regression problem you have the so-called square loss:

ℓ(ŷ, y) = ½ (ŷ − y)²

For example, if you want to predict the temperature, it makes sense to use the square loss; of course there are different losses.

[00:10:11] Another possibility is the classification problem. In this case Y is a discrete set: you have a set of K labels — it could be two labels, cat versus dog, or it could be multiple labels. And then the final loss you care about is often this so-called zero-one loss: basically you say that if you didn't get the right label then the loss equals one, and otherwise it equals zero:

ℓ(ŷ, y) = 1{ŷ ≠ y}

Here 1{·} is the indicator: the indicator of an event E is one if E happens, and zero otherwise. You'll see that when you really run a practical machine learning algorithm, you are not going to train with this loss, because of other issues — but this is the loss you care about eventually, or at least one of the losses you could care about eventually; this is the so-called accuracy, or the error. But when you train it, maybe you use cross-entropy — that's a slightly different question.
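The two losses above can be written down directly. This is a straightforward transcription of the lecture's definitions (vectorized over NumPy arrays, which is my convenience, not something the lecture specifies):

```python
import numpy as np

# The two losses from the lecture, vectorized over arrays of labels.

def square_loss(y_hat, y):
    """Regression: l(y_hat, y) = 0.5 * (y_hat - y)^2."""
    return 0.5 * (np.asarray(y_hat) - np.asarray(y)) ** 2

def zero_one_loss(y_hat, y):
    """Classification: l(y_hat, y) = 1{y_hat != y} (the indicator)."""
    return (np.asarray(y_hat) != np.asarray(y)).astype(float)

print(square_loss(3.0, 1.0))                 # 2.0
print(zero_one_loss([0, 1, 1], [0, 0, 1]))   # [0. 1. 0.]
```

Averaging `zero_one_loss` over a dataset gives exactly the classification error rate (one minus accuracy), which is why the lecture calls it the loss you eventually care about.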
[00:11:26] Okay, so now, this is the setup and the goals. Now let's talk about one important algorithm, which is called empirical risk minimization. This is the algorithm — or type of algorithm — that we are going to analyze for quite some time. The algorithm is very simple; I guess this is what you do in practice every day. You have some training loss — sometimes it's called the empirical loss, and sometimes it's called the empirical risk. For this loss we use L-hat; the hat means empirical — pretty much every time you see a hat here it means empirical. It's the average of the loss on all the examples:

L̂(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)

And then you do the so-called empirical risk minimization, ERM, where ĥ is the best model in the family — I guess here I'm using argmin, and again argmin with inf or min is just exactly the same for this course:

ĥ = argmin_{h ∈ H} L̂(h)

So you find the best model within the family that minimizes your empirical risk. And you can break ties arbitrarily — we don't care about tie-breaking in many cases. So this is the algorithm. To use this algorithm you may need some other optimizer to find the minimum, but this is the abstract way of thinking of the algorithm: you find a minimum.

[00:13:20] And the key question is: why is this a good algorithm? Why is this doing something sensible? One of the reasons why this is somewhat meaningful — as I guess you know already from previous classes — is because the (x_i, y_i) are i.i.d. from P. So if you take the expectation of the empirical loss on one example, over the randomness of the examples, then this equals the population risk:

E[ℓ(h(x_i), y_i)] = E_{(x, y) ~ P}[ℓ(h(x), y)] = L(h)

To verify this is just a change of notation, in some sense. So this is saying that if you take the expectation of the empirical loss L̂(h) — which is the average of all of these n terms — it equals L(h):

E[L̂(h)] = L(h)

And here the randomness comes from all the x_i's and y_i's.
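The identity E[L̂(h)] = L(h) can be sanity-checked by simulation. The population, predictor, and loss below are assumed toy choices (y ~ N(0, 1), the constant predictor h(x) = 0, square loss — so L(h) = 0.5 exactly); the point is that averaging L̂ over many independently drawn training sets recovers L:

```python
import numpy as np

# Sanity check of E[L_hat(h)] = L(h) on an assumed toy population:
# y ~ N(0, 1), fixed predictor h(x) = 0, square loss, so L(h) = E[0.5 y^2] = 0.5.
rng = np.random.default_rng(1)

def h(x):
    return 0.0                       # fixed, data-independent predictor

def empirical_risk(x, y):
    return np.mean(0.5 * (h(x) - y) ** 2)

n, trials = 20, 20_000
L_hats = [empirical_risk(rng.normal(size=n), rng.normal(size=n))
          for _ in range(trials)]
print(np.mean(L_hats))               # ≈ 0.5 = L(h)
```

One caveat worth flagging: the unbiasedness holds for a *fixed* h, chosen independently of the sample. Once h is the ERM output ĥ — chosen using the same data — L̂(ĥ) is no longer unbiased for L(ĥ), and quantifying that gap is exactly what the course is about.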
[00:14:55] So this is the typical justification we have for this kind of algorithm: because the empirical loss is a good — in particular, an unbiased — estimate of the population loss, minimizing the empirical loss probably would lead you to also minimize the population loss. So in some sense, at least a good part of this course is to justify more formally why this is the right thing. Intuitively it sounds right, but formally we want to actually prove that this is the right thing — and it's actually not that easy, because it does depend on some other things, for example how many examples you have and how large your hypothesis class H is. It's not that simple; this is just intuition.

[00:15:43] All right, any questions so far? And also — I assume most of you know this already; this is just a formal definition. When you really do this, you have a hypothesis class, and when you really do it on a computer you have to have a parameterized family, so you can optimize the parameters. So you can also have a parameterized family: call this h_θ, where θ is in some space of parameters Θ — let's say Θ ⊆ R^d. Capital Θ is the family of parameters; sometimes you want to restrict it, say to only sparse parameters, or only a certain kind of elements. An example of this is that you can take h_θ(x) = θ^T x — then this is all the linear models.

[00:17:16] Okay, so this is easy, and then you can also do ERM for a parameterized family. I guess this is actually probably the most important case. In particular, with a parameterized family, let's still write the training loss as L-hat, but — with a little abuse of notation — θ is now the input of the training loss. Before, the training loss was a function of the model, and now it's a function of the parameter, because the model and the parameter are in one-to-one correspondence in some sense — or maybe not one-to-one, but they have a correspondence. Your representation of the model is really through the parameter: each parameter corresponds to a model. So this is just what you're expecting, probably:

L̂(θ) = (1/n) Σ_{i=1}^{n} ℓ(h_θ(x_i), y_i)

This is the empirical loss, and here I'm overloading the notation a little bit — and we are going to overload this notation many times in this course.

[00:18:27] And sometimes, alternatively — again with a little abuse of notation — you write this as

ℓ((x_i, y_i), θ)

just because θ and (x_i, y_i) are what you care about: after you know these things, you can compute the loss. These are just some notations; we are sometimes going to use one and sometimes the other, interchangeably, so it's good to be aware of that. And you can define the so-called ERM solution, which is the argmin of the empirical loss over θ in this parameter set capital Θ:

θ̂ = argmin_{θ ∈ Θ} L̂(θ)

Sometimes you just write it as θ̂_ERM, as shorthand for the ERM solution.
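For the linear family h_θ(x) = θᵀx paired with the square loss, the ERM solution θ̂ = argmin_θ L̂(θ) is ordinary least squares, which NumPy can solve directly. The data below is an assumed toy instance (my choice of θ*, noise level, and sizes), just to show the two pieces — the parameterized family and the empirical risk — coming together:

```python
import numpy as np

# ERM for the linear family h_theta(x) = theta^T x with the square loss.
# With this pairing, argmin_theta L_hat(theta) is ordinary least squares.
# The toy data below is assumed, not the lecture's.
rng = np.random.default_rng(2)

n, d = 200, 3
theta_star = np.array([1.0, -2.0, 0.5])           # ground-truth parameter
X = rng.normal(size=(n, d))                       # rows are the x_i
y = X @ theta_star + 0.1 * rng.normal(size=n)     # y_i with small noise

# theta_hat = argmin_theta (1/n) * sum_i 0.5 * (theta^T x_i - y_i)^2
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                                  # close to theta_star

train_loss = np.mean(0.5 * (X @ theta_hat - y) ** 2)
print(train_loss)                                 # the minimized empirical risk
```

For richer families (neural networks, say) there is no closed form and you would run gradient descent on L̂(θ) instead — that's the "you may need some other optimizer" remark from earlier; ERM is the abstract specification, not the numerical method.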
remember some of this kind of like uh cases just [00:19:31] some of this kind of like uh cases just we're going to remind you later so in [00:19:33] we're going to remind you later so in the goal as you can expect it it's [00:19:35] the goal as you can expect it it's really again just to to show [00:19:40] really again just to to show the excess risk [00:19:44] of say the Hat erm [00:19:47] of say the Hat erm is small [00:19:49] is small because that's that's the success kind [00:19:51] because that's that's the success kind of Criterion right you want to show that [00:19:53] of Criterion right you want to show that you you find something ahead [00:19:57] you you find something ahead and this other hand is working and [00:19:59] and this other hand is working and working in the sense that the excess [00:20:00] working in the sense that the excess risk [00:20:05] is small [00:20:08] is small and this is kind of the basically the [00:20:11] and this is kind of the basically the goal of the first probably a few weeks [00:20:15] goal of the first probably a few weeks um and [00:20:16] um and and the core in some sense is really to [00:20:19] and the core in some sense is really to I guess just a kind of like a [00:20:22] I guess just a kind of like a um [00:20:23] um a trailer in some sense the the core [00:20:25] a trailer in some sense the the core idea is to show that L Theta is close to [00:20:29] idea is to show that L Theta is close to L height Theta [00:20:31] L height Theta right because you are minimizing the L [00:20:34] right because you are minimizing the L half data but you care about L Theta so [00:20:36] half data but you care about L Theta so you have to show this to our similar in [00:20:38] you have to show this to our similar in some sense but it's not that easy [00:20:41] some sense but it's not that easy but [00:20:44] uh sorry this is a I guess that's it's [00:20:47] uh sorry this is a I guess that's it's me sorry this is [00:20:52] actually I 
[00:20:54] Sorry — actually I have a typo here; it's in my notes as well. Okay, next. So the goal is to show that your algorithm works: that θ̂, the ERM, is doing something right. And what does it mean for a model to be good? At least in our definition, it really only means that the excess risk is small: if you can make sure that you're close to getting the best model in this family, that means you're doing well. So that's why the goal is to show the excess risk is small for this model. Eventually you care about the learning algorithm, but to show this, it does depend on what the family of hypotheses is. The final goal is to show that a learning algorithm using this family of models can work.
[00:22:04] Q: But you never actually evaluate L(θ̂) — can you, empirically? — Yes, you can evaluate it pretty well, in the sense that you can use holdout data; that's why validation data is used. Of course there are some subtleties about how you evaluate L. The ideal scenario is that you collect some new, fresh data and then use the empirical estimator on it. The subtlety is whether you have seen this data before: if you haven't seen it before, then you're all great, but if you have seen it, then it becomes tricky — and that's exactly the situation here, because L̂(θ̂) is computed on the same data that produced θ̂. So intuitively L̂(θ̂) ≈ L(θ̂) feels very much correct, but there's a question — we'll talk more about this; the subtlety is whether you have seen the data before or not.
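The holdout idea can be sketched as follows (an illustrative linear-regression setup of my own, not the lecture's): the empirical risk on data you have already fit underestimates the population risk, while a large fresh validation set gives an honest estimate of it.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_train, n_val = 10, 30, 100_000
theta_true = rng.normal(size=p)

def sample(n):
    X = rng.normal(size=(n, p))
    return X, X @ theta_true + rng.normal(size=n)   # noise variance 1

X_tr, y_tr = sample(n_train)
X_val, y_val = sample(n_val)     # fresh data, never used for fitting
theta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def risk(X, y):
    return np.mean((X @ theta_hat - y) ** 2)

train_risk, holdout_risk = risk(X_tr, y_tr), risk(X_val, y_val)

# Training risk is biased low (we've "seen" that data); the holdout is not.
assert train_risk < holdout_risk
```

With only 30 training points and 10 parameters, the gap between the two numbers is the overfitting effect the lecture is pointing at.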
[00:23:02] Any other questions? Okay, cool, sounds good. So this is the main topic of this course, although there will be more subtleties about it — for example, in the first few weeks we'll talk about this, and then other things later in the course. We're also going to talk about, for example, how to minimize L̂(θ): suppose you know all of this theory, and it's great, but you still want to know how to actually do the minimization in a computationally efficient way. That's something we're going to touch on for a few lectures. And we're also going to talk about additional complications — in deep learning, in some sense, this framework becomes kind of questionable.
[00:24:20] When you do deep learning, of course, some part of the framework still survives — actually most of it survives — but if you really go into the low-level technical stuff, some of the techniques stop making sense, and there are a lot of additional complications. So far everything is still kind of okay, but once you go one level lower, some of the classical techniques don't apply to deep learning. And we're also going to talk a little bit about unsupervised learning, which is somewhat different, but some of these losses are still involved, of course, and generalization and these notions still mostly apply, with a little bit of difference. Okay, so that's the formulation. Now let's move on to asymptotics.
[00:25:25] Before that — any questions? Okay, cool. So what does asymptotic analysis mean? This is a type of analysis where you assume that n, the number of examples, goes to infinity, and you show a bound of the form: excess risk = L(θ̂) − min_{θ∈Θ} L(θ) ≤ C/n + o(1/n). This is our goal. Here C is a constant, but not a universal constant: it's a constant that doesn't depend on n, but could depend on the problem — for example, on the dimension. And o(1/n), as you learned in calculus, is a lower-order term compared to 1/n. So this is the general kind of approach.
[00:27:10] After we talk about this, we're going to move on to the so-called non-asymptotic approach, which I'll discuss afterwards. Okay. (Inaudible exchange with the audience.) Oh, that's a good question. So why do we care about this kind of bound? We want a bound that goes to zero as n goes to infinity, because we want to say that if you have more and more examples, you can do better and better. Whether it's 1/√n or 1/n², that depends on what the truth is; it just turns out that, as we'll see, 1/n is the right bound — you cannot get better, and you shouldn't get worse. Of course it still depends on the setting a little bit, but for the setting we're going to talk about, 1/n is indeed the right rate.
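A quick Monte Carlo illustration of the 1/n rate, in a toy setup I am choosing for illustration (not the lecture's): estimate a Gaussian mean under the squared loss ℓ(θ; x) = ½‖x − θ‖². There the ERM is the sample mean and the excess risk works out to ½‖θ̂ − θ*‖², so multiplying the average excess risk by n should give roughly the same number for every n if 1/n is the right rate.

```python
import numpy as np

rng = np.random.default_rng(2)
p, trials = 4, 2000
theta_star = np.zeros(p)   # true mean; also the population-risk minimizer here

def mean_excess_risk(n):
    # `trials` independent datasets of n samples each, x ~ N(theta_star, I_p)
    X = rng.normal(size=(trials, n, p)) + theta_star
    theta_hat = X.mean(axis=1)                      # ERM for the squared loss
    excess = 0.5 * np.sum((theta_hat - theta_star) ** 2, axis=1)
    return excess.mean()

# n * (excess risk) is roughly constant in n  =>  excess risk ~ C / n.
scaled = [n * mean_excess_risk(n) for n in (50, 200, 800)]
assert max(scaled) / min(scaled) < 1.3
```

If the rate were 1/√n instead, the `scaled` values would grow like √n rather than stay flat.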
[00:28:29] Okay, so now let's get into a slightly more concrete setup. We're going to take Θ = R^p — this is our family of parameters — and θ̂ is, I guess I'm writing this again, the ERM solution. And just for notational convenience, let's define θ* to be the best model in this family: θ* ∈ argmin_{θ∈Θ} L(θ). Note this is with respect to the population risk, not the empirical risk — θ* is the best in terms of the population. And our goal is to bound the excess risk, which is just L(θ̂) − L(θ*). Okay: excess risk. So our goal is to show that L(θ̂) − L(θ*) is small. And a trivial consequence of this definition
is that L(θ*) = min_{θ∈Θ} L(θ). [00:30:01] Okay. So here's the theorem I will prove. Typically in this course I'm going to take the approach of stating the theorem first and then talking about why we have to prove it, or how we prove it. So: we assume consistency. By the way — as I said in the beginning, this part of the lecture is a little bit informal, just because I don't want to get into too many technical details. So what does consistency of θ̂ mean? It means that θ̂ eventually converges to θ* in probability as n goes to infinity. If you're not familiar with what convergence in probability is, it doesn't really matter that much. The reason why you need something slightly different
is because θ̂ is a random variable. If it were just some deterministic quantity as a function of n, then you could define the convergence in the trivial way, but here θ̂ is a random variable, so technically this means convergence in probability. Just in case you're interested — though it's not that important — convergence in probability means that for every ε > 0, the probability that ‖θ̂ − θ*‖ is larger than ε goes to zero as n goes to infinity: lim_{n→∞} P(‖θ̂ − θ*‖ > ε) = 0. But it's not very important for this course; it's perfectly fine to just understand it intuitively. (Exchange with the audience, partly inaudible.) Yeah, exactly. Sorry, again? Yes — we have all of those, yes.
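A Monte Carlo sketch of this definition, in a toy setup of my own (θ̂ is a sample mean converging to its expectation θ*): the probability that θ̂ lands more than ε away from θ* shrinks toward zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star, eps, trials = 0.0, 0.1, 5000

def prob_far(n):
    # Estimate P(|theta_hat - theta*| > eps) over `trials` repetitions.
    theta_hat = rng.normal(loc=theta_star, size=(trials, n)).mean(axis=1)
    return np.mean(np.abs(theta_hat - theta_star) > eps)

probs = [prob_far(n) for n in (10, 100, 1000)]
# For every fixed eps > 0, the exceedance probability decreases toward 0.
assert probs[0] > probs[1] > probs[2]
```

Note the quantifier order: ε is fixed first, and then the probability of a deviation larger than ε vanishes as n → ∞.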
[00:32:21] Uh, yes — but I think the issue is that it's going to be smaller vertically; I felt this layout is better because there's more shown on the board. And maybe I should also repeat the questions for the Zoom meeting — next time. Okay, cool. [00:32:52] So we also assume that the Hessian of the population loss at θ*, ∇²L(θ*), is full rank. What does that mean? The Hessian — I guess probably most of you have seen it if you've taken CS229 — is just the second-order derivatives, organized into a matrix: the Hessian of a function f is a matrix whose entries are the partial derivatives ∂²f/(∂θ_i ∂θ_j), and it's a p × p matrix if f is a function that maps R^p to R.
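For concreteness, a Hessian can be approximated numerically with central finite differences; a small sketch (the quadratic test function is my own example, chosen so the true Hessian is known):

```python
import numpy as np

def hessian(f, theta, h=1e-5):
    """Finite-difference Hessian of f: R^p -> R at theta.

    H[i, j] approximates d^2 f / (d theta_i d theta_j).
    """
    p = theta.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            e_i, e_j = np.eye(p)[i], np.eye(p)[j]
            H[i, j] = (f(theta + h*e_i + h*e_j) - f(theta + h*e_i - h*e_j)
                       - f(theta - h*e_i + h*e_j) + f(theta - h*e_i - h*e_j)) / (4*h*h)
    return H

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
f = lambda t: 0.5 * t @ A @ t       # Hessian of this quadratic is A everywhere
H = hessian(f, np.zeros(2))

assert np.allclose(H, A, atol=1e-4)
# "Full rank" means no zero eigenvalues, i.e. the Hessian is invertible.
assert np.linalg.matrix_rank(H) == 2
```

In the lecture's setting, the f playing this role is the population loss L, evaluated at θ*.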
[00:33:48] And there are also some other regularity conditions, which I'm not even going to state, because they're probably not super important for this course — for example, they involve something like the gradient being finite, things like that. Under these assumptions, you can know a lot of things about θ̂. The first thing you know is that √n·(θ̂ − θ*) is bounded — it is O_P(1).
[00:34:32] I'm going to define O_P(1) in a moment, but roughly speaking this is just saying that θ̂ − θ* is on the order of 1/√n: if you multiply θ̂ − θ* by √n, it becomes on the order of a constant. So what is this O_P(1)? Again, it's not super important for the course — you can just think of it as O(1), as in most standard CS courses. The precise detail is that it means "bounded in probability": a random variable X_n indexed by n is O_P(1) if for every ε > 0 there exists a bound M such that sup_n P(‖X_n‖ > M) ≤ ε. (You can think of the sup as a max, if you're not familiar with it.)
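A sketch of what O_P(1) looks like in practice (toy setup of my own, not the lecture's): with θ̂ a sample mean of standard Gaussians, √n(θ̂ − θ*) is exactly N(0, 1) for every n, so for a large enough threshold M the exceedance probability stays uniformly small across n.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, M, trials = 0.0, 3.0, 5000

def prob_exceed(n):
    # Estimate P( sqrt(n) * |theta_hat - theta*| > M ) by Monte Carlo.
    theta_hat = rng.normal(loc=theta_star, size=(trials, n)).mean(axis=1)
    return np.mean(np.sqrt(n) * np.abs(theta_hat - theta_star) > M)

# "Bounded in probability": one M works for all n; the exceedance probability
# (about 0.003 here, since sqrt(n)(theta_hat - theta*) ~ N(0,1)) stays small.
assert all(prob_exceed(n) < 0.02 for n in (25, 100, 400))
```

Without the √n rescaling the deviation would shrink to zero; with it, the spread neither shrinks nor blows up — that is the content of √n(θ̂ − θ*) = O_P(1).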
[00:35:39] This probability is going to be very small eventually, as n goes to infinity — but if you're not familiar with all of these details, just think of it as O(1). (Question about whether the minimizer is unique.) Actually, that the minimizer is unique is already assumed, in some sense, when I define θ* — again, I'm being pretty informal here, but I'm already assuming the minimizer is unique. And indeed, for the minimizer to be unique I think you need the Hessian to be full rank; but the Hessian being full rank doesn't mean the minimizer is unique. Okay, sounds good — any other questions? So the most important thing here is that you somehow know how far θ̂ is from θ*: it's something like 1/√n, as n goes to infinity. And you also know how close the population risk of the minimizer θ̂ is
to the population risk of the best model, θ* — and how different they are. They differ in this sense: if you multiply the difference by n, you get a constant, which is pretty much just saying that L(θ̂) − L(θ*) is something like 1/n. [00:37:12] And actually you know more than this: you also know the distribution of θ̂ − θ*. Now θ̂ − θ* is a vector, and if you multiply it by √n it's on the order of a constant — but you also know what the distribution of this random variable is as n goes to infinity: it converges in distribution to a Gaussian with mean zero and some covariance. And this covariance is complicated — let me write it down.
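The covariance written on the board here is not captured in the captions. For reference, the classical asymptotic-normality statement this refers to — the "sandwich" formula from standard M-estimation theory, reconstructed from that standard result rather than read off the board — is:

```latex
\sqrt{n}\,\bigl(\hat\theta - \theta^\star\bigr)
  \;\xrightarrow{\,d\,}\;
  \mathcal{N}\!\Bigl(0,\;
    \nabla^2 L(\theta^\star)^{-1}\,
    \mathrm{Cov}\bigl[\nabla_\theta\, \ell\bigl((x,y),\theta^\star\bigr)\bigr]\,
    \nabla^2 L(\theta^\star)^{-1}\Bigr)
```

Here ∇²L(θ*) is the population Hessian assumed full rank above, and the middle factor is the covariance of the per-example loss gradient at θ*.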
[00:38:06] (Writing on the board.) All of these are in the lecture notes, so you don't necessarily have to take notes if you don't want to. So, how to interpret this covariance? I think it's not really interpretable for the moment, but the point is that it's a Gaussian distribution after scaling by √n: if you don't scale by √n, the deviation is going to get smaller and smaller, but if you scale by √n, then it's a Gaussian distribution with a fixed covariance — and, at the least, with mean zero, so θ̂ is centered around θ*. That's very good news. And you also know something about the distribution of the excess risk: we've said that the excess risk, as a random variable, is on the order of 1/n.
[00:39:07] This is what we talked about — but you also know exactly what the distribution is. The distribution is actually complicated to state, but let me do it. First you define a random variable — call it S — a Gaussian random variable with some covariance. The exact detail here also doesn't matter that much; it comes out of the derivation — you derive it, and you find that this is exactly the right thing. But the point is that with this random variable defined, you know that the excess risk, L(θ̂) − L(θ*), multiplied by n, converges in distribution to (half) the squared norm of this Gaussian random variable, ½‖S‖². And you also know the expectation of this, if you really want it, which is something like on the order of p/(2n).
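The p/(2n) behavior can be sanity-checked by Monte Carlo in a toy well-specified problem (my illustration, not the lecture's): for ℓ(θ; x) = ½‖x − θ‖² with x ~ N(μ, I_p), we have θ* = μ, the ERM θ̂ is the sample mean, and the excess risk is ½‖θ̂ − μ‖², whose expectation is exactly p/(2n).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, trials = 5, 200, 4000

X = rng.normal(size=(trials, n, p))             # mu = 0, identity covariance
theta_hat = X.mean(axis=1)                      # ERM for squared loss, per trial
excess = 0.5 * np.sum(theta_hat ** 2, axis=1)   # L(theta_hat) - L(theta*)

# E[excess risk] should match p / (2n) = 0.0125 up to Monte Carlo error.
assert abs(excess.mean() - p / (2 * n)) < 2e-3
```

Here n · excess is (half) a chi-squared with p degrees of freedom — a concrete instance of the "squared norm of a Gaussian" limit just described.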
[00:40:18] And you even know the constant. Okay — so all of these formulas don't necessarily matter that much, because you derive them and you get this; but the point is that you almost know everything: you know everything about θ̂ and you know the distribution of θ̂; you know L(θ̂) and you know the distribution of L(θ̂). It's very powerful. And you can make all of this formal if you want. Any questions so far? (Question, partly inaudible: "so is that a property of — what is...?") Yeah, so I guess my understanding is the question is: what is the consistency assumption a property of? Is it a property of the problem? Yes, that's correct — it's a property of the problem, meaning it's a property of the model parameterization. (Inaudible follow-up.) I have no idea.
[00:41:54] Uh, sorry, you are not following why this is true? [00:42:05] Um, I guess maybe we can talk about this offline, is that okay? Yeah, just come to me after the class. [00:42:10] Um, but one thing, for everybody: you are not expected to see why this is actually right. These are just statements saying, okay, this can be done mathematically. I'll show you something about how to derive this, at least somewhat informally. [00:42:25] And the proof technique is actually pretty simple. The calculation is a little bit tricky, a little bit complicated, you have to work through it, but the fundamental idea is very simple. [00:42:35] Um, yeah, so far I'm only stating that these are all correct, that you can prove all of this, and that's the only thing I'm saying so far. [00:42:44] [Student question] And are these assumptions very strong, or are they easily satisfied by typical
problems? [00:42:52] Yeah, so, for example, the consistency assumption, right? Yes, that's a very good question. So far we've seen this very strong statement: you know everything, right? So something probably should go wrong, because otherwise we would have solved all the problems. There's no linearity assumption; it works for nonlinear models, right? [00:43:05] So I think the problem is that the consistency assumption is a little bit tricky if you don't have n going to infinity. You really have to have n be really, really big, and then you can somewhat have consistency. [00:43:22] And I think basically the whole problem, the limitation of this theorem, is that you need to let n go to infinity, and you really need a very, very big n to potentially see this effect. [00:43:35] So we're going to discuss this a little
bit more after we move on to the non-asymptotic methods, but yeah, that's a kind of trailer. [00:43:45] So, right, when you let n go to infinity you have super powerful tools, in some sense. [00:43:53] But still, these are actually reasonable characterizations for many cases, so it's not like they are completely off from reality. I guess they are just not necessarily that applicable to modern practice, because there we don't have n going to infinity: you have a million data points in ImageNet, but your number of parameters is like 10 million, so n is not going to infinity with the parameterization fixed. So that's going to be the next part of the lecture, to some extent. [00:44:31] And actually, when we really prove it, if you do a very formal proof, you are going to prove three and four
first, and then do one and two. [00:44:40] Uh, okay, I think I have 15 minutes, right? Yeah, I have 15 minutes. [00:44:52] Okay, so what I'm going to do in the next 15 minutes is show a kind of informal proof of one and two. [00:45:00] Um, and next time I'm going to do a slightly more formal proof of three and four, and then we'll be done with the asymptotics, [00:45:07] and then we move on to the more non-asymptotic stuff. Okay. [00:45:15] So this is actually the proof. The key of the proof is two things. One thing is that you want to do a Taylor expansion around θ*. [00:45:30] And the second thing is that you want to somehow use the fact that L̂ is close to L, and that ∇L̂ is close to ∇L (∇L̂ is the empirical gradient and ∇L is the population gradient), and this is by the law of large numbers. [00:45:48] Okay, I'll elaborate on this, but I guess the most important thing is really the Taylor
[00:45:55] expansion: once you can work in the neighborhood of something, then everything becomes somewhat easy. Okay, so now let's talk about how to really do it. [00:46:07] Um, so when you do the Taylor expansion, the starting point is the following. You care about θ̂, and what you know about θ̂ is that 0 = ∇L̂(θ̂): the gradient of the empirical loss at θ̂ is equal to zero. This is because θ̂ is the minimizer; if you are the minimizer, then the stationarity condition tells you that the gradient is zero. [00:46:35] But you want to relate this to L, because everything is easier when you do it with L, because L is the population loss. [00:46:48] So basically the whole idea is that you want to relate θ̂ to θ* and L̂ to L. [00:46:54] So the
first thing is that we try to relate this to θ*. [00:46:59] So you can write this as a Taylor expansion around θ*, with θ* as the reference point: the zeroth-order term is ∇L̂(θ*), and the first-order term involves the Hessian of the empirical loss,

0 = ∇L̂(θ̂) = ∇L̂(θ*) + ∇²L̂(θ*)(θ̂ − θ*) + higher-order terms.

[00:47:19] So this is the Taylor expansion for a multi-dimensional function, but it's exactly the same as in the scalar case; it's just that you have to deal with some matrices. So maybe just a small remark here. [00:47:39] What I'm doing here is Taylor-expanding something like ∇g(z + ε), abstractly speaking. Suppose ε is the small quantity you care about and z is your reference point; then
you can show that the first-order Taylor expansion of this is really something like

∇g(z + ε) ≈ ∇g(z) + ∇²g(z) ε,

where ∇²g(z) is a matrix and ε is a vector. [00:48:14] And how do you verify this? You can do it for each dimension individually, and you get this equation. This is kind of intuitive as well, because the Hessian is the gradient of the gradient. [00:48:32] So this is the first-order Taylor expansion. Okay, so, any questions? [00:48:40] Okay, so now, after doing the Taylor expansion, you know that the left-hand side is zero, and then you can rearrange: put this term on the left-hand side, and what you get is that

∇²L̂(θ*)(θ̂ − θ*) ≈ −∇L̂(θ*) + higher-order terms.
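As a quick aside, the abstract expansion ∇g(z + ε) ≈ ∇g(z) + ∇²g(z)ε used in this step is easy to check numerically on a toy function (my own example, not from the lecture); the error should be second order in the size of ε.

```python
import numpy as np

# Check grad g(z + eps) ≈ grad g(z) + Hess g(z) @ eps on a toy function
# g(z) = sum(z_i^4) / 4, whose gradient and Hessian are known in closed form.
def grad_g(z):
    return z ** 3

def hess_g(z):
    return np.diag(3 * z ** 2)

rng = np.random.default_rng(1)
z = rng.standard_normal(4)            # reference point
eps = 1e-5 * rng.standard_normal(4)   # small perturbation

exact = grad_g(z + eps)
linearized = grad_g(z) + hess_g(z) @ eps
# The first-order expansion's error is O(||eps||^2), far smaller than ||eps||.
print(np.max(np.abs(exact - linearized)))
```

Because the error term involves ε², shrinking ε by a factor of 10 shrinks the discrepancy by roughly a factor of 100, which is exactly the first-order Taylor behavior the lecture appeals to.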
[00:49:14] And now, assuming ∇²L̂(θ*) is full rank, you can take the inverse of the Hessian, so you can see that

θ̂ − θ* = −∇²L̂(θ*)⁻¹ ∇L̂(θ*) + higher-order terms.

[00:49:40] [Student question] Sorry, ah, that's my bad: it's still L̂ so far. Okay, cool. [00:49:50] Um, and yeah, that's exactly the right point: so now I need to change all the hats, all the L̂'s, to L. [00:50:00] And what do I know? So basically I want to change ∇²L̂ to ∇²L, and I want to change ∇L̂ to ∇L as well, and I also need to consider the differences between them. So how do I do that? [00:50:17] Um, so at least I know a few things, right? I know that E[L̂(θ*)] = L(θ*). I know that the expectation E[∇L̂(θ*)] is also equal to ∇L(θ*),
assuming enough regularity conditions so that you can switch the gradient with the expectation. And you also have something like E[∇²L̂(θ*)] = ∇²L(θ*). [00:50:48] And ∇L(θ*) is equal to zero, because θ* is the minimizer of L; that's why this is zero. [00:50:57] And ∇²L(θ*) is a p × p matrix which is full rank, as we assume. [00:51:06] So, basically, this is saying... and also, because ∇²L̂(θ*) is an average of n i.i.d. terms, right, because it's

∇²L̂(θ*) = (1/n) Σᵢ ∇²ℓ((xᵢ, yᵢ), θ*),

so it's a sum of i.i.d. terms, you can use the law of large numbers [00:51:41] to say that ∇²L̂(θ*) converges to ∇²L(θ*). [00:51:55] And similarly, you also know that (sorry, what am I doing here, my bad) ∇L̂(θ*) converges to ∇L(θ*). [00:52:15] Okay. [00:52:23] Okay, and moreover, you can also
get something a bit more accurate about this convergence. Here we've only shown that it converges, but you can also know how large the difference is. [00:52:39] You know that if you look at the difference between ∇L̂(θ*) and ∇L(θ*), scaled by √n, this will be a Gaussian distribution with mean zero and covariance Cov(∇ℓ((x, y), θ*)). [00:53:09] I guess this is because... I guess I'm using the central limit theorem here; maybe I should first review the central limit theorem. [00:53:24] So, setting up notation: suppose X̄ = (1/n) Σᵢ xᵢ, where the xᵢ are i.i.d. from some distribution D, and let's say xᵢ is d-dimensional, and let's say Σ is the covariance of xᵢ. [00:53:48] Then you know that, as n goes to infinity, X̄ converges in probability to the expectation E[x]. [00:54:04] Right, that's the law of large
numbers. [00:54:06] And then the more accurate thing is that you can look at the difference between X̄ and E[x], and you know that if you scale the difference by √n, then this converges to a Gaussian distribution:

√n (X̄ − E[x]) → N(0, Σ) in distribution.

First of all, it's an order-one quantity, and second, you know the distribution: mean zero and covariance Σ. [00:54:32] And in some sense this is saying, informally, that X̄ − E[x] is on the order of 1/√n. [00:54:45] Okay, so this is the central limit theorem, and what we are doing in the equation above is basically applying the central limit theorem where xᵢ corresponds to ∇ℓ((xᵢ, yᵢ), θ*), the gradient of the loss at example i. [00:55:10] Okay.
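The CLT statement just reviewed can be simulated directly (a toy example of my own, not from the lecture): drawing many sample means and scaling by √n, the empirical covariance of √n(X̄ − E[x]) should be close to Σ.

```python
import numpy as np

# CLT illustration: x_i in R^2, i.i.d. with known mean mu and covariance Sigma.
rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
C = np.linalg.cholesky(Sigma)
mu = np.array([1.0, -1.0])

n, trials = 2000, 2000
scaled = []
for _ in range(trials):
    x = mu + rng.standard_normal((n, 2)) @ C.T  # x_i with mean mu, cov Sigma
    scaled.append(np.sqrt(n) * (x.mean(axis=0) - mu))

emp_cov = np.cov(np.array(scaled).T)
print(emp_cov)  # should be close to Sigma
```

In the lecture's application the role of xᵢ is played by the per-example gradient ∇ℓ((xᵢ, yᵢ), θ*), so Σ becomes the covariance of that gradient.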
[00:55:14] So basically we have done some of these preparations: we know how different ∇L̂ is from ∇L, and we also know that the Hessian converges. And now we can come back to this important equation, [00:55:29] and we are kind of ready to get something real. So let me rewrite it:

θ̂ − θ* = −∇²L̂(θ*)⁻¹ ∇L̂(θ*) + higher-order terms.

(Wait, is that right? No, let me copy this... sorry, my bad, there it is.) [00:56:07] So the first factor is close to ∇²L(θ*)⁻¹; that's the first thing we know. And also we know that the second factor, ∇L̂(θ*), is roughly speaking ∇L(θ*) plus something of order 1/√n. [00:56:26] Um, and since ∇L(θ*) is equal to zero, if you put both of these together, you get that θ̂ − θ* is on the order of 1/√n.
[00:56:48] Let's take a question first, because this takes a little bit of time. [00:56:56] [Student question about X̄ and E[x]] Uh... sorry, my bad, wait, why... oh, I guess, yes: I'm thinking of x as also drawn from D. Um, so maybe I should either use xᵢ, or let's say x is a generic variable that is drawn from the same distribution D, [00:57:25] so that the expectation of x is the same as the expectation of each xᵢ. Right? Yeah, okay, yes, I was missing that. [00:57:39] Um, okay, so maybe I'll just do this a little bit more carefully. So I'm basically trying to replace the L̂ with L, right? So the first thing is the gradient: [00:57:58] I guess using the equation above (maybe let's call it equation (1)), ∇L̂(θ*) is roughly equal to ∇L(θ*), which is zero, plus a term of order 1/√n, so this is roughly of order 1/√n. [00:58:12] So if you don't care too much about the
vector-versus-scalar distinction, then you get this order-of-magnitude statement. [00:58:20] And the other factor is kind of close to a constant: ∇²L̂(θ*)⁻¹ converges to ∇²L(θ*)⁻¹, which is a constant. [00:58:29] So these two things together will give you, maybe more explicitly,

θ̂ − θ* ≈ −∇²L(θ*)⁻¹ × O(1/√n).

[00:58:59] Of course, just to clarify, this is not exactly formal, because I'm ignoring a lot of things; for example, this is a vector, and this 1/√n is really a vector, not a scalar, but its norm is of that order. [00:59:12] So that's how you get that θ̂ − θ* is on the order of 1/√n.
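One clean way to see the formula θ̂ − θ* ≈ −∇²L̂(θ*)⁻¹ ∇L̂(θ*) in action (my own example, not from the lecture): for the squared loss the empirical loss is exactly quadratic, so the Taylor expansion has no higher-order terms at all and the identity holds exactly.

```python
import numpy as np

# For L_hat(theta) = (1/2n) * ||y - X theta||^2 the loss is exactly quadratic,
# so theta_hat - theta* = -Hess(theta*)^{-1} @ grad L_hat(theta*) with no
# higher-order terms.
rng = np.random.default_rng(3)
n, p = 500, 3
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # empirical risk minimizer

grad_at_star = -(X.T @ (y - X @ theta_star)) / n   # grad of L_hat at theta*
hess_at_star = (X.T @ X) / n                       # empirical Hessian
predicted = -np.linalg.solve(hess_at_star, grad_at_star)

print(np.allclose(theta_hat - theta_star, predicted))  # True
```

For non-quadratic losses (logistic regression, say) the same identity holds only up to the higher-order terms, which is exactly what the heuristic argument in the lecture is tracking.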
[00:59:16] And also, heuristically, if you really care about L(θ̂) − L(θ*), the excess risk, you can also do a Taylor expansion. You Taylor-expand around θ*, and you get

L(θ̂) − L(θ*) = ∇L(θ*)ᵀ(θ̂ − θ*) + ½ (θ̂ − θ*)ᵀ ∇²L(θ*) (θ̂ − θ*) + higher-order terms.

(Wait, why do I have so many typos in my notes? Sorry, my bad; so this is a hat.) [00:59:54] Here the interesting thing is that if you do only a first-order Taylor expansion you're going to get zero, so you have to do a second-order expansion. [01:00:13] Okay, so the reason I need to do the second-order expansion is that the first-order term is zero, because ∇L(θ*) = 0, since θ* is the minimizer of L. [01:00:26] Right, so that's why we have to look at the second-order expansion, and if you want to roughly see how large the second-order term is, you can see that each of
these (θ̂ − θ*) factors is of order 1/√n, so basically the second-order term will be something like 1/n, plus higher-order terms. [01:00:50] Okay, so this is a heuristic, some kind of heuristic proof, to show why θ̂ − θ* is of order 1/√n and, in terms of the loss, the excess risk is on the order of 1/n. [01:01:05] Any questions so far? [01:01:11] [Student question] [01:01:13] So, consistency is needed in almost every step. [01:01:28] [Student question] I'm using the central limit theorem on the random variable, not a functional version of it, because I'm not sure whether that would... by the way, I forgot to repeat the question, but I'll remember that next time. So the question was whether the central limit theorem is applied to the random variable itself. [01:01:49] I think so, because the xᵢ here is this xᵢ, right, so it corresponds to the gradient: ∇ℓ at example i is my random variable.
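The two rates from the heuristic argument above, ‖θ̂ − θ*‖ of order 1/√n and excess risk of order 1/n, can be eyeballed in a toy linear-regression simulation (my own setup, not from the lecture): quadrupling n should roughly halve the average parameter error and cut the average excess risk to about a quarter.

```python
import numpy as np

# Toy rate check: ||theta_hat - theta*|| ~ 1/sqrt(n), excess risk ~ 1/n.
rng = np.random.default_rng(4)
p = 4
theta_star = rng.standard_normal(p)

def avg_error(n, trials=300):
    gaps, excess = [], []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        y = X @ theta_star + rng.standard_normal(n)
        d = np.linalg.lstsq(X, y, rcond=None)[0] - theta_star
        gaps.append(np.linalg.norm(d))
        excess.append(0.5 * d @ d)  # excess risk when Cov(x) = I
    return np.mean(gaps), np.mean(excess)

g1, e1 = avg_error(1000)
g2, e2 = avg_error(4000)
print(g1 / g2, e1 / e2)  # roughly 2 and roughly 4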
[01:02:04] So that's how I get it: the sum of the xᵢ then corresponds to the empirical gradient, [01:02:15] and, right, the expectation corresponds to the population gradient. [01:02:29] [Student question] Yeah, yeah, I think you need a lot of different kinds of regularity conditions to make all of this work. Because, for example, there's also an implicit step that I didn't go through, which is the inverse, right? I only showed that the empirical Hessian converges to the population Hessian; you also need to show that the inverse of the empirical Hessian converges to the inverse of the population Hessian. [01:03:01] So that's another thing you would want to deal with formally. [01:03:08] So, yeah, every time I give this... I've taught this like two
or three times and every time there are a lot of questions about this first [01:03:15] are a lot of questions about this first lecture on [01:03:17] lecture on I still haven't figured out a better way [01:03:19] I still haven't figured out a better way to teach it but I think maybe the better [01:03:21] to teach it but I think maybe the better the thing is just that [01:03:24] like I really want to convey this convey [01:03:27] like I really want to convey this convey this idea like the idea is that you can [01:03:28] this idea like the idea is that you can do third expansion you can pretty much [01:03:30] do third expansion you can pretty much do a lot of heuristic stuff and all of [01:03:32] do a lot of heuristic stuff and all of them can be made formal and how to [01:03:34] them can be made formal and how to exactly make it formal it's a little bit [01:03:36] exactly make it formal it's a little bit tricky as you know there's all great [01:03:38] tricky as you know there's all great questions right all the questions are [01:03:40] questions right all the questions are welcome but just just to set up the [01:03:42] welcome but just just to set up the expectation this is not mean to have a [01:03:45] expectation this is not mean to have a very formal [01:03:47] very formal um derivation here [01:03:52] okay so I think that's uh that's all for [01:03:54] okay so I think that's uh that's all for today so [01:03:56] today so um next time we're gonna make this a [01:03:58] um next time we're gonna make this a little bit formal Maybe for 15 minutes [01:04:00] little bit formal Maybe for 15 minutes and then we can move on to other things ================================================================================ LECTURE 002 ================================================================================ Stanford CS229M - Lecture 2: Asymptotic analysis, uniform convergence, Hoeffding inequality Source: https://www.youtube.com/watch?v=Fx3xldCEfsM --- Transcript 
[00:00:05] Okay, cool, let's get started. [00:00:10] Uh, okay, it's kind of complicated — it's kind of amazing, right, this technology is so advanced that you can do all of these things together, but I still have to do them one by one. I have like 10 action items, maybe more than 10. I also need to connect to the Wi-Fi — that's actually something I have to do. [00:00:28] Okay, but let's get started — oh, I need to have my notes. So what we're going to do today is continue with the asymptotics from last time, for about 15 to 20 minutes; this is just to wrap up what we discussed. [00:00:46] And as I said, this first lecture is always a little bit tricky for me to teach, because the tools there, if you want to make it formal, require some background; and if you don't want to make it formal, sometimes there's a little bit of confusion. [00:01:01] So from the second half of this lecture, I think we're going to talk about things that require less background in some sense, and it's more self-contained. [00:01:12] Okay, so the plan is: asymptotics, and then the so-called uniform convergence — I'll define what it is — and uniform convergence will be the main focus for the first few weeks of the course. [00:01:28] Okay, so let's start by reviewing what we did last time. What we had last time was this theorem, where we showed that if you assume consistency — which is something we basically just assume without much justification; it's not always true, and it depends on the problem — consistency basically means that theta hat will converge to theta star, where theta hat is the ERM, the empirical risk minimizer, and theta star is the minimizer of the population risk. So you care about recovering theta star, or recovering something as good as theta star. [00:02:11] And we also assume a bunch of other things — for example, that the Hessian is full rank, and also some regularity conditions which I didn't even define exactly; for example, this requires something like the variances being finite, so that you can apply the theorems. And then, under these assumptions, we have the following. [00:02:35] (Actually, it's kind of challenging for me, because this podium, after I raise it, becomes unstable — I feel like I'm writing on a boat. But it's probably good for me to practice; I'd be better at some sports after doing this, I guess.) [00:02:57] Okay, so I guess
we have discussed that — in distribution, you know — the order of the difference between theta hat and theta star is on the order of one over square root n; formally, you scale by square root n, and you know that √n (θ̂ − θ*) is of constant order. [00:03:13] And you also know something about the loss: the excess risk, L(θ̂) − L(θ*), is on the order of one over n; formally, you scale by n and say that n (L(θ̂) − L(θ*)) is of constant order. [00:03:31] And also you know that the distribution of √n (θ̂ − θ*) converges to a Gaussian distribution with mean zero and some covariance. This covariance is kind of complicated, but let me write it — something like

∇²L(θ*)⁻¹ Cov(∇ℓ((x, y), θ*)) ∇²L(θ*)⁻¹.

[00:03:51] This is just reviewing what we wrote last time. And, four, we also know the distribution of the excess risk — this is the distribution of a scalar, because the excess risk is a scalar. If you scale it by n, then the distribution of n (L(θ̂) − L(θ*)) converges to the distribution of a random variable defined in terms of a Gaussian random variable with mean zero and covariance something like the above, but not exactly. [00:04:27] You don't have to remember exactly what the covariance here is — you know, I don't even remember it if I don't read my notes. There are some intuitions about this, which I'm going to discuss, but generally this is just something you get from the derivations. [00:04:42] So last time we roughly justified number one and number two, and today I'm going to give a relatively heuristic proof for three and four, just very quickly, so that we can wrap this up.
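As a heuristic numerical check of points one and two — a sketch, not from the lecture, using the sample mean as the ERM for a squared-loss model with made-up constants — averaging over repeated training sets, √n · |θ̂ − θ*| and n · (excess risk) hover near a constant while the unscaled quantities shrink:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.5

def pop_risk(theta):
    # L(theta) = E[0.5 * (theta - x)^2] with x ~ N(theta*, 1)
    #          = 0.5 * (theta - theta_star)^2 + 0.5
    return 0.5 * (theta - theta_star) ** 2 + 0.5

for n in [100, 400, 1600, 6400]:
    param_err, excess = [], []
    for _ in range(200):  # average over 200 independent training sets
        x = rng.normal(theta_star, 1.0, size=n)
        theta_hat = x.mean()  # ERM for squared loss is the sample mean
        param_err.append(abs(theta_hat - theta_star))
        excess.append(pop_risk(theta_hat) - pop_risk(theta_star))
    # both scaled quantities stabilize: about sqrt(2/pi) ~ 0.8 and 0.5 here
    print(n, np.sqrt(n) * np.mean(param_err), n * np.mean(excess))
```

The parameter error shrinks like 1/√n while the excess risk shrinks like 1/n, because the risk is locally quadratic around its minimizer.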
[00:05:00] So, just to very quickly review what we did last time: the key idea to derive all of this is by doing Taylor expansion. I think the key equation — let me just rewrite what we did last time — is this: you look at ∇L̂(θ̂), the gradient of the empirical loss at theta hat. This is guaranteed to be zero, because theta hat is the minimizer of the empirical loss. And you try to expand this around theta star, and you get something like

0 = ∇L̂(θ̂) = ∇L̂(θ*) + ∇²L̂(θ*) (θ̂ − θ*) + higher-order terms.

[00:05:39] And then you rearrange this and get

θ̂ − θ* = −∇²L̂(θ*)⁻¹ ∇L̂(θ*) + higher-order terms,

the inverse of the empirical Hessian times the empirical gradient at theta star. [00:06:03] And then you say: I'm going to replace all the hats — L̂ by L — using some kind of law of large numbers, or uniform convergence. And last time we roughly discussed that this ∇L̂(θ*) is on the order of one over square root n: because you have concentration, this average of per-example gradients is roughly ∇L(θ*) — sorry, this is theta star — plus noise on the order of one over square root n, which is roughly on the order of one over square root n; and this one, the inverse Hessian, is converging to a constant. So that's why the whole thing is converging to something on the order of one over square root n. [00:06:38] And this time we're going to make it a little bit more formal: we'll get the exact distribution of theta hat minus theta star. I'll make this part really quick, just so that, if you're not familiar with the background, you don't get confused too much. [00:06:56] Um, so the idea is: if you look at what's the distribution of this — if you think about it, this is the product of two random variables, and you roughly know what the distribution of each of the random variables is, right? So this one is going to converge to a constant, which is ∇²L(θ*)
inverse — ∇²L(θ*)⁻¹ — and this one is going to be a Gaussian distribution if you scale it correctly, right? [00:07:22] So basically what you need to know is: what's the distribution of the product of two random variables, when you know what happens with each of them? [00:07:37] And formally, what you do is you first scale everything so that each of these two random variables is on the order of one, so that you can reason about them easily. So you scale by square root n, and you get

√n (θ̂ − θ*) = −∇²L̂(θ*)⁻¹ · √n (∇L̂(θ*) − ∇L(θ*)) + higher-order terms.

[00:08:01] So you scale this empirical gradient by square root n, and on the left you also get square root n; and let's also fill in the population gradient, which is zero — this one is zero, I just write it here to make it closer to something you know. [00:08:15] And then this "plus higher-order terms" — this is still a higher-order term even when you multiply by square root n. I think there was a typo about this in the lecture notes, which somebody pointed out, which is really nice; but still, you know, even when you multiply by square root n, it is still lower order compared to the other terms, right? [00:08:33] And now this one — let's call it Z, this √n (∇L̂(θ*) − ∇L(θ*)) — and Z, by the law of large numbers — or, I mean, by the central limit theorem — Z is a Gaussian. With what covariance? The covariance will be

Cov(∇ℓ((x, y), θ*)).

[00:09:03] Why? This is just because: what is ∇L̂(θ*) minus ∇L(θ*)? This is really just the empirical version minus the population gradient, so this is really

(1/n) Σᵢ ∇ℓ((xᵢ, yᵢ), θ*)

minus the expectation of
the same thing — maybe, for simplicity, let's just write (x, y) here, at theta star. [00:09:40] Right, so you apply the central limit theorem: you know that if you scale this by square root n, then you get a Gaussian distribution. [00:09:48] Right, so that's why we know the random variable Z has a Gaussian distribution; and we know this one — the inverse of the empirical Hessian — will converge to a constant as n goes to infinity. [00:09:58] And there is a theorem that specifically deals with this; but actually, you know, if you think about it, this makes a lot of sense: if you want to know the left-hand side, it basically just becomes the distribution of the right-hand side, which is a constant times a Gaussian distribution. Okay, so — this constant times the Gaussian distribution — basically we have to figure out: what is the distribution of a constant times Z?
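Before the lemma is stated, here is a quick numerical illustration of this exact question (a sketch with made-up A and Σ, not from the lecture): push samples of a mean-zero Gaussian Z through a fixed linear map A and compare the sample covariance of AZ with A Σ Aᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])       # the "constant" linear map
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # covariance of Z

# sample Z ~ N(0, Sigma) and push it through the map
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
AZ = Z @ A.T

emp_cov = np.cov(AZ, rowvar=False)  # empirical covariance of A Z
print(np.round(emp_cov, 2))
print(A @ Sigma @ A.T)              # prediction: A Sigma A^T = [[8, 9], [9, 18]]
```

The sample covariance matches A Σ Aᵀ up to sampling noise — the "left-multiply by the transformation, right-multiply by its transpose" rule used in the next step.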
[00:10:29] So, basically, abstractly speaking, the question we're dealing with here is — let me use a different color for the abstraction — basically you are asking: what is the distribution of A·Z, if A is a constant and Z is from some Gaussian distribution with covariance Σ? [00:10:59] Right, so — I'm missing this page — [00:11:05] and you know that there's a lemma which says that, in this case, A·Z is also a Gaussian distribution, with mean zero and covariance A Σ Aᵀ. [00:11:21] I think this is a homework question — a homework-zero question; I'm not sure whether it's still there, I forgot to double-check. But this is something you can do: a linear transformation of a Gaussian distribution is still a Gaussian distribution; it's just that the covariance gets transformed, and the way to transform the covariance is that you left-multiply by the transformation and right-multiply by its transpose, and you get the new covariance. [00:11:46] So this is something — you know, it's not that cheap to derive this, but it's something you can either look up in a book or derive yourself. [00:11:55] Right, so with this small lemma, then we know the distribution: √n (θ̂ − θ*) converges to a Gaussian distribution with mean zero, where A — so here A corresponds to −∇²L(θ*)⁻¹, right — and Σ corresponds to this one, Cov(∇ℓ((x, y), θ*)); and you just plug in these two choices, and then we get

√n (θ̂ − θ*) → N(0, ∇²L(θ*)⁻¹ Cov(∇ℓ((x, y), θ*)) ∇²L(θ*)⁻¹).

Okay — this is convergence in distribution. [00:13:06] Any questions so far? [00:13:15] (I realized that my camera is frozen; I don't know why. Something seems
to be wrong. [00:13:46] For those people who are in the Zoom meeting — you see my video? It's kind of frozen? I see, thanks; maybe let me turn it off and then turn it back on. Okay, it seems to be working now? Okay, cool — you can see everything now? Okay, thanks.) [00:14:16] Okay, cool. Any questions? Also, if you're on the Zoom meeting, feel free to just unmute and ask any questions. [00:14:38] [In response to a question] This is inverse, yeah. So which term — which term am I — this one? [00:14:58] Yes, it's exactly the same one; it's supposed to be the same thing transposed, right, but this is a symmetric matrix, so the transpose is the same as itself — so this is just the inverse. [00:15:19] Okay, so I guess what I'm going to do is skip the proof — the derivation — for number four. It's kind of the same thing; it's just that, because you already know the distribution of theta hat, you should know the distribution of the excess risk: what you do is some Taylor expansion to make it a polynomial in theta hat, and then you can use what you know about theta hat. All of this is in the lecture notes; I guess I'm going to skip this part. [00:15:50] [Student question — partly inaudible] You mean the covariance seems similar to — yeah. Um, I think there's a connection, but I don't feel like — you know, I think this kind of form shows up very often in many different cases, right? So there is some connection, but I don't feel it's so closely related that it's important enough to know, yeah. [00:16:31] Yeah, okay, so I guess I'll skip the proof for number four; if you're interested, you can look at the proof in the lecture notes. [00:16:39] And what I'm going to do is spend another five minutes to talk about a corollary
of this theorem, which is in a more — maybe a more typical setting. Here, this theorem is very general, because it doesn't say anything about the loss function, it doesn't say anything about the model; it works for almost everything, as long as you have the consistency. And here let me specifically instantiate this theorem for the so-called well-specified case, where you use the log-likelihood; and then we can see that all of these covariances become a little bit more intuitive, and things become a little bit easier. [00:17:18] So this is the so-called well-specified case. [00:17:25] So, um, I guess: in addition to Theorem 1, let's also assume — let's suppose — there exists some probabilistic model P(y | x; θ), parametrized by θ, for y given x. So you assume that y is generated from this probabilistic model. [00:18:02] Right, so what does that mean? Basically it means: suppose there exists a θ_⋆ — I'm using the subscript here to differentiate from the θ* defined before, which was the minimizer of the population risk; actually they are the same, but for now they are different — so basically you assume that there exists a θ_⋆ such that the y_i, the data, conditional on x_i, is generated from this probabilistic model:

y_i | x_i ∼ P(y | x_i; θ_⋆).

[00:18:35] Right, so — this is why it's called well-specified: it means that your data is generated from some probabilistic model. [00:18:43] And also, in this case, suppose the loss function is the log-likelihood — before, we didn't really say what the loss function needs to be, right? It could be anything. And now let's say the loss function is the (negative) log-likelihood of this probabilistic model:

ℓ((x, y), θ) = −log P(y | x; θ).

[00:19:00] Think of this as, for example,
logistic regression, or linear regression with Gaussian noise — so the negative log-likelihood could be the cross-entropy loss, or the mean-squared loss, depending on what probabilistic model you have. [00:19:20] So this is your loss function, and when you do this, a bunch of things become nicer in some sense. First of all, you know that θ* is equal to θ_*. Recall that θ* is the minimizer of the population loss, and θ_* is the ground truth — the one that generates our data. In this case you can prove that with infinite data — θ* being the minimizer in the infinite-data case — you recover the ground truth θ_*, so they are exactly the same thing.
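To make the Gaussian-noise example concrete — this is my own sketch, not from the lecture — here is a check that the negative log-likelihood of y ~ N(xᵀθ, 1) is half the squared error plus a constant, so the two losses have the same minimizer (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # one example with 4 features (made-up numbers)
y = 1.7
theta = rng.normal(size=4)

def nll(theta, x, y):
    """Negative log-likelihood of y ~ N(x @ theta, 1) for a single example."""
    return 0.5 * np.log(2 * np.pi) + 0.5 * (y - x @ theta) ** 2

def sq_loss(theta, x, y):
    """Half squared error for the same example."""
    return 0.5 * (y - x @ theta) ** 2

# The difference is the constant (1/2) log(2*pi), independent of theta,
# so minimizing the NLL is exactly least squares.
diff = nll(theta, x, y) - sq_loss(theta, x, y)
print(diff)
```

The same bookkeeping for the Bernoulli model shows that the logistic-regression negative log-likelihood is exactly the cross-entropy loss.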
And you also know a bunch of other things. [00:20:03] For example, you know that the expected gradient over the population at θ* is zero: E[∇ℓ((x, y), θ*)] = 0. This is kind of trivial — I'm just writing it here because I need it as a step in a proof; if you don't care about the proof, it's simply a convenient fact to know. [00:20:23] And you also know the covariance of the gradient — this is the quantity we care about, because in the previous theorem the covariance of the gradient shows up in the variance of the asymptotic distribution of √n(θ̂ − θ*). (From now on we don't distinguish the superscript star θ* and the subscript star θ_*, because they are the same.) And the covariance of the gradient actually happens to be the Hessian: Cov(∇ℓ((x, y), θ*)) = ∇²L(θ*).
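The two facts just stated — zero mean gradient and gradient covariance equal to the Hessian — can be checked numerically. Below is a Monte-Carlo sketch for logistic regression with a made-up ground-truth θ_* (my illustration; the model and numbers are assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 3
theta_star = np.array([0.5, -1.0, 2.0])   # made-up ground truth

X = rng.normal(size=(n, p))
prob = 1.0 / (1.0 + np.exp(-X @ theta_star))
y = rng.binomial(1, prob)

# Per-example gradient of the negative log-likelihood at theta_star:
#   grad_i = (sigma(x_i' theta) - y_i) * x_i
grads = (prob - y)[:, None] * X
mean_grad = grads.mean(axis=0)            # should be ~ the zero vector

# Gradient covariance vs. Hessian E[sigma(1 - sigma) x x'] of the population loss
cov_grad = grads.T @ grads / n            # mean is ~0, so this is ~Cov(grad)
w = prob * (1 - prob)
hessian = (X * w[:, None]).T @ X / n

print(np.abs(mean_grad).max())            # small
print(np.abs(cov_grad - hessian).max())   # small (Monte-Carlo error only)
```

Both printed numbers shrink like 1/√n, consistent with the identities holding exactly at the population level.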
Then the covariance of √n(θ̂ − θ*) can be simplified. [00:21:08] Recall that this used to be a Gaussian distribution whose covariance was a product of three matrices — the inverse Hessian, the gradient covariance, and the inverse Hessian again. But now, what's in the middle is the same as the Hessian — that's what we claimed in point (3) — so you can cancel one inverse against it, and what's left is just the inverse of the Hessian. [00:21:48] And if you plug point (3) into all the statements you had before, you can also get other things — for example, one of the important ones: the excess risk, which we claimed was of order 1/n.
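In symbols, the cancellation goes like this (my rendering, writing $H = \nabla^2 L(\theta^\star)$ for the Hessian and $\Sigma = \mathrm{Cov}(\nabla \ell((x,y),\theta^\star))$ for the gradient covariance):

```latex
\sqrt{n}\,\bigl(\hat\theta - \theta^\star\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\; H^{-1}\,\Sigma\,H^{-1}\bigr),
\qquad \Sigma = H \;\Longrightarrow\; H^{-1}\,\Sigma\,H^{-1} \;=\; H^{-1}\,H\,H^{-1} \;=\; H^{-1}.
```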
Actually, here you can be more precise: you know that n·(L(θ̂) − L(θ*)) converges in distribution to ½ times a chi-squared distribution with p degrees of freedom, where p is the dimension of θ — suppose you have p parameters. So that's the limiting distribution of the (scaled) excess risk. [00:22:35] And if you take the expectation — looking at all the randomness — what you get is E[n · excess risk] → E[½ χ²_p] = p/2. By the way, you don't need to know anything detailed about the chi-squared distribution: χ²_p is basically the distribution of a sum of p squared standard Gaussians. So you know a lot about it — for example it's positive, and its mean is p. If you need to know more
about it, just look it up on Wikipedia — it's very easy; we don't need anything deep. [00:23:21] So the important thing is the last equation: the excess risk in expectation — here the expectation is over the randomness of the dataset — if you don't scale it by n, then (I should really write "converges to" rather than "equals", because it wouldn't be exactly equal) E[L(θ̂) − L(θ*)] → p/(2n). So you not only get the dependency on n; you also get a dependency on p, the dimension. That's the leading order of the excess risk — of course there are higher-order terms, o(1/n). [00:24:06] And you actually also know the variance of the excess risk, which I don't think is super important — the variance is of smaller order than the mean. Okay.
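Here is a toy simulation of these facts — my own example, chosen so that they hold exactly at finite n: estimating the mean of a standard Gaussian, where the MLE is the sample mean, the excess risk is ½‖θ̂ − θ_*‖², and n·(excess risk) is exactly ½χ²_p in distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 5, 100, 10_000
theta_star = np.zeros(p)                       # made-up ground truth

# Model: x ~ N(theta, I_p) with loss 0.5 * ||x - theta||^2, so
# L(theta) - L(theta_star) = 0.5 * ||theta - theta_star||^2 and the MLE is the mean.
X = rng.normal(size=(trials, n, p)) + theta_star
theta_hat = X.mean(axis=1)                     # one MLE per simulated dataset
excess = 0.5 * ((theta_hat - theta_star) ** 2).sum(axis=1)

print(n * excess.mean())                       # ~ p / 2 = 2.5
print(excess.mean())                           # ~ p / (2n) = 0.025
```

Note that the answer depends only on p, not on the particular Hessian — which here is the identity — matching the claim that only the number of parameters survives in this limit.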
The lecture notes have proofs for all of this, but I'm not going to discuss the proofs here. The most important things are these two statements. [00:24:30] The first one is saying that the shape of the randomness of θ̂ — its covariance — is like the inverse of the Hessian: in directions where the Hessian is steeper, you have less stochasticity, and in directions where the Hessian is smaller, you have more stochasticity. [00:24:57] And the last one is saying that it doesn't matter what the Hessian is — if you care about this kind of asymptotic regime, the only thing that matters is p, the number of parameters. We're going to discuss the limitations of all these theorems in a
moment, but that's what we got from this asymptotic approach. [00:25:28] Any questions so far? [00:25:43] Okay, cool. If you're interested in more details, you can take a look at the lecture notes. So now let's move on to uniform convergence — people often call this line of research non-asymptotic analysis. [00:26:05] Let's first discuss why this is actually the approach we're going to take for the rest of the lectures: we're going to care about non-asymptotic instead of asymptotic bounds. Let me define what that is and motivate why we care about it. Recall that when you have asymptotic bounds — like what I wrote above for E[L(θ̂) − L(θ*)] — the final outcome is something like: this equals
p/(2n) + o(1/n). [00:26:43] However, the problem is that you are hiding a lot of things in this little o(1/n): you hide all dependencies other than the dependency on n. So what does that mean? Inside this o(1/n) you can also have a dependency on p. So if you only tell me the asymptotic bound, what could happen is that the real rate is p/(2n) + 1/n², or the real rate could just as well be p/(2n) + p^100/n². [00:27:26] Both of these would be possible situations given the bound above — I would have no way to distinguish them, because the second term is hidden in the little-o notation, and the little-o notation doesn't care about any other dependencies; it only cares about the dependency on n,
at least in the context of asymptotics. [00:27:49] So this is the problem, because clearly, if your rate is the one on the right-hand side, that's a very bad rate. What do I mean by bad? How does the bound depend on p? Suppose your bound is the right-hand one: then it requires n to be bigger than p^50 just for the bound to get small — because you need the second term to be smaller than one, and p^100/n² < 1 requires n to be bigger than p^50.
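The arithmetic behind that n > p^50 claim is easiest in logarithms, since the raw numbers are astronomical. A quick sketch with an assumed p = 10 (my choice, just for illustration):

```python
import math

p = 10.0   # assumed dimension, purely for illustration

# The hidden term p**100 / n**2 drops below 1 exactly when n > p**50:
#   p**100 / n**2 < 1  <=>  2*log(n) > 100*log(p)  <=>  log10(n) > 50*log10(p)
log10_n_needed = 50 * math.log10(p)
print(log10_n_needed)          # 50.0, i.e. you need n > 10**50 samples

# Versus the benign alternative p/(2n) + 1/n**2, where n on the order of p
# already makes both terms small.
log10_n_benign = math.log10(p)
print(log10_n_benign)          # 1.0
```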
[00:28:31] (Inaudible student question.) [00:28:36] Yes — exactly. [00:28:42] Okay, going back to this: the bound on the right-hand side is going to be very bad, while the bound on the left-hand side is pretty good in some sense. But you have no way to distinguish them, because both of these would be converted to p/(2n) + o(1/n) in the asymptotic setting. So that's the biggest problem. [00:29:11] And in some sense, when you have other dependencies — for example on the dimensionality — the dependency on n is not the only thing that matters. A more extreme situation: suppose you compare p/√n versus p/(2n) + p^100/n². Suppose you have these two bounds, and if you use asymptotics — if you
write them in the asymptotic way — then you are going to conclude that the second one is p/(2n) + o(1/n) in the asymptotic language, and the first one will be something like p/√n plus o(1/√n). [00:29:49] So it sounds like the first one is bad, because it has a worse dependency on n — and indeed, when n goes to infinity, the right-hand side is smaller than the left-hand side. But if you think about a more moderate regime of n, it's not really true, because for the bound to be less than one: if you want p/√n to be less than one, that means n is bigger than p²; but if you want p/(2n) + p^100/n² to be less than one, that means n needs to be at least as large as p^50. [00:30:30] So when n goes to infinity, the
left-hand side you know uh it's [00:30:36] the left-hand side you know uh it's worse it's the worst bone but in most of [00:30:37] worse it's the worst bone but in most of the cases the left hand side is actually [00:30:39] the cases the left hand side is actually a better box so if you want P to the [00:30:41] a better box so if you want P to the left hand side to be a better Bond on [00:30:42] left hand side to be a better Bond on the right than the right hand side [00:30:44] the right than the right hand side I can see if you solve this [00:30:46] I can see if you solve this maybe you can even ignore this the [00:30:49] maybe you can even ignore this the um if you solve this this is kind of [00:30:50] um if you solve this this is kind of roughly saying that and if n is smaller [00:30:53] roughly saying that and if n is smaller than I think I did this calculation at [00:30:56] than I think I did this calculation at some point and it's smaller into the [00:30:58] some point and it's smaller into the P to the 66 then [00:31:01] P to the 66 then actually the bone on the left hand side [00:31:02] actually the bone on the left hand side is better than bone on the right hand [00:31:03] is better than bone on the right hand side just because there's P to the 100 [00:31:05] side just because there's P to the 100 is too big right so so basically the [00:31:07] is too big right so so basically the comparison [00:31:09] comparison um basically if you use this asymptotic [00:31:11] um basically if you use this asymptotic language things becomes a little weird [00:31:13] language things becomes a little weird if you consider other dependencies on on [00:31:16] if you consider other dependencies on on other parameters for example if you have [00:31:18] other parameters for example if you have a dependency on the dimension a [00:31:20] a dependency on the dimension a dimension for machine learning for [00:31:21] dimension for machine learning for modern machine learning is very 
high. [00:31:23] So this is why asymptotics, even though they are very powerful, don't necessarily always apply to modern machine learning — just because the higher-order terms have their own dependencies: that hidden term has a dependency on p. That's the main issue, basically. [00:31:48] Okay, so what do we do — how do we fix this? The first thing we need to do is fix the language: in some sense, we must not only consider n going to infinity; we have to also consider the other quantities in this world. [00:32:03] So basically, what the non-asymptotic approach does is this: you only hide absolute constants in your bound. You have to hide something, because you
know, if you had to care about every constant, it would be too complicated for theory — it would be a lot of calculation. So here we allow ourselves to hide absolute constants, but we cannot hide any other dependencies or any other things. [00:32:39] So you are not allowed to hide a dependency on p as n goes to infinity. And by absolute constant, this really means a universal constant — something like 3 or 5, something you can replace by an actual numerical value. [00:32:59] And actually, to make everything easier, we're going to introduce the big-O notation. Sometimes this big-O notation has slightly different interpretations, so I wouldn't say I'm really defining it, but I'm going to be clear about what the big-O notation means from now on. So now,
big-O notation, from now on, only hides universal constants. And let me actually give a more technical definition, which is useful in some cases when you really do a lot of theory. [00:33:33] I'm not sure whether some of you have this confusion about whether you should use big O or big Omega — sometimes it can be confusing — so let me define what this big O really means. At least, this is what it means in this course; it may not be exactly the same in every paper, but I think people are converging to this interpretation. [00:33:58] Every occurrence of O(x) is a placeholder for some function f(x) such that, for every x, f(x) ≤ C·x for some absolute constant C > 0. So basically this is saying that if you
replace it — or, maybe more explicitly: it is saying that you can replace each occurrence of O(x) by some f(x) such that the statement is true. [00:35:06] So if you see a statement with a lot of O(x)'s in it, it means you can replace all of those occurrences of big-O notation by something more explicit such that the statement is still true. [00:35:20] This may seem too obvious as a definition of big O, which you're probably already familiar with, but in some cases — at least, I've seen so many cases where I got confused — I had to literally verify whether something satisfies this definition. Anyway. [00:35:36] Also, just for notational convenience, sometimes we write a ≲ b; this is just equivalent to: there exists an absolute constant C > 0 such that a ≤ C·b. And technically, if you really want to be
very rigorous, this statement should only apply to positive a and b — ideally you should write this only for positive a and b; that's my suggestion, because for negative values it just becomes a little confusing. [00:36:22] So the point here is that there's no limit involved when you define this big-O thing. It depends on the literature — sometimes when people define big O they have to define some limit — but here, in this course, big O just really means the above: there's no limit-taking; you don't have to think about any limits. (The x here could also be a function or some other more complex quantity.) [00:36:54] Okay, cool — these are just some notations. So now, the kind of bound we care about: in this notation, we are interested in
[00:37:10] In this notation, what we are interested in are bounds of the form: the excess risk, L(θ̂) − L(θ*), is at most something like O(f(p, n)), where p could be a dimension and n could be the number of data points. Of course you can replace this by a function of other things, but the point here is that after you write this, there is nothing else hidden in the big-O, only a universal constant. [00:37:47] And once you have this kind of language, you can compare things in a more proper way. In the next few lectures, our goal is basically to show how to prove bounds of this form. Sometimes the bound could be more complicated, depending not only on the number of parameters and the number of data points; it could also depend on the norm of the parameters, and so forth. The point is that we always hide only universal constants.

[00:38:14] Okay, any questions? [Student asks whether the replacement is "for some" or "for all" f.] Yes, that's very important: it is "for some." You literally only need the existence of one function f such that, if you replace the big-O in your statement by f, the statement is true. This is actually a very good question, because I got confused by this many times, so let's give an example. Say the excess risk is ≤ O(1/√n). What does this mean? It means that you can replace this (this O(1/√n) is your f) by, for example, 5/√n, such that the statement is exactly true. You don't need it to hold for every f: if it had to hold for every f, then replacing it by 0.1/√n would also have to be true, and that's too much. You only need the existence of one. Of course, if one f works, then any bigger f can also be substituted, but you only need one.

[00:39:40] Also, and maybe this is a little bit advanced, this interpretation also allows you to have a big-O in your condition. For example, you can write: for all n, if n ≥ O(p), then the excess risk is ≤ 1. (I'm not saying this particular statement is correct.) This statement would be interpreted as: if you replace this O(p) by 2p, or by some constant times p, then it becomes a correct statement. And note that it really is a big-O here, not an Ω, which is sometimes confusing.

[00:40:30] Okay, cool. So now let's move on to the key idea that we are going to use. To bound this excess risk, how do we achieve a bound like this? The key idea is to somehow say that L̂(θ) is close to L(θ), in some sense. I need to specify what I really mean by "these two functions are close": are they close at every θ, or are they close at a specific θ? Here is a small claim which tells you what you really need. Suppose L̂(θ*) is close to L(θ*), say L̂(θ*) − L(θ*) ≤ α, so the empirical and population losses are close at θ*; and also suppose they are close at θ̂, in the sense L(θ̂) − L̂(θ̂) ≤ α. And here, actually, only one-sided closeness is needed.
[00:41:37] So suppose you have both of these. Then this implies that L(θ̂) − L(θ*) ≤ 2α. So basically, you just need to show that the two loss functions, the empirical loss and the population loss, are close at θ* and at θ̂; then you can bound the excess risk by 2α. And the proof is actually very simple. The excess risk compares L at θ̂ with L at θ*, but your conditions involve comparing L with L̂, so you have to do a flexible rearrangement to connect them. What you do is write the difference as a sum of three terms:

L(θ̂) − L(θ*) = [L(θ̂) − L̂(θ̂)] + [L̂(θ̂) − L̂(θ*)] + [L̂(θ*) − L(θ*)].

First you compare L(θ̂) with L̂(θ̂), then you compare L̂(θ̂) with L̂(θ*), and then you compare L̂(θ*) with L(θ*).

[00:43:06] So why do you want to do these three things? Once you see it, it's kind of obvious: the first term is handled by the second condition, and the last term is handled by the first condition. And then you have the middle term, which compares θ̂ and θ* directly, but at L̂; and you know that L̂(θ̂) − L̂(θ*) ≤ 0, because θ̂ is the minimizer of L̂. So the middle term is at most zero, this term is at most α, and that term is at most α; in total you get 2α.
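This decomposition is easy to check numerically. Here is a minimal sketch (my own illustration, not from the lecture) using mean estimation under the squared loss, where θ* = E[y], the population risk is known in closed form, and the ERM solution θ̂ is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: y ~ Bernoulli(0.3); loss l(t, y) = (t - y)^2.
# Population risk L(theta) = (theta - mu)^2 + Var(y), minimized at theta* = mu.
mu, var = 0.3, 0.3 * 0.7
theta_star = mu

def L(theta):                       # population risk (closed form here)
    return (theta - mu) ** 2 + var

n = 200
y = rng.binomial(1, mu, size=n).astype(float)

def L_hat(theta):                   # empirical risk on the sample
    return np.mean((theta - y) ** 2)

theta_hat = y.mean()                # ERM minimizer for the squared loss

# alpha: the empirical/population gap at theta* and at theta_hat
alpha = max(abs(L_hat(theta_star) - L(theta_star)),
            abs(L_hat(theta_hat) - L(theta_hat)))

excess = L(theta_hat) - L(theta_star)
assert excess <= 2 * alpha + 1e-12  # the claim: excess risk <= 2 alpha
```

The assertion holds for any sample: the middle term L̂(θ̂) − L̂(θ*) is nonpositive by the definition of θ̂, and the other two terms are each at most α.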
[00:44:03] So basically, this is saying that it suffices to show the two conditions: the first condition is that L̂ and L are close at θ*, and the second condition is that L̂ and L are close at θ̂. [00:44:24] It turns out that the difficulty of proving these two inequalities is completely different. Let's say this is number one and this is number two: number one is very, very easy to prove, and number two will require a lot of work, which takes a few weeks. Well, maybe not a few weeks; more like two weeks.

[00:44:54] [Student asks why there is no absolute value.] Of course, if you put an absolute value there it's still true, and you can also bound the absolute value if you want. The only reason is that if you don't have the absolute value, the conditions are satisfied a little bit more easily; you need one fewer step. That's why in most of the books you don't have that step, and you also save a constant factor of two. Actually, this is a very good question: my first time teaching this, I used absolute values, and then later in the lecture I had to do additional steps to fix that constant, which was a little bit annoying. But fundamentally you are right; there is no real difference.

[00:45:46] [Student asks about the difference between the two inequalities.] Yes, I'm going to show the first inequality right now; the first inequality is very easy. And I'll tell you why they are different, because they sound very similar. Actually, maybe let me not talk about the difference first; let me first show inequality one and see why it's relatively easy.
[00:46:13] So the goal is to show inequality (1), and the main tool we're going to use is the so-called concentration inequality. [00:46:30] This is, in some sense, a non-asymptotic version of the law of large numbers: it's trying to prove the same kind of thing, but in a different language and in a stronger form. (It's not a non-asymptotic version of the central limit theorem.) And now you don't have to deal with a limit; you just have a bound that depends on n.

[00:46:55] I think probably some of you have heard of this inequality, called Hoeffding's inequality. It's taught in CS109 or some of the statistics classes, but anyway, you don't have to know it as a prerequisite, so let me state the inequality. It deals with a sum of independent random variables. Let X_1, ..., X_n be independent random variables, and suppose they are bounded: a_i ≤ X_i ≤ b_i almost surely, for every i. You can think of a_i and b_i as just constants, maybe zero and one. And we care about the mean, so define μ = E[(1/n) Σ X_i]. [00:47:57] The central question is: how different is the empirical mean, the average (1/n) Σ X_i, from the expectation? We care about how small this difference is. This is a random variable, so you have to make a probabilistic statement. The claim is that the probability that this difference is small is very big:

P( |(1/n) Σ X_i − μ| ≤ ε ) ≥ 1 − 2 exp( −2 n² ε² / Σ (b_i − a_i)² ).

Alternatively, you can say that the probability that this difference is big is very small; they're just the same. So the probability is very close to one, and the difference from one is this exponentially small number.
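As a quick numerical sanity check (my own simulation, not part of the lecture), one can compare the simulated deviation probability with the Hoeffding bound for Bernoulli(1/2) variables, where a_i = 0 and b_i = 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# X_i ~ Bernoulli(1/2): a_i = 0, b_i = 1, and mu = 1/2.
n, mu, eps, trials = 100, 0.5, 0.1, 20000

means = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)
tail = np.mean(np.abs(means - mu) >= eps)   # simulated P(|mean - mu| >= eps)

# Hoeffding: P(|mean - mu| >= eps) <= 2 exp(-2 n^2 eps^2 / sum (b_i - a_i)^2).
bound = 2 * np.exp(-2 * n**2 * eps**2 / (n * (1 - 0) ** 2))

# The simulated tail probability sits well below the bound.
assert tail <= bound
```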
[00:48:51] Okay, so this is the formal statement. Maybe let me try to interpret it a little bit by instantiating a special case. [00:49:00] If you define σ² = (1/n²) · Σ_{i=1..n} (b_i − a_i)², then σ² can be viewed as kind of the variance of (1/n) Σ X_i. This is not exactly the variance, but it is a kind of upper bound on the variance. Why? Look at Var( (1/n) Σ X_i ). The variance is quadratic in scaling, so first of all you get a 1/n² in front, and then inside you have the sum of the variances of the X_i (the variance of an independent sum is the sum of the variances):

Var( (1/n) Σ X_i ) = (1/n²) Σ Var(X_i) = (1/n²) Σ E[ (X_i − E[X_i])² ].

Now, each X_i is always between a_i and b_i, and as a consequence E[X_i] is also between a_i and b_i. Because both of these quantities are in this interval, their difference satisfies |X_i − E[X_i]| ≤ b_i − a_i, so you get (b_i − a_i)² for each of these terms. So the whole thing is smaller than (1/n²) · Σ_{i=1..n} (b_i − a_i)², which is exactly σ². [00:51:22] So basically, you can think of each (b_i − a_i)² as a bound on the variance of X_i, and then you take the sum of them and divide by n²; that's kind of the variance of the whole average.
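This variance bound can be verified exactly for a simple family of bounded variables; the two-point distributions and the particular a_i, b_i below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
a = rng.uniform(-1.0, 0.0, size=n)       # lower bounds a_i
b = rng.uniform(0.0, 1.0, size=n)        # upper bounds b_i
p = rng.uniform(0.1, 0.9, size=n)

# X_i = b_i with prob p_i and a_i with prob 1 - p_i, so a_i <= X_i <= b_i.
var_Xi = p * (1 - p) * (b - a) ** 2      # exact Var(X_i) for this two-point law
var_mean = var_Xi.sum() / n**2           # Var((1/n) sum X_i), by independence

sigma_sq = ((b - a) ** 2).sum() / n**2   # the proxy sigma^2 from the lecture
assert var_mean <= sigma_sq              # variance of the average <= sigma^2
```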
[00:51:42] Now suppose you take this view. Then what this inequality is saying is the following. Take ε = sqrt( C · σ² · log n ), where C is a large constant, for example C ≥ 10. So ε is something like O( σ · sqrt(log n) ): you take ε to be a little bit bigger than the standard deviation, by a factor of sqrt(C log n). Then you plug this ε into the Hoeffding inequality, and what you get is the following.
[00:52:42] This is actually the most interesting regime of the inequality; typically, when you apply it, you always use an ε at this level, because this is the useful regime. When you apply it, you get

P( |(1/n) Σ X_i − μ| ≤ O(σ · sqrt(log n)) ) ≥ 1 − 2 exp(−2C log n).

Maybe let's first not substitute ε; first substitute σ. You can see that the right-hand side, by my definition of σ², is the same as in the whole Hoeffding inequality; and then, plugging in ε, I get 1 − 2 exp(−2C log n) = 1 − 2 n^(−2C). (The 2 is also absorbed into the big-O.) [00:53:38] Now, recall that you can replace the big-O by a large constant. It's easier here if I just keep the C explicit: the exponent gives n^(−2C), so the probability is 1 − 2 n^(−2C); and if you pick this constant C to be something like 10, then you get 1 − 2 n^(−20). [00:54:19] So basically, this is saying that with very, very high probability, the difference is smaller than O(σ · sqrt(log n)). In other words, with high probability, say with probability larger than 1 − n^(−10), you have that the empirical mean is close to the expectation, in the sense that the difference between them is bounded by O(σ · sqrt(log n)).

[00:54:58] So basically, this is saying that if you think of σ as the (so-called) standard deviation, then it is very hard for the empirical mean to deviate from the true mean by something much larger than the standard deviation: this difference is the deviation from the mean, and the bound is the standard deviation up to a sqrt(log n) factor. Log factors in this course are not very important. Of course, this σ is not the real standard deviation; it's a perceived one. We're going to get back to this concept.
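To see this regime concretely, here is a small simulation sketch (the Bernoulli setup and the choice C = 2 are mine); Hoeffding predicts that deviations beyond σ·sqrt(C log n) occur with probability at most 2 n^(−2C), i.e. essentially never here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu, trials, C = 100, 0.5, 20000, 2.0

# Bernoulli variables: b_i - a_i = 1, so sigma^2 = (1/n^2) * n = 1/n.
sigma = np.sqrt(1.0 / n)
eps = sigma * np.sqrt(C * np.log(n))     # eps = sigma * sqrt(C log n)

means = rng.binomial(1, mu, size=(trials, n)).mean(axis=1)
frac = np.mean(np.abs(means - mu) > eps)

# Hoeffding bound here is 2 * n**(-2 * C) = 2e-8: exceedances are essentially
# impossible at this sample size, so the observed fraction is (near) zero.
assert frac < 1e-3
```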
[00:55:41] concept this is sometimes called there's a concept called um variance proxy so [00:55:45] a concept called um variance proxy so which we're going to talk more about it [00:55:46] which we're going to talk more about it so [00:55:48] so so in some sense it's kind of like if [00:55:49] so in some sense it's kind of like if you draw this it's kind of like you are [00:55:51] you draw this it's kind of like you are saying that this random variable [00:55:54] saying that this random variable this [00:55:57] suppose you call this x height a random [00:55:59] suppose you call this x height a random variable if you look at the distribution [00:56:01] variable if you look at the distribution of this random variable is something [00:56:02] of this random variable is something like this [00:56:04] like this and and the mean is Mu right suppose [00:56:06] and and the mean is Mu right suppose this is the meal and you look at [00:56:09] this is the meal and you look at something [00:56:11] something deviate from the mule by Sigma Square [00:56:13] deviate from the mule by Sigma Square login [00:56:15] login and then you are saying that the math in [00:56:17] and then you are saying that the math in this part is extremely small how small [00:56:19] this part is extremely small how small they are they are smaller than inverse [00:56:21] they are they are smaller than inverse polynomial of any right so the mass here [00:56:24] polynomial of any right so the mass here is smaller than into the minus 2C or [00:56:27] is smaller than into the minus 2C or maybe inverse probably in [00:56:40] so and you can see that this Bond cannot [00:56:42] so and you can see that this Bond cannot be much much smaller [00:56:44] be much much smaller so and and the wealthiest way to see it [00:56:46] so and and the wealthiest way to see it is that if this is really a sigma it's [00:56:48] is that if this is really a sigma it's really the standard deviation then your [00:56:49] really the 
standard deviation, then your bound cannot be improved much, right? Because for any random variable, there is always some probability mass near the mean, so the bound cannot be improved much. [00:57:06] Of course this is just intuition, because I would need to define what I mean by "not improved much," but intuitively this bound shouldn't be improvable by much, because for any random variable [00:57:21] there is always some mass, some constant amount of mass, within, say, plus or minus a few standard deviations, right? So if you really look at the interval defined by the standard deviation, there is always some constant mass inside it. So you cannot make this interval much, much smaller and get the same bound, because if you make it too small, then you have a lot of mass in it.
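The "variance proxy" promised above is usually formalized through sub-Gaussianity. The following is the standard definition; the lecture does not state it explicitly here, so I am assuming it is the one the course returns to later:

```latex
% X is sub-Gaussian with variance proxy \sigma^2 if its moment generating
% function is dominated by that of a Gaussian with variance \sigma^2:
\mathbb{E}\left[e^{\lambda (X-\mu)}\right] \le e^{\lambda^2 \sigma^2 / 2}
  \quad \text{for all } \lambda \in \mathbb{R}.
% A Chernoff argument then gives the two-sided tail bound
\Pr\left[\, |X-\mu| \ge t \,\right] \le 2\, e^{-t^2/(2\sigma^2)} .
% Setting t = O(\sigma\sqrt{\log n}) makes this tail at most n^{-\Omega(1)},
% which is exactly the small-tail picture drawn on the board.
```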
[00:57:59] So, okay, cool, now let's interpret this a little more, let's instantiate it even more. Let's take a_i to be of order, maybe, −1 (a negative constant) and b_i of order 1, right? This is typically the important case: the random variable is between minus a constant and a constant. [00:58:18] Then what you have is that the empirical mean minus the expectation is smaller than O(σ·√(log n)); this is the same thing I have written. And what is σ? σ is the square root of 1/n² times the sum of (b_i − a_i)². [00:58:37] Each of the b_i's and a_i's is of order one, so under the square root you get 1/n² times n, because there are n of these terms, which is 1/n. So σ is of order 1/√n, and that's the standard deviation of your mean estimate, of the empirical mean. [00:58:57] That's why, if you plug in this choice of σ, you get √(log n)/√n. [00:59:04] So basically you cannot deviate by more than that, and sometimes people write this as Õ(1/√n), [00:59:16] just to hide all the log factors. If you drop the log factor, this is basically saying that you cannot deviate by more than 1/√n. [00:59:27] This sounds very abstract for the moment, but in the long run you'll see that this kind of thinking will be used many times, and it's actually useful to just burn it into your head if you really do machine learning theory for life. You don't have to, but for me this is something I have already kind of burned into my head in some sense. [00:59:52] Any questions? [00:59:58] Okay, so this was a kind of short review.
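That Õ(1/√n) rate is easy to check numerically. Here is a minimal sketch, not from the lecture; the Bernoulli parameter, the sample sizes, and the number of trials are my own illustrative choices:

```python
import numpy as np

# Empirical means of n i.i.d. bounded (Bernoulli) variables concentrate
# around the true mean at the O~(1/sqrt(n)) rate from Hoeffding's inequality.
rng = np.random.default_rng(0)
p = 0.3                                   # illustrative Bernoulli mean

max_dev = {}
for n in [100, 1_600, 25_600]:
    # 500 independent empirical means, each computed from n draws
    means = rng.binomial(1, p, size=(500, n)).mean(axis=1)
    max_dev[n] = np.abs(means - p).max()
    # Hoeffding predicts deviations on the order of sqrt(log n / n)
    print(n, round(max_dev[n], 4), round(np.sqrt(np.log(n) / n), 4))
```

Multiplying n by 16 shrinks the worst observed deviation by roughly a factor of 4, which is the 1/√n scaling; the √(log n) factor is what pays for taking the worst case over many repetitions.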
[01:00:08] I'm not sure, I think CS109 probably doesn't get into this kind of detail, but this was just a review of Hoeffding's inequality with a lot of additional interpretation. Now, if you apply Hoeffding's inequality to our case, [01:00:26] let's see what we can get for the empirical loss. [01:00:35] Right, recall that our goal is to deal with the difference between L̂(θ) and L(θ). The first one is 1/n times the sum of the loss on each of the examples, [01:00:56] and the second one is really, literally, the expectation of the first, right? [01:01:06] So this is a perfect case to use Hoeffding's inequality, because the loss on example i corresponds to X_i. But Hoeffding requires a bound on the random variable, so we just assume, and in many cases the loss is indeed bounded, that the loss is bounded between zero and one. [01:01:29] You know, if the loss is not bounded you need a little bit more advanced tools to deal with it, but let's say for now the loss is bounded between 0 and 1. For example, if you do classification and use the zero-one loss, the loss can only be zero or one, so that satisfies this assumption: for every x, y, and θ, ℓ((x, y), θ) ∈ [0, 1]. [01:01:50] Then, if you apply Hoeffding's inequality, what you get is the following. This is a lemma, but really it's just an application of the inequality. For any fixed θ: [01:02:15] L̂(θ) is basically your average of the X_i's, right, where X_i is ℓ((x_i, y_i), θ). [01:02:35] So you can compute σ², the kind of variance that we were thinking about. The σ² I defined was the sum of (b_i − a_i)², for i from 1 to n, over n², and here that is 1/n² times n, which is 1/n. So that means that
L̂(θ) − L(θ) is less than O(σ·√(log n)) with high probability, right? And σ², sorry, σ² is 1/n, so this is O(√(log n / n)), [01:03:19] and you can also write this as Õ(1/√n). [01:03:22] So basically, for every fixed θ, the empirical loss and the population loss only differ by 1/√n, with high probability. [01:03:37] Sounds pretty good, right? We've shown that they are very close. And how close are they? The difference is 1/√n, which goes to zero as n goes to infinity, so it's supposed to be a small number, and there is nothing else hidden here. Of course you hide a log factor in n, but you don't have any factor of, for example, the dimensionality. [01:04:01] Any questions? [01:04:06] So, but there is a small issue, okay.
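For one fixed θ, the lemma can be checked numerically. A minimal sketch, where the data distribution and the threshold classifier are my own illustrative choices rather than anything from the lecture:

```python
import numpy as np

# For a FIXED theta, the empirical 0-1 loss lands within O~(1/sqrt(n)) of
# the population 0-1 loss.  Illustrative setup: X ~ Uniform(0,1), the label
# is 1{X > 0.5} flipped with probability 0.1, and h_theta(x) = 1{x > theta}
# with theta = 0.4 chosen before seeing any data.
rng = np.random.default_rng(0)
n, theta, noise = 100_000, 0.4, 0.1

x = rng.random(n)
y = (x > 0.5).astype(int) ^ (rng.random(n) < noise).astype(int)
emp_loss = np.mean((x > theta).astype(int) != y)

# Population loss, exact for this distribution:
#   x < 0.4: predict 0, wrong w.p. 0.1;  0.4 < x < 0.5: predict 1, wrong w.p. 0.9;
#   x > 0.5: predict 1, wrong w.p. 0.1.
pop_loss = 0.4 * 0.1 + 0.1 * 0.9 + 0.5 * 0.1    # = 0.18
print(abs(emp_loss - pop_loss), np.sqrt(np.log(n) / n))
```

The observed gap sits well inside the √(log n / n) scale, with no dependence on anything but n.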
should write: the probability that this is happening [01:04:31] is larger than 1 − n^(−Ω(1)). [01:04:39] Okay, and this is actually a good time to practice this big-O notation. Actually, here, wait, let me see: [01:04:59] should I put an Ω(1) in the exponent? [01:05:06] I think I should use Ω(1), but maybe I'll say c instead: there exists a constant c > 0 such that this is true, maybe. [01:05:26] Yeah, you see, sometimes this is confusing; on the fly I couldn't quite figure it out, but this is what we mean. Maybe let's just say the exponent is 10. [01:05:38] I think that is definitely a correct statement, because there is an O here, so you can hide everything in there. That's what I mean. [01:05:46] Okay, cool. [01:05:52] So this is a correct statement, but there is an important thing we should note here. What do I mean by "for any fixed θ"? What does this really mean, right? It really means that you need to first pick θ, and then, after you pick θ, you draw [01:06:14] the (x_i, y_i) i.i.d. from this distribution P. [01:06:21] Why do you have to do this? Because you want to make sure that the ℓ((x_i, y_i), θ) are independently distributed, are independent, for different i's. So if you pick θ first and then you draw the x_i's, then indeed these random variables X_i,
which equal the losses, are independent. [01:06:48] But this doesn't mean that you can do the same for a θ that depends on the x_i's, which is actually what I'm going to talk about next. [01:07:01] First of all, you can apply this for θ equal to θ*. That's allowed, because θ* is a universal quantity, right? You know that θ* exists even before you draw the samples. Why? Because θ* is the minimizer of the population risk, and the population risk doesn't depend on the samples, it only depends on the distribution. That's why you can apply this with θ = θ*, and that's how we got inequality (1): L̂(θ*) − L(θ*) is less than Õ(1/√n). [01:07:38] So now the question is whether you can apply this to θ̂, [01:07:45] and the answer is no, you cannot. [01:07:48] And it's not only because of some subtle mathematical rigor; it's not even correct, it's very far from correct, to apply it to θ̂. It's not a small mathematical nuance kind of thing. The reason is that there is a dependency issue, right? As I alluded to a bit before, the dependency is this: first you have θ*, which depends on the population distribution and exists before you draw the sample. Then you draw the sample, [01:08:24] and then you get θ̂, and then you can compute, for example, L(θ̂) or L̂(θ̂), these kinds of things. But θ̂ depends on the samples, so that means the ℓ((x_i, y_i), θ̂) are not independent of
each other. [01:08:55] So you cannot apply Hoeffding's inequality, because they are not independent random variables. [01:09:06] And this is kind of important, because if you really could apply this to θ̂, you would always get a 1/√n story, with no dependency on anything; machine learning would be much, much easier, we wouldn't have to think about it, the sample complexity would always be small. [01:09:29] So basically, for the next, well, two weeks, we are dealing with this: how do we deal with θ̂? [01:09:38] So the idea to fix this is called uniform convergence. [01:09:48] The starting point is that you can apply the whole thing to any θ that is predetermined before drawing the data; right, you can apply this to any θ that's predetermined before drawing the data.
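The dependency issue just described can be made concrete with a simulation. The construction below is entirely my own illustration, not from the lecture: labels are fair coin flips, so every predictor has population 0-1 loss exactly 1/2, yet the predictor selected for its small empirical loss appears far better than 1/2 on the sample:

```python
import numpy as np

# theta_hat is chosen AFTER seeing the data, so the fixed-theta Hoeffding
# guarantee does not apply to it.  Labels are fair coins, so every one of
# the d candidate predictors has population 0-1 loss exactly 1/2.
rng = np.random.default_rng(1)
n, d = 50, 200_000                        # n examples, d candidate predictors

y = rng.integers(0, 2, size=n)            # labels: Bernoulli(1/2)
preds = rng.integers(0, 2, size=(d, n))   # each row is one fixed predictor
emp_loss = (preds != y).mean(axis=1)      # empirical loss of each predictor

fixed_gap = abs(emp_loss[0] - 0.5)        # a predictor committed in advance
erm_gap = 0.5 - emp_loss.min()            # the empirically best predictor
print("|emp - pop| for a fixed predictor:", fixed_gap)
print("|emp - pop| for the selected one: ", erm_gap)
```

The pre-committed predictor's empirical loss stays within roughly 1/√n of 1/2, while the data-dependent minimizer deviates far more; that gap is exactly what uniform convergence has to control.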
[01:10:13] I guess this might sound a little bit self-contradictory, so what I really mean is that you want to prove the following. What we know now [01:10:27] is that for every θ, for every θ that has nothing to do with the samples, the probability that |L̂(θ) − L(θ)| ≤ ε is at least 1 − δ, for some ε and δ. [01:10:44] Of course I didn't specify exactly what ε and δ are, but this is the form of the theorem we can prove right now, and we have proved it. You can plug in θ equal to θ*, that's fine. But this is not the same as: the probability that, for all θ, |L̂(θ) − L(θ)| ≤ ε, is at least 1 − δ. [01:11:08] The second statement is what I'm going to prove in the next, well, two weeks, but these are two different statements. The second statement is saying that you first draw the samples, and then, after you draw the samples, for all θ these two functions are close.
[01:11:28] Maybe it's kind of useful to draw a figure, right? So there is a function L(θ); the x-dimension is θ and the y-dimension is L(θ). And now let's look at the empirical loss. [01:11:54] I guess, let me give an example where these two statements are different. So let's think about a case where there are only three possibilities; this is an artificial example, right? Consider the case where I say L̂(·) is the red function with probability one third, [01:12:20] and it's the orange function with probability one third, [01:12:29] and, maybe let's say (this is hard to see, I guess, so let me use a different color) it's the green function with probability one third.
[01:12:46] And so what you know is that, for every θ, for any fixed θ, if you look at the probability that L̂(θ) is different from L(θ), what's the chance that they are different? [01:13:04] For some θ, actually, all three functions agree with L, right, they're always the same there. But if you pick a point here, say, [01:13:25] then with probability one third L̂(θ) is this red function's value, which is different from L(θ), and in the other two cases, so with probability two thirds, L̂(θ) is equal to L(θ). [01:13:44] So basically, for every θ, the probability that L̂(θ) ≠ L(θ) is at most one third. So for every θ you have something like this, right? [01:14:04] On the other hand, look at a statement like this: the probability that, for all θ, L̂(θ) is close to L(θ). What is this saying? It is saying that these two functions are the same globally, and clearly, in any of the red, orange, and green cases, this probability is zero, [01:14:37] because in each of the three random cases the two functions are not the same everywhere, right? There are always some differences somewhere. [01:14:48] So that shows that you cannot easily switch the probability with the "for all" quantifier; they are just not switchable.
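The red/orange/green example can be checked mechanically. A minimal sketch; the 9-point θ-grid and the three candidate curves are my own stand-ins for the functions drawn on the board:

```python
import numpy as np

L = np.zeros(9)                       # population loss: identically 0 here

curves = np.zeros((3, 9))             # the three equally likely candidates
curves[0, 0:3] = 1.0                  # "red":    differs from L on thetas 0..2
curves[1, 3:6] = 1.0                  # "orange": differs from L on thetas 3..5
curves[2, 6:9] = 1.0                  # "green":  differs from L on thetas 6..8

# Pointwise: at each fixed theta exactly one of the three equally likely
# curves disagrees with L, so Pr[L_hat(theta) != L(theta)] = 1/3.
pointwise = (curves != L).mean(axis=0)

# Uniform: every one of the three cases differs SOMEWHERE, so
# Pr[for all theta: L_hat(theta) = L(theta)] = 0.
uniform = np.all(curves == L, axis=1).mean()

print(pointwise)   # 1/3 at every theta
print(uniform)     # 0.0
```

The pointwise probability of disagreement is 1/3 everywhere, yet the probability that the two functions agree simultaneously at all θ is zero, which is exactly the quantifier-swap gap.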
[01:15:03] Some of you have probably seen this kind of thing before; this is about union bounds, right? When you do union bounds there are always these questions about whether you can switch the probability with the "for all" quantifier, which we are going to talk about in a moment. [01:15:17] So basically, I hope this demonstrates that it's more difficult to prove a bound on L̂(θ̂) minus L(θ̂), that is, to prove inequality (2). [01:15:39] The take-home point is that it's more difficult to prove inequality (2). What is inequality (2)? Inequality (2) is about the difference between L̂(θ̂) and L(θ̂), and the reason it is harder is that θ̂ is a function of the dataset, so you lose the independence. [01:16:00] So the goal of many of the remaining lectures is to show that this is indeed bounded, using so-called uniform convergence. [01:16:11] And by uniform convergence, let me just summarize, I hope you already got some intuition here: we need to prove something like this, that the probability that, for all θ, L̂(θ) is ε-close to L(θ), is at least 1 − δ. [01:16:33] We need to prove something like this using some techniques, and you will see that you are going to get much looser bounds when you prove something like this; the ε and δ will be different from the ε and δ you can get when the "for all" quantifier is outside the probability. [01:16:54] I'll show how to prove this kind of bound in the next few lectures. [01:17:10] But note that this suffices, as expected: if I have claim (2), then you know that L(θ̂) − L(θ*) is bounded, using claim (1) as well, by the difference
[01:17:37] |L(θ*) − L̂(θ*)| plus |L̂(θ̂) − L(θ̂)|, and this is less than two times the sup over all θ of |L̂(θ) − L(θ)|. Right, so if you can show that for all θ the two risks are similar, then you have a bound for the excess risk. [01:18:00] So maybe, if you draw the picture here, basically what you want to show is: suppose this is the population risk L(θ); you want to show that with high probability your empirical risk is something like this, which is uniformly close to the population risk. That's the intuition we have. [01:18:26] And actually, in the second half of the course, after week five or week six, we're also going to talk about the fact that this picture is not entirely accurate.
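Returning to the decomposition L(θ̂) − L(θ*) ≤ 2·sup_θ |L̂(θ) − L(θ)|: here is a minimal numerical sketch (not from the lecture; the threshold-classifier class, the uniform data distribution, and all names are made up for illustration) checking that the ERM's excess risk never exceeds twice the largest empirical-vs-population deviation:

```python
import random

random.seed(0)

# Hypothetical setup: X = {0,...,9}, true label rule y = 1[x >= true_t].
# Hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
true_t = 4
thresholds = list(range(10))

def pop_risk(t):
    # Population 0-1 risk under uniform X: fraction of x where h_t
    # disagrees with the true labeling rule (no label noise).
    return sum((x >= t) != (x >= true_t) for x in range(10)) / 10

# Draw an i.i.d. training set and compute empirical risks.
n = 50
xs = [random.randrange(10) for _ in range(n)]
ys = [int(x >= true_t) for x in xs]

def emp_risk(t):
    return sum(int(x >= t) != y for x, y in zip(xs, ys)) / n

erm_t = min(thresholds, key=emp_risk)   # empirical risk minimizer
best_t = min(thresholds, key=pop_risk)  # population optimum (= true_t)

excess = pop_risk(erm_t) - pop_risk(best_t)
sup_dev = max(abs(emp_risk(t) - pop_risk(t)) for t in thresholds)

# Claim 2: excess risk <= 2 * sup_t |L_hat(t) - L(t)|.
assert excess <= 2 * sup_dev + 1e-12
print(excess, sup_dev)
```

The inequality holds deterministically for any draw of the data, since L̂(θ̂) ≤ L̂(θ*) by definition of the ERM; randomness only controls how small the sup-deviation is.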
[01:18:44] The picture is inaccurate in the sense that, indeed, in many cases the empirical risk is bounded within ε of the population risk, but it doesn't actually fluctuate like this. What really happens is something like: maybe you have a population risk like this, and the empirical risk is, first of all, close to the population risk, but it is also close in terms of the shape and the curvature. So it wouldn't be that fluctuating; it would look something like this instead. [01:19:24] So not only are they close in value, but also in terms of some other properties, maybe the curvature and the shape, they are also somewhat close. And this is useful for certain cases where you especially care about optimization. For example,
[01:19:44] if the empirical risk is very fluctuating, then it becomes harder to optimize, and we do care about optimization. Sometimes you want to show that the empirical risk also has nice properties for your computational purposes. [01:20:01] Okay, I guess that's a perfect stopping time. Okay, thanks.
================================================================================ LECTURE 003 ================================================================================
Stanford CS229M - Lecture 3: Finite hypothesis class, discretizing infinite hypothesis space
Source: https://www.youtube.com/watch?v=io-YFfXbIXk
---
Transcript
[00:00:05] Okay, so now let's talk about math. Last time, where we ended, we were talking about uniform convergence. We said that our goal for the next few lectures will be the so-called uniform convergence, which means that you want to prove that, with high probability, if you take the sup, or maximum (sup really just means maximum for this course), over the hypothesis class, and you look at the
difference between the empirical risk and the population risk, you want to show that this is small with high probability. [00:00:51] So this is the general idea, and we said that this is different from showing that, for every fixed h, with high probability, |L̂(h) − L(h)| is small. These two statements are of a different nature, because the order of the quantifiers is, in some sense, different. One requires that, with high probability, the empirical risk is close to the population risk for the entire class simultaneously; the other one says you look at just one single h, ask what the probability is that its population risk differs from its empirical risk, and you want to show that this event happens with high probability. [00:01:48] So in some sense the difference is kind of like a union
bound, which I'm going to talk more about next, when we get to proving this kind of statement. [00:02:00] So in this lecture we're going to talk about two cases for H. Certainly this statement depends on H; you cannot hope to prove things like this for every possible capital H. It does depend on the family of hypotheses you consider, and the bound actually depends on that family. So the first part is going to be about the finite hypothesis class, where H is assumed to be finite, |H| < ∞. [00:02:41] And the next part is going to be the infinite case, the infinite hypothesis class. For infinite hypothesis classes there are many different ways to achieve this kind of bound, and today we're going to talk about a relatively brute-force way to do it: in
some sense, you do a reduction to the finite hypothesis class. Essentially, no matter what you do, you are doing a reduction to the finite hypothesis class, but how you reduce to the finite case does matter. Today we're going to talk about the brute-force reduction, which does convey some intuition. [00:03:16] Okay, so that's a brief overview of what we're going to do in this lecture. [00:03:32] So let's talk about the finite hypothesis class, and here's the theorem we're going to prove. There are some conditions. The condition is, as we had last time: recall that last time we assumed the loss is between 0 and 1,
for every (x, y) and every hypothesis. This is true for the binary loss, the zero-one loss; it's not true for every possible loss, but if you have other losses you have to do a small fix to make this proof still work. This is not very essential; it's mostly for convenience. [00:04:07] And what we will prove is the following statement. We say that for every δ between zero and one half (this is not very important either; δ is just a small number), with probability at least 1 − δ, we have that for every h, |L̂(h) − L(h)| in absolute value is bounded by √((log|H| + log(2/δ)) / (2n)). [00:05:01] And recall that the reason why we care about this uniform convergence was that it's useful for us to bound the excess risk.
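The right-hand side of the theorem is easy to evaluate numerically. A small helper (illustrative; the function name and the example numbers are mine, not the lecture's) shows how the bound scales with n and |H|:

```python
import math

def uniform_convergence_bound(H_size, n, delta):
    """sqrt((log|H| + log(2/delta)) / (2n)): the deviation bound that
    holds simultaneously for all h with probability >= 1 - delta."""
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))

# The bound shrinks as n grows and grows (only logarithmically) with |H|.
b1 = uniform_convergence_bound(H_size=1000, n=10_000, delta=0.05)
b2 = uniform_convergence_bound(H_size=1000, n=100_000, delta=0.05)
b3 = uniform_convergence_bound(H_size=10**6, n=10_000, delta=0.05)
assert b2 < b1 < b3
print(b1, b2, b3)
```

Note how multiplying |H| by a thousand moves the bound much less than multiplying n by ten, reflecting the log|H| dependence discussed below.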
[00:05:10] We have shown that if you have this kind of uniform convergence, then you can prove that your excess risk is bounded. So, using what we discussed last time, as a corollary we also get a bound on the excess risk L(ĥ) − L(h*), where ĥ is the ERM solution: L(ĥ) − L(h*) is less than, you pay a factor of two in that derivation, something like 2·√((log|H| + log(2/δ)) / (2n)). [00:06:02] Okay, cool. So this is the theorem we're going to prove, and maybe, just before we prove it: you can see that the right-hand side of the bound does depend on the size of the hypothesis class. If you have a bigger hypothesis class, then your bound will be worse. So it's harder to prove this uniform convergence
[00:06:35] when you have a larger hypothesis class. And if you try to interpret this bound at the end, here this is the bound on the excess risk, and we can see that you need n to be bigger than the log of the size of H for the right-hand side of the bound to become meaningful: you want the excess risk to be something smaller than one at least, at the minimum, so you need n to be at least larger than log of the size of the hypothesis class. That's why you need enough samples to make this bound meaningful. And of course, as n goes to infinity, you get better and better bounds. I'll have a little more discussion after we prove the theorem. [00:07:18] Okay, so now let's try to prove the theorem. The outline of the proof is: first, for an individual h, you prove
[00:07:37] the simple version, basically as we discussed last time; and second, we take a union bound over all h. [00:07:49] Okay, so let's do the first step. Recall that last time we already did this for a fixed θ; here I'm just doing it a little more formally. Last time we used Hoeffding's inequality to show that L̂(θ) − L(θ) is something on the order of 1/√n. That's what we did, somewhat informally, with Hoeffding's inequality, and today I'm going to do a slightly more careful derivation to get all the dependencies and constants exactly. By the way, θ and h are the same thing here: with a finite hypothesis class you don't necessarily have a parameter, you may just list all the hypotheses,
and that's why it's called h; when you parameterize the class, you have the parameter θ. But for this purpose they are not different at all. [00:08:48] So let's apply Hoeffding's inequality from the last lecture, where aᵢ = 0 and bᵢ = 1, so the losses are bounded between zero and one. We get that for every h in H, if this h is fixed and you then draw your sample, and you look at the probability that |L̂(h) − L(h)| is less than ε, where the randomness comes from the dataset, which goes into L̂, then by Hoeffding this probability is at least 1 − 2·exp(−2n²ε² / Σᵢ(bᵢ − aᵢ)²). And because bᵢ = 1 and aᵢ = 0, the sum Σᵢ(bᵢ − aᵢ)² is n, so you get 1 − 2·exp(−2nε²).
[00:10:04] Right, this is because bᵢ = 1 and aᵢ = 0. [00:10:11] In other words, if you look at the other side of the bound, at the chance that they are different: the chance that |L̂(h) − L(h)| > ε is less than 2·exp(−2nε²). Actually, in many cases Hoeffding's inequality is stated this way instead of the way I showed before; they are exactly the same, just complements of each other: if you have a lower bound on the probability of some event, then you have an upper bound on the probability of the complementary event. [00:10:41] So for every h you have this: basically, for every h there is some kind of failure event, and this event happens with small probability. And now recall that you have the so-called union bound.
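Hoeffding's bound for a single fixed h can be checked by simulation. This is a rough Monte Carlo sketch under the assumption that the per-example losses are i.i.d. Bernoulli in [0, 1]; the parameters are arbitrary:

```python
import math
import random

random.seed(1)

n, p, eps = 200, 0.3, 0.1   # sample size, true risk L(h), deviation eps
trials = 2000

fails = 0
for _ in range(trials):
    # Empirical risk: mean of n i.i.d. Bernoulli(p) losses in [0, 1].
    L_hat = sum(random.random() < p for _ in range(n)) / n
    if abs(L_hat - p) > eps:
        fails += 1

observed = fails / trials
hoeffding = 2 * math.exp(-2 * n * eps * eps)  # = 2 e^{-2 n eps^2}

# The observed failure frequency should sit below Hoeffding's bound.
assert observed <= hoeffding
print(observed, hoeffding)
```

In this setting the true failure probability is far below the Hoeffding bound, which is the usual situation: the inequality is worst-case over all bounded losses.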
[00:10:57] The union bound says that if you look at the union of a bunch of events, let's say k events, then the probability of the union is at most the sum of the probabilities of the individual events. [00:11:13] And here, suppose E_h corresponds to the event that L̂(h) differs from L(h) by more than ε. Then the probability of the union of the E_h's, which basically says that there exists an h such that this event happens, that is, |L̂(h) − L(h)| is larger than ε, is less than the sum of the probabilities of each of the events. [00:12:07] Okay, so now you plug in what we have prepared; let's call that bound equation (1). If you plug in (1), then this becomes a sum over all the h's, where each term is 2·exp(−2nε²).
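The union bound itself is elementary; here is a toy sanity check with a fair die (the events are chosen arbitrarily, just to show P(∪Eᵢ) ≤ ΣP(Eᵢ)):

```python
from fractions import Fraction

# Events over a fair six-sided die: E1 = {1,2}, E2 = {2,3}, E3 = {3,4}.
events = [{1, 2}, {2, 3}, {3, 4}]

def prob(event):
    # Exact probability of an event under the uniform die.
    return Fraction(len(event), 6)

union = set().union(*events)
lhs = prob(union)                   # P(E1 u E2 u E3) = 4/6 here
rhs = sum(prob(e) for e in events)  # sum of the three 2/6 terms = 1
assert lhs <= rhs
print(lhs, rhs)
```

The overlap between the events is exactly what the union bound gives away, which is why the uniform-convergence bound below pays the full factor |H|.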
[00:12:27] Each of these events is small, and you multiply by the total number of possible events, which is the size of H; so basically you get 2·|H|·exp(−2nε²). [00:12:46] And you can see that this is basically what we wanted, because now we have the probability that there exists an h such that the two risks differ, and the complement of this is just equal to 1 minus the probability that for every h the flipped event is true. [00:13:13] By the way, I'm not distinguishing between strict and non-strict inequalities in most of this course. Technically I probably should write "less than ε" instead of "less than or equal to ε", but for this course I'm not super careful about this, because it doesn't really matter much. And
in many cases, actually, the probability that the deviation is exactly equal to ε is zero, so technically it's even correct either way. Anyway, this is not super important for this course. [00:13:50] So, because of this, you can see that this is what we care about: for every h, L̂(h) is close to L(h), and we are almost there. The only thing we need is to know what this quantity is: we need to upper bound it so that we can lower bound the probability. [00:14:09] Okay, so now let's choose ε: you want this probability to be bigger than 1 − δ, and that's why you want the failure probability to be less than δ. So basically we just need to choose ε so that
this probability becomes δ. Choose ε such that 2·|H|·exp(−2nε²) = δ. This involves solving the equation, which is not too hard; if you solve it, you get exactly what I had before: ε = √((log|H| + log(2/δ)) / (2n)). [00:15:12] So basically, with this ε, you know that the probability that there exists an h with |L̂(h) − L(h)| bigger than ε is less than δ, and then if you flip the event, you get the desired result. [00:15:50] Any questions so far? [00:16:05] Okay, so let me make a few remarks to interpret what we have done, and compare it with what we did in the first lecture.
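The algebra of solving 2|H|·exp(−2nε²) = δ for ε can be double-checked numerically (the values of |H|, n, δ here are arbitrary):

```python
import math

H_size, n, delta = 500, 2000, 0.05

# Solve 2|H| exp(-2 n eps^2) = delta for eps:
#   exp(-2 n eps^2) = delta / (2|H|)
#   2 n eps^2 = log(2|H| / delta) = log|H| + log(2/delta)
#   eps = sqrt((log|H| + log(2/delta)) / (2n))
eps = math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))

# Plug eps back into the failure probability; it should equal delta.
failure = 2 * H_size * math.exp(-2 * n * eps ** 2)
assert math.isclose(failure, delta, rel_tol=1e-9)
print(eps, failure)
```

Plugging the solution back in recovers δ exactly, confirming the closed form used in the theorem.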
[00:16:20] We'll compare with the asymptotic results, and you're going to see the following. For the asymptotic results, what you got is that the excess risk L(ĥ) − L(h*) is bounded by something like c/n plus lower-order terms, and recall that this c can depend on the dimension of the problem; it can hide any other dependencies on the problem. [00:17:14] And what we have now is that the excess risk L(ĥ) − L(h*) is smaller than, and here you hide only absolute constants, √(log|H| / n), and of course you also have a term like √(log(1/δ) / n). This second term is supposed to be relatively small, because the dependency on 1/δ is only logarithmic: you can take δ to be something
[00:17:56] For example, take δ to be something like n^(−10); then this term will be something like √(log n / n), which is negligible — or almost negligible — compared to the first term. [00:18:18] So basically, let's ignore this term for the comparison; then you can compare this bound with the asymptotic one. [00:18:29] The first thing you see is that we have a worse dependency on n: at least in terms of the leading term, the dependency on n was 1/n before, and now it's 1/√n, so it goes to zero slower. [00:18:55] However, this can be improved in certain cases, which we probably won't do in full.
[00:19:05] I guess in one of the homework questions you're going to be asked to improve this to some extent; in some cases you can improve this to 1/n, depending on the situation. But generally you get a relatively worse dependency on n compared to the asymptotics. [00:19:28] And one of the reasons why this is happening is that we didn't assume twice differentiability of the loss function. [00:19:53] Here, the only assumption we have on the loss function is that it's between zero and one, so it even works for the zero-one classification loss. But before, we did assume that the loss is continuous and differentiable — and I think we also assumed it's twice differentiable — so that does play a fundamental role here.
[00:20:16] When you don't have this differentiability, and when you don't have other assumptions, it's very hard — actually kind of impossible — to get O(1/n)-type rates. [00:20:34] But everything I'm saying here is about the downsides of our new bound. The main pro, which we already mentioned, is that now we don't hide any problem dependencies at all. [00:20:48] Recall that last time, when we motivated non-asymptotic bounds, we said that the asymptotic bound could hide a lot of things: it could hide, as an extreme example, something like p^50 / n², which would still be counted as o(1/n). That doesn't make a lot of sense, because if your dimension is too high, it requires n to be very big for the bound to be small.
[00:21:14] So this was the issue we mentioned last time about asymptotics, and now we've fixed that issue — that's the main benefit we gain: the bound doesn't hide anything about the dependencies. [00:21:31] Also, we can now see how the bound depends on the complexity, in some sense, of the hypothesis class: you can think of log|H| as a complexity measure of the hypothesis class. [00:21:51] If you've been to CS229, we talked about how you can overfit if you have too complex a function class but not enough data, and this is in some sense a mathematical characterization of that: if your function class is too complex, so that log|H| is too big, and you don't have enough data compared to log|H|,
[00:22:18] then you may have a worse bound. [00:22:21] On the other hand, suppose your log|H| is small and your n is bigger than log|H|; then you have a better bound, which could be meaningful. [00:22:32] Okay, any questions so far? [00:22:44] [Student question] — Oh no, this is the differentiability of the loss function. [00:22:51] The loss function is a function of — well, depending on how you think about it, but by differentiability I really mean this function that takes in ŷ and y and outputs a scalar: taking a prediction and the real label and outputting a scalar. The question is whether this function is differentiable with respect to y and ŷ. [00:23:15] We didn't assume that this function is differentiable here, but implicitly you were assuming that the loss function is differentiable with respect to y and ŷ in the previous asymptotic analysis,
[00:23:28] because there we actually assumed that the whole loss function, composed with the model, has to be differentiable. [00:23:46] [Student question] — Sorry, I didn't hear very well: for practical implementations, where you have floating-point numbers, can you use the same bound? [00:24:12] So let me rephrase the question a little bit, also for the people on the Zoom meeting. I think the proposal is: if you really have a practical model with p parameters, then when you actually implement it on a computer it's not continuous — you can think of each parameter as described by, say, 32 bits — and then you can count the total number of possible models and apply this bound. [00:24:46] Yes, that's a good idea,
[00:24:51] and that will give you the following. Suppose you have p parameters, each with 32 bits. [00:24:59] What does that mean? It means the total size of H would be: for every one of the p parameters you have 2^32 choices, and you raise that to the power p, so |H| = (2^32)^p, which means log|H| = 32p · log 2 — a constant times p. [00:25:22] So basically you get a bound that depends on the number of parameters. [00:25:31] This is reasonable in some cases and not very reasonable in some other cases, but it's definitely a bound that makes sense. In some of the later parts of the lecture we are going to see how to get a bound that doesn't depend on the number of parameters.
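The counting in this answer can be sketched numerically as follows (a hedged illustration with made-up sizes; `bits = 32` matches the 32-bit example in the discussion, and the function simply plugs log|H| = p·bits·log 2 into the finite-class bound):

```python
import math

def discretized_class_bound(p, bits, n, delta):
    """Finite-class Hoeffding bound when each of p parameters takes one of
    2**bits values, so |H| = (2**bits)**p and log|H| = p * bits * log(2) = O(p)."""
    log_H = p * bits * math.log(2.0)
    return math.sqrt((log_H + math.log(2.0 / delta)) / (2.0 * n))

# Made-up example: 100 parameters stored in 32 bits each, a million samples.
eps = discretized_class_bound(p=100, bits=32, n=1_000_000, delta=0.01)

# The bound scales roughly like sqrt(p / n): quadrupling p roughly doubles eps.
eps4 = discretized_class_bound(p=400, bits=32, n=1_000_000, delta=0.01)
```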
[00:25:52] But if you're fine with getting a bound that depends on the number of parameters, then this is indeed a good thought. [00:25:57] And this is actually a natural question that leads me to the second part of today's lecture. [00:26:04] The proposal has a small con, a small problem, which is that you basically have to say: I have to resort to the practical implementation — in practice I cannot really implement real numbers; I have to discretize them in some way. And sometimes you put an additional restriction on yourself, saying: if I can only use, say, 32-bit floating point, then what bound can I have? [00:26:35] What I'm going to discuss next is that you don't even need this. You can say that even for all the possible continuous models with p parameters,
[00:26:46] where each parameter is really a real number — suppose, for example, you had an almighty computer with infinite precision — your bound would still look like something like this. [00:27:01] So if we have this proof for the infinite hypothesis class, then you don't need this practical, discretized way of proving it; you can have a more genuine, stronger way to prove it, and that's where we're heading. [00:27:17] Okay, cool. [00:27:24] So maybe let's start to do that — let's talk about infinite hypothesis classes. [00:27:35] As I suggested a little bit before, we are going to have a bound that looks like √(p/n), where p is the number of parameters. [00:27:44] Okay, cool.
[00:27:49] So today we're going to do this so-called "brute-force discretization". [00:27:57] At least this is how I name this technique — because it is very brute force, I guess there's no real name for it. [00:28:10] What you do is the following. [00:28:16] Maybe let me state the theorem we're going to prove first, and then I can tell you the intuition and how to prove it. [00:28:23] So here's the setup. Suppose H is parameterized by θ in p dimensions: mathematically, H is a family {h_θ}, where each h_θ is a parameterized model and θ lies in some set Θ, a subset of R^p. Capital Θ is the set of parameters you are going to choose from. [00:28:58] In some sense, this is for convenience,
[00:29:02] though you probably won't yet see why it's only for convenience — it doesn't really matter. Suppose you only select models from the set where the norm of the parameters is bounded by B: Θ = {θ : ‖θ‖₂ ≤ B}. [00:29:17] Our dependency on B will be only logarithmic, so in some sense this is not really a real restriction: you can choose B to be pretty big, just because the dependency on B is very relaxed — logarithmic in B. [00:29:30] So this is our setup. [00:29:36] Also recall that we sometimes use these notations interchangeably: ℓ((x, y), θ) is really just the loss of the model θ on the data point (x, y) — you compare h_θ(x) and y and you get a loss. These two are just the same thing; we're abusing notation a little bit.
[00:30:02] And also recall that we have L(θ) and L̂(θ), all as we defined before. [00:30:09] So here's the theorem. We still assume that the loss is between 0 and 1 — this is almost always assumed in most of this course — for every x, y, and θ. [00:30:30] And suppose, as an additional assumption, that the loss function is κ-Lipschitz in θ. What does this really mean? It means that for every (x, y), |ℓ((x, y), θ) − ℓ((x, y), θ′)| ≤ κ‖θ − θ′‖₂: if you change your model from θ to θ′, your loss doesn't differ by more than a constant κ times the ℓ₂ norm of θ − θ′. [00:31:24] Again, our dependency on κ will also be logarithmic, so in some sense this is also not assuming much.
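As a concrete instance of this kind of Lipschitz assumption (my own example, not from the lecture; note the logistic loss is not bounded in [0, 1], so it only illustrates the Lipschitz part): for the logistic loss ℓ(θ) = log(1 + exp(−y⟨θ, x⟩)), the gradient in θ has norm σ(−y⟨θ, x⟩)·‖x‖₂ ≤ ‖x‖₂, so κ = ‖x‖₂ works for a fixed example (x, y).

```python
import math
import random

def logistic_loss(theta, x, y):
    """log(1 + exp(-y * <theta, x>)) with labels y in {-1, +1}."""
    z = -y * sum(t * xi for t, xi in zip(theta, x))
    return math.log1p(math.exp(z))

# ||grad_theta loss|| = sigmoid(z) * ||x||_2 <= ||x||_2, so kappa = ||x||_2
# is a valid Lipschitz constant in theta for this fixed (x, y).
x, y = [0.3, -0.5, 0.2], 1
kappa = math.sqrt(sum(v * v for v in x))

random.seed(0)
for _ in range(1000):
    t1 = [random.uniform(-2.0, 2.0) for _ in range(3)]
    t2 = [random.uniform(-2.0, 2.0) for _ in range(3)]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))
    gap = abs(logistic_loss(t1, x, y) - logistic_loss(t2, x, y))
    assert gap <= kappa * dist + 1e-9  # Lipschitz property holds
```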
[00:31:32] Because if your loss is somewhat continuous, then it's going to be Lipschitz to some extent — maybe the Lipschitz constant is not very good, but it would be something reasonable, and once you take its logarithm, the bound is not very sensitive to the Lipschitz constant. [00:31:56] With this, you get the following: with probability at least 1 − O(e^{−p}) — so actually the failure probability, e^{−p}, is even lower than before — for every θ you have the uniform convergence |L̂(θ) − L(θ)| ≤ O(√(p · max(1, log(κBn)) / n)).
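Collecting the setup and the spoken statement, the theorem can be written out as follows (my reconstruction from the audio; the exact constant inside the big-O, and the precise grouping of the max term, were not stated explicitly):

```latex
% Setup: H = \{ h_\theta : \theta \in \Theta \}, \quad
%        \Theta = \{ \theta \in \mathbb{R}^p : \|\theta\|_2 \le B \},
% with 0 \le \ell((x,y), \theta) \le 1 and \ell \ \kappa\text{-Lipschitz in } \theta.
% Then, with probability at least 1 - O(e^{-p}),
\forall\, \theta \in \Theta: \quad
\bigl|\, \widehat{L}(\theta) - L(\theta) \,\bigr|
\;\le\; O\!\left( \sqrt{ \frac{p \cdot \max\bigl(1,\ \log(\kappa B n)\bigr)}{n} } \right).
```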
[00:32:42] So eventually the dependency on κ and B is logarithmic — that's what I promised — and the main thing is really p/n: you get the dependency on the number of parameters and the dependency on n. [00:32:54] And you still have the square root here, so this is still worse than the asymptotic bound if you compare with the leading term of the asymptotic bound; but, as we said, you don't have the second, hidden term of the asymptotic bound. [00:33:09] Okay, so how do we prove this? Actually, the proof is very similar to what was suggested in the question: you do a discretization, and then you deal with the discretization error separately. [00:33:28] So what you do is the following. Let me start with a kind of sketch. You define E_θ to be the failure event that |L̂(θ) − L(θ)| > ε,
[00:33:50] where ε is going to be something TBD, but it will be very similar to the quantity in the bound — because you care about whether these two quantities differ by this much. Anyway, ε is some number that is kind of a placeholder. [00:34:06] You care about this kind of event, and we know, as we have shown for the finite case, that each E_θ has small probability. [00:34:21] Before, what we did was the union bound: the probability of the union of the E_θ is at most the sum of the probabilities of the E_θ. But now, because you have infinitely many θ, this sum is infinite: each term is some small but positive probability,
[00:34:50] and if you take the sum of infinitely many such things, you get infinity. [00:34:54] So that's why it doesn't work: you cannot use exactly the same argument as before. [00:35:02] But the reason this can be fixed is that this union bound is very pessimistic. [00:35:14] Think about the union bound — depending on how you learned it, but the picture I learned for the union bound is the following. You have the full probability space, and each event takes some part of the space: maybe this is E_{θ₁}, this is E₁, and this is maybe E₂. The union bound is tight when all of these events — call them failure events — are disjoint.
[00:35:50] If that's the case, then the probability of the union of these events is exactly the sum of the probabilities of each of the events. [00:35:55] But here it's not clear whether these events are disjoint, and actually they may have a lot of overlap: you have one θ, and if you change it to a nearby θ, you probably get an event E_{θ′} that overlaps E_θ a lot — and then your union bound starts to be very loose. [00:36:13] So that also kind of motivates the way we fix it. The way we fix it is the following: we don't take the union bound over all possible events; we first select a subset of events, take the union bound over those, and then say that the other events are close to this subset of prototypical events. [00:36:44] So basically, the rough idea is that you select some prototypical events —
maybe I should just say typical events, or you just take some example events. [00:37:02] This set of events is a smaller set of events than what you finally care about, and then you use the union bound on the subset, [00:37:16] and then you say that the other events are similar to the subset, to the examples. [00:37:29] So then you cover all the events; that's the rough idea. [00:37:34] So let's see how we exactly do this. To do this exactly, we need to introduce something called... any questions so far? [00:37:47] Okay, so to do this exactly, we need to introduce something called a discretization, or epsilon-cover. This is actually also a useful tool for other cases as well. [00:38:00] So let me first define this epsilon-cover and then say why: it's kind of a notion, a language, to describe what
[00:38:14] are called prototypical events, or prototypical parameters or models. So an epsilon-cover (sometimes it's also called an epsilon-net) of a set S, where S corresponds to the family of all models you care about, with respect to a metric rho (when you really define it, you have to specify the metric rho), [00:38:43] is a set C, which is also a subset of S, such that for every x in S there exists a kind of neighbor x' in C which is close to x, meaning rho(x, x') ≤ ε. Technically we don't have to require C to be a subset of S, but I think in almost all cases it is a subset of S. [00:39:19] If you draw this, it's kind of like you're saying that you have a set of models or parameters, this set is called S, and the epsilon-cover is a subset of S: you select some points, and let's call these points
[00:39:36] these are all in C. Then you say that this set C needs to satisfy the following to be your epsilon-cover: for every point x you pick in S, there exists a kind of neighbor in C; let's call it x'. [00:40:09] So you have a point x here, and you can always find some other point x' in C such that x' is kind of close to x. [00:40:20] So that's basically saying that all of these points in C are prototypical points, because every point in S can find a neighbor in C. Does that make sense? [00:40:39] And equivalently, you can also write this in the following way. So equivalently,
this is sometimes more self-explaining as to why it is called an epsilon-cover. So equivalently, you can write this as: S is covered by the union of the balls around all the x's in C, that is, S ⊆ ∪_{x ∈ C} B(x, ε, ρ). [00:41:06] Let me write it down and explain what this is. First of all, B(x, ε, ρ) is the ball centered at x with radius ε under the distance metric ρ. [00:41:29] So basically this is saying the following; this is the equivalent definition of epsilon-cover. You are saying that if you look at all the balls around all of these points, with radius ε, then in some sense each point covers its entire ball, because for every point in the ball you can use the center as its neighbor. [00:42:00] So basically every point covers some part of the space, and so the
[00:42:07] requirement is that if you look at all the balls around all the centers, then these balls cover the entire S. That means every point in S can be covered by some ball, and that means every point in S has a neighbor in C. [00:42:32] Any questions? Okay. [00:42:46] [In response to a question:] C may not be... we will in some sense insist that we need to find a C which is finite, and hopefully we also want the size of C to be small. But by the definition there is nothing about whether it's finite or not; we will construct an epsilon-cover that is finite. [00:43:09] Right, so far this is only a definition, saying that C is an epsilon-cover of S, but we will try to make C small. [00:43:31] And this is exactly what we're going to do next: how do we construct a set C that is finite and also covers the entire set?
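To make the covering condition concrete, here is a small numerical sketch (my own toy illustration, not from the lecture; `is_epsilon_cover` is a name I made up): it checks, for sampled points of S, that each one has a neighbor in C within ε under the L2 metric.

```python
import numpy as np

def is_epsilon_cover(C, S_samples, eps):
    """Check the covering condition numerically: every sampled point x of S
    must have a neighbor x' in C with ||x - x'||_2 <= eps."""
    C = np.asarray(C, dtype=float)
    for x in np.asarray(S_samples, dtype=float):
        nearest = np.min(np.linalg.norm(C - x, axis=1))  # distance to closest point of C
        if nearest > eps:
            return False
    return True

# Toy case: S = [-1, 1] (p = 1), C = a grid of spacing eps; every point of S
# is then within eps/2 of some grid point, so C is an eps-cover.
eps = 0.25
C = np.arange(-1.0, 1.0 + 1e-9, eps).reshape(-1, 1)
S_samples = np.linspace(-1.0, 1.0, 1001).reshape(-1, 1)
print(is_epsilon_cover(C, S_samples, eps))      # True
print(is_epsilon_cover(C[:2], S_samples, eps))  # False: two points can't cover [-1, 1]
```

Checking against random or grid samples of S is only a sanity check, of course; the lecture's construction below proves the property for every point of S.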
So what is S, right? For us, S is the set of all parameters: it's the set of parameters θ with L2 norm at most B, that is, Θ = {θ : ‖θ‖₂ ≤ B}. [00:43:49] Right, so you're going to construct a subset of parameters that can cover all the parameters. [00:43:54] So here is a lemma that says that you can do this: you can have a finite C, and actually you can have a reasonable bound on how many points are in C. [00:44:10] So let's define, I guess for this lemma I call this set Θ, so Θ is defined as above. For every ε ∈ (0, B], for every radius, there exists an ε-cover of the set {θ : ‖θ‖₂ ≤ B} with at most (3B/ε)^p elements. [00:44:48] So this is a cover, and the size of this cover is bounded by (3B/ε)^p. [00:44:56] Okay, so I think this is
actually... we're going to prove a weaker version; for the full version, we're going to have a homework question which kind of guides you through proving exactly this version. [00:45:12] So for now, in the lecture, we're going to prove a weaker version, which is somewhat easier. This weaker version also actually suffices for our purpose, right? You don't really need the stronger version to prove the final theorem, just because the weaker version is only weaker by a little bit. [00:45:34] I guess the homework will guide you towards the stronger version, which also introduces some techniques that are useful. [00:45:42] So here is the weaker version. The weaker version is pretty much: you discretize your parameters, right? You just do a trivial discretization using some grid. [00:45:52] So what you do is you just take C to be a trivial grid, in some sense.
So what does that mean? It really means that you have this ball, [00:46:07] I guess, if you have this ball, you just take some arbitrary coordinate system, say the natural coordinate system, and you discretize your space like this, [00:46:26] and then you take all of these grid points as your C, and that's it. [00:46:32] So then the question is just a matter of counting, and of how fine-grained your grid needs to be. [00:46:38] So formally, C is taken to be all the points x in ℝ^p such that each coordinate x_i is a multiple of ε/√p, that is, x_i = k · (ε/√p) for some integer k. So ε/√p is my grid size, and k is the integer multiplied with it, [00:47:12] with |k| ≤ B√p/ε. Why do I have this constraint on k? Because at some point you don't need more points; you already
don't have to do anything beyond this part, right? Because, you know, if your k is too big, you are outside S, and there is no point there. [00:47:34] And if you do the calculation, this is the right thing. [00:47:40] Right, so now we have to do two things: one thing is we have to see how large C is, and the second thing is we have to prove that this is an epsilon-cover. [00:47:48] So let's do the cover property first: why is this an epsilon-cover? This is because, if you look at any point x in S, you just round it to the nearest grid point, right? [00:48:14] Let's call the result x'. Let me not write exactly what the rounding means; it just means you take any vertex in this grid that is nearest, or nearly nearest; that's what I mean, you just
do the rounding on it; let's say you round towards a smaller number, it doesn't really matter that much. [00:48:36] So if you round it, what you get is that |x_i − x'_i| ≤ ε/√p, because for every dimension, when rounding, you incur at most ε/√p of error, right? ε/√p is your grid size. [00:48:56] And that means the distance between x and x' in the L2 sense... [00:49:08] sorry, I should mention that the metric ρ here is the L2 norm; yeah, I should have mentioned this: the metric we are using is always the L2 norm. [00:49:31] Right, so then if you look at the L2 norm of these two things, this is ‖x − x'‖₂ = (Σ_{i=1}^p (x_i − x'_i)²)^{1/2}, and then you bound each coordinate: you get (p · ε²/p)^{1/2}, which is ε. That's
actually why I chose the grid size to be ε/√p: just because I want it to come out to exactly ε. [00:50:00] So this proves that it is an epsilon-cover. [00:50:03] Right, and also we can count how large C is. So what is |C|? |C| is something to the power p, because for every coordinate you have a batch of choices for k. And how many choices of k are there? Basically, the constraint here is that |k| ≤ B√p/ε, so you've got B√p/ε choices; because k can be positive and negative, you multiply by two, and it can also be zero, so you add one. So |C| ≤ (2B√p/ε + 1)^p; that's the total number of choices in C. [00:50:44] Right, and one comment is that eventually only log |C| matters, as you'll see. So log |C| = p · log(2B√p/ε + 1), and that's why this weaker version is not super different from the stronger one.
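The grid construction, the rounding argument, and the counting can all be sanity-checked numerically. A minimal sketch under my own naming (`grid_cover` and `round_to_grid` are hypothetical helpers, not lecture code):

```python
import itertools
import numpy as np

def grid_cover(B, eps, p):
    """Grid cover of {x in R^p : ||x||_2 <= B}: each coordinate is k * eps/sqrt(p)
    for an integer k with |k| <= B*sqrt(p)/eps, as in the lecture."""
    h = eps / np.sqrt(p)                        # grid size per coordinate
    kmax = int(np.floor(B * np.sqrt(p) / eps))  # no point going beyond the ball
    axis = [k * h for k in range(-kmax, kmax + 1)]
    return np.array(list(itertools.product(axis, repeat=p)))

def round_to_grid(x, eps, p):
    """Round each coordinate to the nearest multiple of eps/sqrt(p)."""
    h = eps / np.sqrt(p)
    return h * np.round(np.asarray(x) / h)

B, eps, p = 1.0, 0.5, 2
C = grid_cover(B, eps, p)

# Covering property: per-coordinate rounding error <= eps/sqrt(p), hence L2 error <= eps.
rng = np.random.default_rng(0)
xs = rng.uniform(-B / np.sqrt(p), B / np.sqrt(p), size=(200, p))  # points inside the ball
errs = np.linalg.norm(xs - round_to_grid(xs, eps, p), axis=1)
print(errs.max() <= eps)  # True

# Counting: (2*kmax + 1)^p points; log|C| = p*log(2*B*sqrt(p)/eps + 1),
# vs the stronger lemma's p*log(3*B/eps): they differ only by the sqrt(p) inside the log.
print(len(C), p * np.log(2 * B * np.sqrt(p) / eps + 1), p * np.log(3 * B / eps))
```

Note that nearest-multiple rounding gives per-coordinate error at most ε/(2√p), so the L2 error here is actually ≤ ε/2; the lecture's bound of ε per the rounding to any nearby vertex is looser but sufficient.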
Because the difference, right: the stronger version was (3B/ε)^p, and the log becomes p · log(3B/ε). If you compare the stronger version with the weaker version, the only thing that's different is the √p inside the log, and that's why eventually it doesn't change the bounds too much. [00:51:23] Okay, cool. So this is our proof for the weaker version of the lemma. [00:51:30] And now let's use this lemma and the epsilon-cover to prove the final bound, right? So as we kind of planned, what we do is that we first apply the bound for a finite hypothesis class, the finite-hypothesis-class analysis, to C; let's say this is step number one. And then, step number two, you extend step one to the whole set S. [00:52:23] Okay, so now the first step should be
trivial, because we already proved it, right? So if you want to do step one, basically what you get is: first, for every fixed θ in C, you have P(|L̂(θ) − L(θ)| ≥ ε̃) ≤ 2 · exp(−2n·ε̃²), and this is by Hoeffding's inequality, exactly the same thing as we have done before. I guess let's call this ε̃ (epsilon tilde), because this ε̃ will be tuned, decided later on, to make the bounds fit. [00:53:12] And then you take a union bound: you get that the probability that there exists a θ in C such that this fails, i.e., P(∃ θ ∈ C : |L̂(θ) − L(θ)| ≥ ε̃), is small; and how small is it? You multiply |C| with this: 2 · |C| · exp(−2n·ε̃²). [00:53:43] Okay, so these two steps are exactly as we did before.
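Step one can be sanity-checked by simulation. A toy sketch of my own (the Bernoulli losses and all constants are made up, not from the lecture): each θ in a finite C gets i.i.d. losses in [0, 1] with mean L(θ), so Hoeffding applies per θ, and the union bound multiplies the tail by |C|.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps_t, trials = 200, 0.1, 2000
L = rng.uniform(0.2, 0.8, size=20)  # population risk L(theta) for each theta in C, |C| = 20

fails = 0
for _ in range(trials):
    # i.i.d. Bernoulli losses in [0, 1]: one row of n samples per theta in C
    losses = (rng.random((len(L), n)) < L[:, None]).astype(float)
    L_hat = losses.mean(axis=1)             # empirical risk per theta
    if np.max(np.abs(L_hat - L)) >= eps_t:  # some theta in C deviates by >= eps_t
        fails += 1

bound = 2 * len(L) * np.exp(-2 * n * eps_t**2)  # union bound: 2 |C| exp(-2 n eps_t^2)
print(fails / trials, "<=", bound)
```

The empirical failure frequency comes out well below the union bound here, as expected: the bound is worst-case over the loss distribution and ignores any overlap between the per-θ events.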
And if you flip this, you get that the good event happens with high probability; I'm just flipping the bound: with probability at least 1 − 2|C| · exp(−2n·ε̃²), every θ in C satisfies |L̂(θ) − L(θ)| ≤ ε̃. [00:54:10] Okay, so now we have to do the second step: how do we extend this to everything in S? [00:54:22] So, second, we are basically using Lipschitzness, and you can see that this is, you know, not really anything super clever; it's kind of a somewhat brute-force argument. [00:54:39] Okay, so just some quick preparation: because the loss ℓ(·, θ) is κ-Lipschitz in θ, this implies that L̂(θ) and L(θ) are both κ-Lipschitz in θ. Why? This is just because if you average functions that are all κ-Lipschitz, the result is still κ-Lipschitz, right? If f is κ-Lipschitz and g is κ-Lipschitz, then (f + g)/2 is also κ-Lipschitz, and you can prove this by a simple triangle inequality. [00:55:31] And you can do this for multiple functions, not only just two: you can do it for n functions.
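The preparation claim, that an average of κ-Lipschitz functions is κ-Lipschitz, is easy to check numerically; a toy sketch with two functions of my own choosing:

```python
import numpy as np

kappa = 2.0
f = lambda t: kappa * np.abs(t)      # kappa-Lipschitz
g = lambda t: np.sin(kappa * t)      # |g'| <= kappa, so also kappa-Lipschitz
avg = lambda t: 0.5 * (f(t) + g(t))  # claim: the average is kappa-Lipschitz too

# |avg(x) - avg(y)| <= (|f(x)-f(y)| + |g(x)-g(y)|)/2 <= kappa |x - y|  (triangle inequality)
rng = np.random.default_rng(1)
x, y = rng.uniform(-5, 5, size=(2, 10000))
mask = np.abs(x - y) > 1e-9  # avoid dividing by ~0 when x and y nearly coincide
ratios = np.abs(avg(x) - avg(y))[mask] / np.abs(x - y)[mask]
print(ratios.max() <= kappa + 1e-9)  # True
```

The same triangle-inequality argument goes through for the average of n functions, which is exactly the case of L̂ (the average of n per-example losses).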
Okay, so suppose we have this. Then, we also know... so we already know that for every θ in C, |L̂(θ) − L(θ)| ≤ ε̃; so suppose we condition on this event. Right, with very high probability this happens; suppose it happens, and let's condition on this event. We want to prove that the same thing happens when you replace C by the whole set. [00:56:06] So this means that for every θ in, I guess I call it capital Θ, not S, right, capital Θ is the ball, you can find a θ₀ in C such that ‖θ − θ₀‖₂ ≤ ε. This is by the definition of epsilon-cover, where C is an epsilon-cover of capital Θ; that's why you have this. [00:56:31] And then this implies that |L(θ) − L(θ₀)|
is less than κ · ε; this is by Lipschitzness. [00:56:48] And so in some sense you just use θ₀ as a reference point, right? So what you finally care about is L̂(θ) minus L(θ). Sorry, I guess you don't only know this; you also know that |L̂(θ) − L̂(θ₀)| ≤ κ · ε, which is also by Lipschitzness. [00:57:27] Right, and we have seen this kind of triangle-inequality manipulation already, because eventually you care about the difference between L̂ and L, but you use θ₀ as a reference point to kind of bridge them. So you do this decomposition: you say that L̂(θ) − L(θ) = (L̂(θ) − L̂(θ₀)) + (L̂(θ₀) − L(θ₀)) + (L(θ₀) − L(θ)). [00:57:59] And now the first and the last terms are about
differences between θ and θ₀. So this quantity is less than κ · ε, and this quantity is also less than κ · ε, and the middle quantity is less than ε̃, because θ₀ is in C; right, we have already proved that for everything in C, L̂ is close to L. So that's why we get these three inequalities. [00:58:31] So in total, if you look at the absolute value, you can apply the triangle inequality, so the absolute value of the sum is at most the sum of the absolute values of each of them, and you get |L̂(θ) − L(θ)| ≤ 2κ · ε + ε̃. [00:58:47] Oh, sorry, the last term is ε̃, with the tilde, because recall that I used a different epsilon for the concentration step, just so that I can tune this ε̃ eventually. [00:59:01] Okay, so now it's the time to set ε to be ε̃ over 2
Kappa or maybe you you can do it another [00:59:12] Kappa or maybe you you can do it another way around like Epson 20 to be absent [00:59:13] way around like Epson 20 to be absent times to cover [00:59:15] times to cover then you get [00:59:17] then you get you buy it so so that you balance these [00:59:19] you buy it so so that you balance these two hour terms you get this is less than [00:59:22] two hour terms you get this is less than uh [00:59:25] two apps on children [00:59:28] two apps on children okay [00:59:29] okay so [00:59:31] so okay [00:59:33] okay um [00:59:35] um all right so now let's look at the uh [00:59:40] um what's the [00:59:43] um what's the let's go back to here [00:59:46] let's go back to here right because here there is something [00:59:48] right because here there is something about the cover size we have a deal with [00:59:50] about the cover size we have a deal with right we have to plug in the right [00:59:51] right we have to plug in the right covers Us [00:59:52] covers Us so and what is the cover size so the [00:59:55] so and what is the cover size so the cover size was [00:59:59] so lock cover size [01:00:03] uh log C [01:00:07] uh log C is equals to log [01:00:10] is equals to log 3B over Epsilon to the power p and I [01:00:13] 3B over Epsilon to the power p and I have already set Epsilon to be absolute [01:00:14] have already set Epsilon to be absolute over 2 Kappa so I need to plug in that [01:00:17] over 2 Kappa so I need to plug in that so I got [01:00:18] so I got p and let's first look at this and then [01:00:22] p and let's first look at this and then let's plug in the choice of epson tilde [01:00:24] let's plug in the choice of epson tilde okay P log [01:00:27] okay P log three [01:00:29] three B copper absolute [01:00:34] and you can see that copper is inside [01:00:36] and you can see that copper is inside the log so so that's why it's somewhat [01:00:38] the log so so that's why it's somewhat not instant not sensitive 
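Written out, the decomposition and the balancing step just described look as follows — a reconstruction of the board work in the lecture's notation (θ₀ ∈ C the cover point closest to θ, ε the cover radius, ε̃ the concentration slack, κ the Lipschitz constant):

```latex
\hat L(\theta) - L(\theta)
 = \underbrace{\bigl(\hat L(\theta)-\hat L(\theta_0)\bigr)}_{|\cdot|\,\le\,\kappa\epsilon\ \text{(Lipschitzness)}}
 + \underbrace{\bigl(\hat L(\theta_0)-L(\theta_0)\bigr)}_{|\cdot|\,\le\,\tilde\epsilon\ \text{(since }\theta_0\in C\text{)}}
 + \underbrace{\bigl(L(\theta_0)-L(\theta)\bigr)}_{|\cdot|\,\le\,\kappa\epsilon\ \text{(Lipschitzness)}}
\quad\Longrightarrow\quad
\bigl|\hat L(\theta)-L(\theta)\bigr| \le 2\kappa\epsilon+\tilde\epsilon
 \;\overset{\epsilon=\tilde\epsilon/2\kappa}{=}\; 2\tilde\epsilon .
```

Setting ε = ε̃/(2κ) makes the two discretization terms match the concentration term, which is exactly the balancing described above.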
[01:00:43] And ε̃ is also inside the log, which is also nice.

[01:00:47] And now we have to take care of the failure probability. We basically want to say that it is at most something like δ — we want to bound the failure probability, 2|C| exp(−2n ε̃²). I want to show that this is small; actually, in this case, I'm hoping to show that it is at most exp(−p).

[01:01:22] Okay, so how do we show this? Of course it depends on what ε̃ is — you need to choose the right ε̃ such that this is true, and then that's basically your final bound. And just to get some intuition — you're going to see that the exact calculation is going to be a little bit complicated — here is a heuristic, which is not even technically correct, but it's approximately correct. Suppose, optimistically, that log |C| equals p, instead of p times log(3B over ε) — which is p log(6Bκ over ε̃) because of our ε̃. So suppose you just have p and you don't have the log part; then this becomes a very simple calculation. Basically, if you take the log of the desired inequality, you get log 2, which is not super important, plus log |C|, minus 2n ε̃². And supposing log |C| equals p, you get p minus 2n ε̃². And if you take ε̃ to be the square root of p over n, then this equals p minus 2p, which equals −p. So, taking the exponential back, 2|C| exp(−2n ε̃²) is at most about exp(−p). So this is fundamentally how it works.

[01:03:39] Okay, but we did make this incorrect assumption that log |C| equals p. This assumption is not very far off, though — it's only off by a log factor. If you want to fix it, technically you need to deal with that log factor; it wouldn't change much, but it introduces a little bit of complication. So I did do the calculation — I'm just going to basically write it down, but I don't actually expect you to follow all of it; it took me like one hour to even figure out all the constants and so forth. It's not super important — I think the intuition is already there.
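The back-of-the-envelope step above — pretend log |C| = p and take ε̃ = √(p/n) — can be checked with a few lines of arithmetic. This is only a sketch; the function name and the specific p, n are ours, not the lecture's:

```python
import math

def log_failure_prob(n, eps_tilde, log_cover):
    # log( 2|C| exp(-2 n eps_tilde^2) ) = log 2 + log|C| - 2 n eps_tilde^2
    return math.log(2) + log_cover - 2 * n * eps_tilde ** 2

p, n = 50, 10_000
eps_tilde = math.sqrt(p / n)                       # the heuristic choice from the lecture
val = log_failure_prob(n, eps_tilde, log_cover=p)  # optimistically take log|C| = p
# log 2 + p - 2p = log 2 - p, i.e. the failure probability is about e^{-p}
print(val)  # ~ -49.31 for these numbers
```

The exponent comes out to log 2 − p, matching the "−p up to unimportant terms" conclusion.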
[01:04:15] But let me just quickly write this down, just to say what you do formally. So we only have this bound: log |C| is at most p log(6Bκ over ε̃). And then let's take ε̃ to be the square root of a constant c₀ times p over n, times a max of certain log factors — this ε̃ is actually the final bound, which is why you're going to see the same expression again — where c₀ is a sufficiently large constant which we'll choose a little bit later.

[01:05:07] And you plug all of this in — again, you look at the log of the inequality we care about — and plugging in this choice, you get p log(6Bκ over ε̃) minus 2n ε̃². And you somehow know that it already works if you ignore the log; if you keep the log, you just still have to deal with it. So you get — I'm not even sure whether I really have to write down all of this, but just in case some of you want the hardcore calculation — you decompose the first term into a piece that is a constant times p and a piece with p times log factors in κ, B, and n. The way I always think about this kind of calculation is that you always check what happens if you don't have the log: without the log, this negative term is a large constant times p and the first term is p — that's why it's nice. So eventually, if you take c₀ to be something like 32 or 36, you can show that the negative term is bigger than the first one, and the remaining piece is dominated when p is large, and then you get that the whole thing is less than −p. There is a more detailed calculation in the notes, but it doesn't matter that much.

[01:07:22] So that's what we do. Basically, this is saying that if you take the exponential of this inequality, you get that 2|C| exp(−2n ε̃²) is less than 2 exp(−p) — so this is our failure probability. So with probability larger than 1 minus O(e^{−p}), we have that |L̂(θ) minus L(θ)| is less than 2ε̃ for all θ — which is the thing that we wanted; I'm not going to copy it again. Okay, cool — so that's the proof.
[01:08:10] This proof is a little messy, and that's probably one of the reasons why, if you open up a classical machine learning textbook, they typically don't show you this proof — just because it's a little messy. But actually, the reason I always try to show this proof is that I feel it's very intuitive, and it demonstrates what's really going on. Also, this kind of thing is actually useful for many reasoning steps if you look at the technical, low-level details. The fancier Rademacher-complexity machinery that we are going to talk about next is very nice, but sometimes it doesn't apply, and then you have to go back to the most brute-force way to think about it.

[01:08:55] Okay, so maybe just a few quick comments about this proof. I guess if you really think about it, this is really saying that you have
a generalization error bound: [01:09:10] it is less than — up to a constant factor, of course — the square root of log |C| over n, plus κ times ε. So the square-root part is from the finite hypothesis class case, and this one — this is κε — is the discretization error. And in some sense you're just trading off these two. What does it mean to trade them off? It really means: which ε do you choose. The first term depends on ε, but it depends on ε in a very weak way, because it depends on ε only through the logarithm — sorry, I think technically it should depend on log(1 over ε). That makes it very easy to trade off, because you can pick ε to be quite small so that the second term becomes small.
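The weak, logarithmic dependence on ε just described can be seen numerically. The functional form below — √(p log(3B/ε)/n) + κε — is our paraphrase of the bound on the board, with made-up constants:

```python
import math

def gen_bound(eps, p=50, n=10_000, B=1.0, kappa=1.0):
    # finite-class term sqrt(log|C| / n) with log|C| = p log(3B / eps),
    # plus the discretization term kappa * eps
    return math.sqrt(p * math.log(3 * B / eps) / n) + kappa * eps

# Shrinking eps by several orders of magnitude barely moves the first term
# (it only grows like sqrt(log(1/eps))) while the second term vanishes:
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, round(gen_bound(eps), 4))
```

Going from ε = 0.1 down to ε = 10⁻⁶ changes the total bound only slightly, which is the "forgiving trade-off" point made in the lecture.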
[01:10:28] So the smaller ε is, the better the second term but the worse the first term — but the first term increases very slowly as ε goes to zero. That's why you can pretty much ignore the trade-off in some sense: you just take ε very small so that the second term becomes negligible, and even for those small ε the first term is still reasonably bounded. That's why you can make this trade-off really nicely. But in some other cases, as you will see in later lectures — when we do other discretizations — the first term won't be as nice as this. It won't be log(1 over ε); it will be something that goes to infinity at a faster rate as ε goes to zero. In later cases this first term will sometimes be 1 over ε², and then the trade-off becomes
a little bit [01:11:17] more tricky, and you have to be more careful about the trade-off.

[01:11:25] And finally, just as a kind of overall overview, from a bird's-eye view: log |H| — or p, in this case — you can think of these as complexity measures; I guess I've mentioned this as well. These are complexity measures of the hypothesis class. And the general phenomenon is always that a bigger H means a worse bound, which means you need more samples to learn.

[01:12:16] And in some sense, in the next two weeks we are going to talk about a more accurate — I guess "accurate" may not be the right word — a more fine-grained complexity measure. So what is the right complexity measure? There is no really decisive answer to what the right complexity measure is; in some sense it's up to the theorem prover. But we're going to have a more fine-grained and, in some sense, more fundamental complexity measure in the next two lectures, which is called Rademacher complexity, and you can use that to derive many of these bounds in a more principled way. And in general, I think one of the important questions, especially in somewhat classical statistical machine learning, is to find out what the right complexity measure is for your hypothesis class. We're going to discuss what it really means to be right or wrong — there's no unique answer — but this is kind of the essential question: you need a complexity measure that really captures the fundamental complexity of the class. For example, if you have an infinite hypothesis class, you shouldn't use log |H| — log |H| is not really the fundamental complexity measure for an infinite hypothesis class; you should probably use dimensionality. And later in the course we are going to see that you can use the norm of your parameters as the complexity measure. So it does depend on the specific case, and sometimes it also depends on the data. This will be what we discuss in the next few weeks.

[01:14:01] I think this is a natural place to stop. Yeah, okay — I think that's all for today.

================================================================================
LECTURE 004
================================================================================
Stanford CS229M - Lecture 4: Advanced concentration inequalities
Source: https://www.youtube.com/watch?v=fKM6fcOkXuk
---
Transcript
[00:00:09] So, last time — in the last three lectures — we have talked about the basics of uniform convergence. Just a very quick review: I think we have proved, this is in lecture two, that the excess risk is bounded by this — the difference between the empirical and population losses.

[00:00:38] Oh right, thanks — yeah, sorry, I forgot to start the Zoom recording; it would have been a problem if I had forgotten to do that. Thanks for reminding me.

[00:01:30] All right, so this is the claim we showed in lecture two. Basically it is saying that you only have to bound the difference between the population and empirical losses for all θ. The most important term is the second one, because for the first term we have shown that it is bounded by something like 1 over the square root of n. So the goal is to bound the second term.
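The lecture-2 claim being reviewed here is, reconstructed in the course's notation (θ̂ the empirical risk minimizer, θ* the population minimizer):

```latex
L(\hat\theta)-L(\theta^\ast)
 = \bigl(L(\hat\theta)-\hat L(\hat\theta)\bigr)
 + \underbrace{\bigl(\hat L(\hat\theta)-\hat L(\theta^\ast)\bigr)}_{\le\,0\ \text{since }\hat\theta\text{ minimizes }\hat L}
 + \bigl(\hat L(\theta^\ast)-L(\theta^\ast)\bigr)
 \;\le\; \underbrace{\bigl|\hat L(\theta^\ast)-L(\theta^\ast)\bigr|}_{\lesssim\,1/\sqrt{n}\ \text{(single fixed }\theta\text{)}}
 + \sup_{\theta}\,\bigl|\hat L(\theta)-L(\theta)\bigr| .
```

The first term concerns one fixed hypothesis, so ordinary concentration handles it; the sup term is the one the uniform-convergence machinery is for.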
[00:01:55] We have discussed how to do it for a finite hypothesis class, and also how to do it for an infinite hypothesis class with a relatively brute-force discretization technique.

[00:02:06] And in the next few lectures, as I kind of mentioned before, we're going to have some other techniques to deal with the second term, so that we can get more informative bounds. And today we are going to take — in some sense a small digression, or in some sense a small preparation — for some of the tools that we're going to use in the next lecture. In the next lecture, what we're going to do is bound the expectation of sup over θ of (L(θ) minus L̂(θ)). So this is the next step.

[00:02:43] And this is an expectation over the randomness of the data — the quantity itself is a random variable, because it depends on the training data you have; L̂ depends on the training data. And next time we're going to bound this by a quantity which is called Rademacher complexity. So today we're going to do something that is a useful preparation for doing that. I guess here is the plan: next lecture we're going to do this, and then the lecture after, we're also going to deal with the difference between this sup quantity and its expectation.

[00:03:39] So that's the plan for the next lectures. And today, what we're going to do is get some tools that prepare us for proving quantities like this, so that next time we don't have to have a small section dealing with the tool in the middle — I'm trying to prepare us with the right tools for the next lecture.
the goal for this [00:04:06] following so the the goal for this lecture is the following so suppose you [00:04:09] lecture is the following so suppose you have some random variable [00:04:11] have some random variable X1 up to accent [00:04:13] X1 up to accent so they are independent [00:04:18] and random variables [00:04:20] and random variables so we're going to show two type of color [00:04:22] so we're going to show two type of color inequalities so the first type of [00:04:24] inequalities so the first type of inequality is to show that [00:04:27] inequality is to show that if you take the sum of these kind of [00:04:28] if you take the sum of these kind of random variables they are [00:04:30] random variables they are concentrated around the expectation this [00:04:33] concentrated around the expectation this is a [00:04:35] basically whole thing inequality is one [00:04:37] basically whole thing inequality is one type of piece inequality we are going to [00:04:39] type of piece inequality we are going to extend whole thing inequality to [00:04:40] extend whole thing inequality to something more uh General [00:04:43] something more uh General um and second thing is that we're going [00:04:45] um and second thing is that we're going to show [00:04:46] to show that for certain for certain [00:04:51] that for certain for certain type of function [00:04:55] f if you look at a general function not [00:04:58] f if you look at a general function not necessarily just the sum of this random [00:05:00] necessarily just the sum of this random variable of course you have to have some [00:05:02] variable of course you have to have some restrictions on what the functions F [00:05:03] restrictions on what the functions F will look like but suppose you have the [00:05:06] will look like but suppose you have the right restriction that you can show that [00:05:07] right restriction that you can show that even you have a function of X1 up to X N [00:05:10] even you have a 
[00:05:12] is still concentrated around the expectation of this function. And this will be particularly useful for showing this inequality, maybe let's call it (I) here. Because, in some sense, the first type of inequality corresponds to L hat(theta) being close to L(theta): L hat(theta) is of the form X1 + X2 + ... + Xn, and L(theta) is the expectation of L hat. And the second type of inequality will be useful for proving inequality (I), because if you care about something like this being roughly equal to its expectation, then you can view this entire thing as a function of your training data, viewed as a function of x1 up to xn, where these are the IID training data. So basically, these kinds of inequalities are called concentration inequalities.
[00:06:36] The key idea is that if you have a family of IID random variables, then, first of all, if you take the sum of them, the sum becomes Gaussian-like, and it concentrates around its mean. And the same thing also happens if you apply certain kinds of functions to the Xi's; I will tell you what kinds of functions have these properties. And these kinds of inequalities are not only useful for what we're going to do next, but also generally pretty useful for machine learning, for statistical learning theory, because in some sense, if you think about what happens in learning theory, in many cases you are basically trying to deal with the difference between an empirical distribution and a population distribution. So this kind of thing will show up in many, many different cases.
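The concentration claim above can be spot-checked with a quick simulation. This is my own illustration, not part of the lecture; all names and parameter choices in it are made up.

```python
# A quick simulation (not from the lecture) of the claim above: a sum of
# IID random variables concentrates around its expectation.
import random

random.seed(0)
n = 10_000      # number of IID summands X_i ~ Bernoulli(1/2)
trials = 200    # independent draws of the sum Z = X_1 + ... + X_n

# E[Z] = n/2 = 5000 and std(Z) = sqrt(n)/2 = 50, so even over many trials
# the deviation |Z - E[Z]| should stay within a small multiple of 50,
# far below the trivial worst case of 5000.
deviations = [abs(sum(random.random() < 0.5 for _ in range(n)) - n / 2)
              for _ in range(trials)]
max_dev = max(deviations)
print(max_dev)
```

On a typical run the largest deviation over all 200 trials is only a few multiples of std(Z) = 50, which is the Gaussian-like behavior the lecture is after.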
[00:07:28] And that's also one of the reasons why I isolated this part as a single lecture, to talk about the technique. If it were just some kind of tool that is only useful for one lecture, then we could just invoke it as a lemma, but here I think it's more useful than that. So that's why I want to also show you how to prove some of these things, and what the key lemmas are. I'm not going to prove all the inequalities I'm going to show today, but I'm going to talk about some of the more advanced versions of these inequalities, so that you know that they exist, and when you need to use them you can find the right tools. So that's the overview of the lecture. So I guess let's start with the simple version, where you have a sum of independent random variables. We have discussed this
[00:08:18] before, in an earlier context, and I'm going to have a more comprehensive discussion about it here. So let's consider a random variable Z which equals the sum X1 + ... + Xn, where the Xi's are independent. And so, as a warm-up: what if you ignore what the structure of Z is? Obviously you know that Z is the sum of independent random variables; what if you ignore this structure? You still have something you can show: you can still get an inequality showing that Z is close to its expectation. So here is the inequality, which is called Chebyshev's inequality. I think you've probably heard of this in some probability class. So Chebyshev's inequality is saying that the probability that Z deviates from
[00:09:29] the expectation of Z by some amount t is at most the variance of Z over t squared: P(|Z - E[Z]| >= t) <= Var(Z) / t^2. So it's pretty intuitive: if the variance of Z is small, then you have less deviation from the expectation, and of course if t is bigger, if you look at a bigger window, then there's a smaller probability of falling outside the window. Right, so in some sense, if you draw this: suppose you have a distribution that looks like this, and the mean is here, E[Z]. What this is saying is that if you look at the standard deviation of Z, and suppose you take t to be something like the standard deviation of Z times one over square root delta, you plug that into this inequality, and what you get is the following.
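Chebyshev's bound can also be checked empirically for a distribution whose mean and variance are known exactly. This is my own example, not from the lecture; I use an Exponential(1) variable because its mean and variance are both 1.

```python
# Empirical check of Chebyshev's inequality, P(|Z - E[Z]| >= t) <= Var(Z)/t^2,
# for Z ~ Exponential(1), which has mean 1 and variance 1.  (My own example.)
import random

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean, var = 1.0, 1.0  # true moments of Exponential(1)

results = []
for t in [1.0, 2.0, 3.0]:
    tail = sum(abs(z - mean) >= t for z in samples) / len(samples)
    bound = var / t ** 2
    results.append((t, tail, bound))
    print(f"t={t}: empirical tail {tail:.4f} <= Chebyshev bound {bound:.4f}")
```

For this distribution the true tail (about e^{-(t+1)} for t >= 1) is far below the 1/t^2 bound, which previews the lecture's point: Chebyshev holds for everything but is loose for light-tailed variables.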
[00:10:35] Let me just write it explicitly: P(|Z - E[Z]| >= std(Z) / sqrt(delta)) <= delta. So this is saying that if you multiply the standard deviation of Z by something like one over square root delta, where delta is less than one, then the probability in this tail is less than delta. Right, so this is, in some sense, the weakest form of concentration, which you always have, without using any structure about the random variable Z. However, this is not very strong, as we'll see, if you think about what happens with a Gaussian. Let me see whether I missed a constant here; yeah, okay. So if you think about the Gaussian
[00:11:52] distribution: suppose you know Z is Gaussian. So suppose Z is something like N(0, 1), or maybe it doesn't matter whether the mean is zero; say its mean is mu, a general Gaussian distribution where the standard deviation is sigma. Then what you know is that |Z - E[Z]| <= std(Z) * sqrt(log(1/delta)), up to a constant, with probability at least 1 - delta. So basically, if you have a Gaussian distribution, then for the same failure probability delta you have a stronger bound: the factor is sqrt(log(1/delta)) instead of 1/sqrt(delta). So in some sense, and I haven't proved this for you, but you can do the calculation, this is saying
[00:13:13] that the tail decays faster for a Gaussian. So basically, for a Gaussian, you only have to multiply by a little bit: suppose, for this Gaussian, you only consider the interval std(Z) * sqrt(log(1/delta)); then you know the rest, the tail, has probability less than delta. But if you don't know it's Gaussian, then you have to be a little bit more generous in terms of the interval that you draw. Okay, so in some sense, the goal that we're going to have is to show that if your Z is a sum of random variables, then it is more like a Gaussian instead of a general, worst-case Z, so that you have a better bound like this one instead of the bound from Chebyshev's inequality.
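The gap between the two interval widths is easy to quantify with a small back-of-the-envelope computation. This is my own illustration; the constant 2 log(2/delta) comes from the standard two-sided Gaussian tail bound, which is consistent with, though slightly more explicit than, the sqrt(log(1/delta)) factor on the board.

```python
# Width of the confidence interval, in units of std(Z), needed for failure
# probability delta: Chebyshev requires 1/sqrt(delta), while a Gaussian tail
# only requires about sqrt(2*log(2/delta)).  (My own illustration.)
import math

for delta in [1e-2, 1e-4, 1e-6]:
    chebyshev_width = 1 / math.sqrt(delta)
    gaussian_width = math.sqrt(2 * math.log(2 / delta))
    print(f"delta={delta:g}: Chebyshev {chebyshev_width:8.1f}  "
          f"Gaussian {gaussian_width:5.2f}")
```

At delta = 1e-6 the Chebyshev interval is 1000 standard deviations wide while the Gaussian interval is under 6, which is exactly the poly(n)-versus-sqrt(log n) contrast the lecture draws next.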
[00:14:14] And if you look a bit more carefully at the consequences of these two inequalities (maybe let's call this one number (3) and this one number (4)): if you have number (3), then if you take delta to be something like an inverse polynomial in n, then you know that with high probability, at least 1 - 1/poly(n), |Z - E[Z]| <= std(Z) * sqrt(log n). So basically you only lose a log factor if you want to make the probability very high. Right, so if you want a high-probability event, then you only have to multiply the standard deviation by sqrt(log n), and then the rest of the probability becomes very small. However, if you use (3)... wait, sorry, what I just used was number (4), sorry for the
[00:15:19] confusion. So if you use the number (4) Gaussian bound, then you get this; but if you use number (3), then if you take delta to be inverse polynomial, what happens is that with high probability you have this statement: |Z - E[Z]| <= std(Z) / sqrt(delta), which is std(Z) * poly(n). So there's a big difference between the additional factors here: if you compare these two factors, you have a big difference. So that's why we want the so-called faster tail, the smaller tail, as in inequality (4) instead of inequality (3). And a slightly alternative view, which we're going to switch between (in many cases the two views are equivalent, but we're going to switch between them very often): so the alternative view
alternative view [00:16:27] um uh very often so the alternative view is that [00:16:29] is that um you can say that Z minus expectation [00:16:31] um you can say that Z minus expectation Z is less than [00:16:33] Z is less than so for gaussian [00:16:36] so for gaussian what you have is that if you look at [00:16:38] what you have is that if you look at this if you view this quantity like this [00:16:40] this if you view this quantity like this then you have this is less than [00:16:41] then you have this is less than expectation minus [00:16:43] expectation minus 2T squared over [00:16:46] 2T squared over variance of C times n [00:16:49] variance of C times n so now you can compare this inequality [00:16:52] so now you can compare this inequality maybe let's call it five just [00:16:53] maybe let's call it five just temporarily versus the chapter of [00:16:55] temporarily versus the chapter of inequality one [00:16:57] inequality one so if you look at one then this is [00:17:00] so if you look at one then this is the right hand side is decayed with t in [00:17:02] the right hand side is decayed with t in a polynomial wave right so it's one over [00:17:04] a polynomial wave right so it's one over T square and if you look at five it [00:17:07] T square and if you look at five it decays exponentially fast as T goes to [00:17:09] decays exponentially fast as T goes to Infinity so that's another way to see [00:17:12] Infinity so that's another way to see the differences right so the the tail [00:17:14] the differences right so the the tail probability for gaussian distribution is [00:17:16] probability for gaussian distribution is depending very fast exponentially fast [00:17:18] depending very fast exponentially fast but if you use the Chip Shop inequality [00:17:21] but if you use the Chip Shop inequality you only get a polynomial fast decaying [00:17:23] you only get a polynomial fast decaying um inequality and that's the that's [00:17:26] um inequality and that's 
[00:17:28] another way to see the differences. Okay, so we are going to look for the faster tail; that's our goal. So the goal, just to repeat: Z behaving like a Gaussian, that's basically our goal. But of course, in what sense is it like a Gaussian? There are multiple different versions; we're going to formalize what it means to have a Gaussian-like tail. So, to do this formally, let's start with some definitions; actually, we're going to define what is meant by "Gaussian-like" to start with. So let's say a one-dimensional random variable X with finite mean mu = E[X] is called sub-Gaussian with parameter sigma if the following is true. Let me write it down; it's not very intuitive when you first look at it: E[exp(lambda * (X - mu))] <= exp(sigma^2 * lambda^2 / 2) for every real lambda.
[00:19:06] Right, so I'm not expecting that you can see what this really means, but this is a definition for something close to a Gaussian. So this is not very intuitive, but the corollary is the following. A corollary: if X is sigma-sub-Gaussian, then the following holds: P(|X - mu| >= t) <= 2 exp(-t^2 / (2 sigma^2)) for every t >= 0.
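A concrete instance may help here. This is my own example, not from the lecture: a Rademacher variable (+1 or -1 with probability 1/2 each) satisfies the formal definition with sigma = 1, because its moment generating function is cosh(lambda), and cosh(lambda) <= exp(lambda^2/2) for every real lambda (compare the Taylor series term by term).

```python
# Spot check that a Rademacher variable X (+1 or -1, each with prob. 1/2)
# is 1-sub-Gaussian under the formal definition: with mu = 0 we have
# E[exp(lam*(X - mu))] = cosh(lam), and cosh(lam) <= exp(lam^2/2).
import math

sigma = 1.0
checks = []
for lam in [-3.0, -1.0, -0.1, 0.1, 1.0, 3.0]:
    mgf = math.cosh(lam)                          # exact MGF of X
    bound = math.exp(sigma ** 2 * lam ** 2 / 2)   # sub-Gaussian bound
    checks.append(mgf <= bound)
    print(f"lambda={lam:+.1f}: MGF {mgf:8.4f} <= bound {bound:9.4f}")
print(all(checks))
```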
[00:20:02] So the corollary is probably intuitive: if X is sub-Gaussian, then you have this exponentially decaying tail bound. Right, so the right-hand side decays very fast as t goes to infinity; actually it's not only exponential, it's exponential in t^2. So this is, in some sense, a much more intuitive definition of sub-Gaussian, but the formal definition above will be more useful for mathematical cleanness. You can basically think of these two as equivalent; actually, they are somewhat equivalent. Maybe before talking about that, let's recall that if X is Gaussian, really literally Gaussian, with variance sigma^2, then this inequality, maybe let's call it (6), is true. I didn't prove this, but this
[00:21:15] is something relatively standard: if you have a Gaussian with variance sigma^2, then if you do some calculation, some integral, which is not super trivial, you have to do some calculation, but believe me, (6) is true. So basically, sigma-sub-Gaussian is saying that you have the same tail property as a Gaussian random variable with variance sigma^2, and also because of this, the sigma^2 in the sub-Gaussian definition is often called the variance proxy. So in some sense, if you are sigma-sub-Gaussian, then you can think of sigma^2 as a kind of pseudo-variance: it's not exactly the variance, but it's an alternative version of the variance, which actually is probably more important than the variance itself. So that's the rough
[00:22:25] intuition. And also, regarding these two definitions, this corollary (6) and the formal definition, maybe let's call the formal one (7): (6) and (7) are, in some sense, equivalent definitions, up to some small constant factor. So what does that mean? If you use (6) as the definition, or suppose you satisfy (6), then you know that X is O(sigma) sub-Gaussian under the formal definition. So, in some sense, if you don't care about the constant factor in front of the variance proxy, then these two definitions are interchangeable: (7) implies (6), and (6) also implies (7), up to a small constant loss. So basically, the way that I always think about this is that I think about (6) as the intuitive one; but when I really need to use some properties of sub-Gaussianity, when I
[00:23:40] really want to prove something, I typically use (7). And also, I didn't tell you why these two equations are somehow related; it still sounds mysterious why they are related, and here is the reason. I guess what I'm going to do is show that (7) implies (6); showing that (6) implies (7) requires a different proof (sorry, my numbering here is a little bit different from the numbering in my notes, so let me not confuse things). But if I show that (7) implies (6), you probably will get a little intuition for why they are related quantities. So the general intuition is the following: if you look at Chebyshev's inequality,
How do you prove Chebyshev's inequality? Something like this: you say the probability that Z − E[Z] is larger than t equals the probability that (Z − E[Z])² is larger than t², and then you use the so-called Markov's inequality to bound that by the expectation of this squared random variable over t²:

    P(|Z − E[Z]| ≥ t) = P((Z − E[Z])² ≥ t²) ≤ E[(Z − E[Z])²] / t².

The last step uses Markov's inequality (yes, I think it is called Markov), which says that for a nonnegative random variable Y and t > 0,

    P(Y ≥ t) ≤ E[Y] / t,

because if you have a lot of mass above t, then your expectation has to be high — that's basically the intuition.
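Markov's inequality can be sanity-checked numerically; here is a minimal sketch (my own illustration, not from the lecture), using an Exponential(1) variable, which is nonnegative with mean 1:

```python
import math
import random

random.seed(0)

# Markov's inequality: for a nonnegative Y and t > 0, P(Y >= t) <= E[Y] / t.
# Check it empirically for Y ~ Exponential(1), where E[Y] = 1.
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean_y = sum(samples) / len(samples)

for t in (1.0, 2.0, 5.0):
    tail = sum(1 for y in samples if y >= t) / len(samples)
    bound = mean_y / t
    # The empirical tail probability should sit below the Markov bound.
    assert tail <= bound, (t, tail, bound)
```

For Exponential(1) the true tail e^{−t} is well below E[Y]/t, so the check passes with a wide margin.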
So you can see that the way to prove Chebyshev's inequality is to raise to the second power. That means you can naturally also consider higher powers, apply Markov's inequality again, and get other types of inequalities. If you consider higher moments, you may get something like this — for example, look at the fourth power:

    P(|Z − E[Z]| ≥ t) = P((Z − E[Z])⁴ ≥ t⁴) ≤ E[(Z − E[Z])⁴] / t⁴.

The first step is still an equality: you just raise everything to the fourth power, so it's the same event. Then Markov's inequality gives the fourth central moment over t⁴.
So now you see that you have a better dependency on t — a faster decay in t — which is what we are looking for: we are ultimately aiming for an exponential dependency, and now we get something better than t⁻², namely t⁻⁴. Of course, there is a trade-off: the quantity on top might be bigger than the variance, since it is the fourth power of the deviation. So you get a better dependency on t, but a worse dependency in the numerator. You can keep doing this with higher powers — raise to the sixth power, the eighth power, and so forth — and indeed, especially in early works on concentration inequalities, people did raise to higher and higher powers.
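To see the trade-off concretely, here is a small sketch (my own illustration; the standard Gaussian and the specific t values are my choices), using the facts E[(Z − E[Z])²] = 1 and E[(Z − E[Z])⁴] = 3 for Z ∼ N(0, 1):

```python
# For Z ~ N(0, 1): second central moment = 1, fourth central moment = 3.
m2, m4 = 1.0, 3.0

def chebyshev_bound(t):
    return m2 / t**2      # second-moment (Chebyshev) bound

def fourth_moment_bound(t):
    return m4 / t**4      # fourth-moment bound

# Small t: the fourth-moment bound is worse (bigger numerator)...
assert fourth_moment_bound(1.0) > chebyshev_bound(1.0)
# ...but for large t its faster t^-4 decay wins.
assert fourth_moment_bound(5.0) < chebyshev_bound(5.0)
```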
It turns out there is a relatively simple way to deal with all the powers at once, which is called the moment generating function. This makes things cleaner: you don't have to deal with each power separately and see which one gives the best trade-off. So let's talk about moment generating functions. The moment generating function is exactly the quantity we used in the definition of sub-Gaussianity: the expectation of the exponential of the deviation between X and its expectation,

    E[e^{λ(X − E[X])}].

Why is this an interesting quantity? Because you can Taylor-expand the exponential inside the expectation.
The Taylor expansion is 1 + λ(X − E[X]) + (λ²/2)(X − E[X])² + …, and so forth. Written more formally, the coefficient of the k-th term in the expansion is λᵏ/k!, and if you swap the expectation with the sum you get

    E[e^{λ(X − E[X])}] = Σ_{k=0}^∞ (λᵏ/k!) · E[(X − E[X])ᵏ].

So you can see that this moment generating function is really a mixture of different moments: you have all the moments, and each moment carries a different weight. In some sense, what we are going to do is change λ so that we change the relative weights in front of all the moments, and thereby choose the right trade-off between which moments you are going to use.
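As a sketch of this "mixture of moments" view (my own example, not from the lecture): for a variable uniform on {+1, −1}, all odd central moments vanish and all even ones equal 1, so the truncated moment series should converge to the closed-form MGF cosh(λ):

```python
import math

# For X uniform on {+1, -1}: E[X] = 0, and the k-th central moment
# E[X^k] is 1 for even k and 0 for odd k.
def moment(k):
    return 1.0 if k % 2 == 0 else 0.0

lam = 0.7

# Truncated version of sum_k lambda^k / k! * E[(X - EX)^k].
series = sum(lam**k / math.factorial(k) * moment(k) for k in range(20))

# The closed-form MGF of this variable is cosh(lambda).
mgf = math.cosh(lam)
assert abs(series - mgf) < 1e-12
```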
If you choose the right λ, you focus on the right moment and get the right dependency. So that's the rough intuition. Formally, if you really do this mathematically, it's actually even simpler. Look at P(X − E[X] ≥ t). The way you do the trade-off is the following: instead of raising to a power, you exponentiate. The event is equivalent to its exponentiated version, and then you apply Markov's inequality to that version: for λ > 0,

    P(X − E[X] ≥ t) = P(e^{λ(X − E[X])} ≥ e^{λt}) ≤ E[e^{λ(X − E[X])}] / e^{λt},

where the last step is Markov's inequality.
Now you use the definition of sub-Gaussianity. I guess I should review the definition — maybe you already remember it: the definition of sub-Gaussianity is that the moment generating function is bounded by the exponential of λ² times a constant. That's the important thing: there is a λ² in the exponent; it's the exponential of some quadratic function of λ. So let's apply that. You get e^{σ²λ²/2} in the numerator, divided by e^{λt}, so this is

    e^{σ²λ²/2 − λt}.

Now you can see that in the exponent you have a quadratic, and this is a
quadratic in λ that opens upward. You can choose λ to be whatever you want — it's a free parameter — so you want to choose the λ that minimizes this quadratic, to get the best bound. Taking the best λ is relatively easy: the minimizer is the global minimum, so you just set the derivative to zero. The best λ turns out to be λ = t/σ², and if you plug that in, the bound equals

    e^{−t²/(2σ²)}.

So basically we have shown equation (6) — this tail bound is the corollary, equation (6) — starting from the sub-Gaussian definition (7).
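The optimization step can be checked numerically; here is a small sketch (my own, with arbitrary values of σ² and t) confirming that the exponent σ²λ²/2 − λt is minimized near λ = t/σ², where it equals −t²/(2σ²):

```python
# Exponent from the Chernoff step: sigma^2 * lam^2 / 2 - lam * t.
sigma2, t = 2.0, 3.0

def exponent(lam):
    return sigma2 * lam**2 / 2 - lam * t

# Scan a grid of lambda values for the minimizer.
grid = [i / 1000 for i in range(10_000)]
best_lam = min(grid, key=exponent)

# Calculus says the optimum is lam = t / sigma^2, giving -t^2 / (2 sigma^2).
assert abs(best_lam - t / sigma2) < 1e-2
assert abs(exponent(best_lam) - (-t**2 / (2 * sigma2))) < 1e-4
```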
So here you used the definition of sub-Gaussianity to get this tail bound for the random variable. You can also get the other side: so far you only know that X is not too much bigger than E[X] + t, but you can also bound the probability that X − E[X] is less than −t. How do you do that? The trivial thing is to flip: define X′ = −X. Then

    P(X′ − E[X′] ≥ t) = P(X − E[X] ≤ −t),

which is just a simple consequence of the definition. Then you apply what we have already proved to X′, and that implies the other side of the bound for X. But this is not super important — the two sides are basically the same for our purposes.

Okay, so what have I done so far?
I have defined sub-Gaussian random variables, and I've argued that there are basically two ways to think about them. One way: a sub-Gaussian random variable has a very fast-decaying tail. The other: a certain kind of moment — you can think of E[e^{λ(X − μ)}] as a moment in which all the individual moments are packaged together — is bounded in this form.

So far I have only talked about a single random variable. The reason I care about all this is the following theorem, which is the main point in some sense: if all the Xᵢ's are independent sub-Gaussian random variables, then their sum is also sub-Gaussian. You can compose them, and that's
the biggest benefit of sub-Gaussianity.

Theorem: suppose X₁, …, Xₙ are independent sub-Gaussian random variables with variance proxies σ₁², …, σₙ² respectively. Then their sum Z = Σᵢ Xᵢ is also sub-Gaussian, with variance proxy Σᵢ σᵢ².

As a corollary, because Z is sub-Gaussian with this variance proxy, you have concentration for Z of the exponential form

    P(Z − E[Z] ≥ t) ≤ exp(−t² / (2 Σᵢ σᵢ²)),

so you have a tail that decays exponentially fast. This is very useful and very important, because now, if you have a sum of independent variables and you want to know how fast its tail decays, you can just check whether each of them is sub-Gaussian. I'm going to prove this in a moment — the proof is actually just two lines, which is very cool.
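Here is a Monte Carlo sketch of the corollary (my own illustration, not from the lecture): I take each Xᵢ to be uniform on {±1}, and assume — as discussed later in the lecture for bounded variables — that each such Xᵢ has variance proxy 1, so the variance proxy of the sum is n:

```python
import math
import random

random.seed(1)
n, trials = 100, 10_000

def signed_sum():
    # Sum of n independent uniform {+1, -1} variables; E[sum] = 0.
    return sum(random.choice((-1, 1)) for _ in range(n))

t = 25.0
tail = sum(1 for _ in range(trials) if signed_sum() >= t) / trials

# Sub-Gaussian tail bound with variance proxy sum_i sigma_i^2 = n.
bound = math.exp(-t**2 / (2 * n))
assert tail <= bound, (tail, bound)
```

The bound here is exp(−625/200) ≈ 0.044, while the true tail at 2.5 standard deviations is well below that, so the empirical check passes comfortably.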
But before proving the statement, let me try to give you some examples of which random variables are sub-Gaussian. The applicability of this theorem depends on whether you can show that each Xᵢ is sub-Gaussian: if you can show each Xᵢ is sub-Gaussian with a pretty good parameter σᵢ, then the theorem applies and you get a pretty good bound for their sum. So which individual random variables are sub-Gaussian? Here are some examples.

By the way, whether your random variable is sub-Gaussian sometimes depends on what σ you choose: if you choose a bigger and bigger σ, there is at least more chance that it can be sub-Gaussian. Of course, it's not guaranteed that if you choose σ to be really, really big the variable will be
sub-Gaussian — that's not always guaranteed — but at least, intuitively, it's not a binary question. It's not that this variable is sub-Gaussian and that one is not; sometimes it depends on what parameters you choose.

For example, take the Rademacher random variable — this is the variable behind Rademacher complexity. A Rademacher random variable just means that X is uniform over {+1, −1}. I claim this is sub-Gaussian. The intuitive reason is that if you look at this random variable's distribution, it is a spike at +1 and a spike at −1, so the density decays very fast once you go outside [−1, +1] — in fact it immediately becomes zero. That's why it's
sub-Gaussian. Technically, you can prove the tail bound

    P(|X| ≥ t) ≤ 2 exp(−t²/c₀)

for some constant c₀ = O(1); let's say c₀ = 2. This is because if t ≤ 1, then the right-hand side is at least 2 exp(−1/c₀), which is larger than one when you take c₀ to be a big enough constant such as 2, so the bound holds trivially; and if t > 1, then the left-hand side is just zero, so the bound is also true. So the inequality always holds, which means the Rademacher random variable is O(1) sub-Gaussian — sub-Gaussian with variance proxy O(1).
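In fact, a Rademacher variable satisfies the formal MGF definition with variance proxy exactly 1, because its MGF is cosh(λ) and cosh(λ) ≤ e^{λ²/2} for all λ. A quick numerical sketch of that inequality (my own check, over an arbitrary grid of λ values):

```python
import math

# MGF of a Rademacher X: E[e^{lam * X}] = (e^lam + e^{-lam}) / 2 = cosh(lam).
# Sub-Gaussianity with variance proxy 1 means cosh(lam) <= e^{lam^2 / 2}.
violations = [lam for lam in (i / 10 - 5 for i in range(101))
              if math.cosh(lam) > math.exp(lam**2 / 2)]
assert violations == []
```

The inequality also follows term by term from the two Taylor series, since (2k)! ≥ 2ᵏ·k!.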
Similarly, you can prove that if |X − E[X]| is bounded by M — so suppose you have a random variable whose expectation is E[X], and if you look at a window [E[X] − M, E[X] + M], the density is literally zero outside it; you can have whatever density you want inside, but it is zero outside — then once you go beyond the window, the density doesn't just decay extremely fast, it simply becomes zero. That's why such a variable is O(M) sub-Gaussian. To formally prove this you still need to verify the definition, of course, but I guess it's intuitive that it's sub-Gaussian, because the tail vanishes completely once you leave the window.

And there is a
stronger claim, which also gets the right constant — here I only have O(M), but you can actually get a stronger statement with the exact constant. It says that if a ≤ X ≤ b almost surely — your random variable is almost surely bounded between a and b — then you can prove that the moment generating function satisfies

    E[e^{λ(X − E[X])}] ≤ e^{λ²(b − a)²/8}.

As we want, the exponent is quadratic in λ, and you care about the constant because the constant determines the variance proxy: this is saying that X is sub-Gaussian with variance proxy ((b − a)/2)², i.e., with σ = (b − a)/2.
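Here is a numerical sketch of this MGF bound (my own illustration, not from the lecture): for a Bernoulli(p) variable, which lives in [a, b] = [0, 1], the claimed bound is e^{λ²/8}, and the MGF can be computed exactly from the two outcomes:

```python
import math

# X ~ Bernoulli(p) takes values in [a, b] = [0, 1], so the claimed bound
# is E[e^{lam (X - EX)}] <= e^{lam^2 (b - a)^2 / 8} = e^{lam^2 / 8}.
def bernoulli_mgf(p, lam):
    # E[e^{lam (X - p)}], computed exactly from the two outcomes 0 and 1.
    return (1 - p) * math.exp(lam * (0 - p)) + p * math.exp(lam * (1 - p))

for p in (0.1, 0.5, 0.9):
    for lam in (-3.0, -1.0, 0.5, 2.0):
        assert bernoulli_mgf(p, lam) <= math.exp(lam**2 / 8) + 1e-12
```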
and this is actually a whole more [00:42:20] and this is actually a whole more question um it's not that tribute to [00:42:22] question um it's not that tribute to prove rate actually uh if you want to [00:42:23] prove rate actually uh if you want to get a right constant if you just want to [00:42:25] get a right constant if you just want to get some constant you know if I think if [00:42:27] get some constant you know if I think if you want to get instead of eight you [00:42:29] you want to get instead of eight you want to get 2 is relatively easy if [00:42:31] want to get 2 is relatively easy if you've got eight you need to do a little [00:42:33] you've got eight you need to do a little bit slightly more about it [00:42:37] bit slightly more about it um we will have some hint in the in the [00:42:39] um we will have some hint in the in the homework as well to help you to prove it [00:42:43] homework as well to help you to prove it um [00:42:44] um all right so so these are about all so [00:42:48] all right so so these are about all so this is all about bonded random [00:42:49] this is all about bonded random variables basically saying that if you [00:42:51] variables basically saying that if you have you have a monument variable it's [00:42:52] have you have a monument variable it's going to be subcaution and [00:42:55] going to be subcaution and um also this works for gaussian random [00:42:57] um also this works for gaussian random variables of course right so a gaussian [00:43:00] variables of course right so a gaussian random variable has to be sub-gaussian [00:43:01] random variable has to be sub-gaussian right so as we motivate it right so if x [00:43:04] right so as we motivate it right so if x is from mu Sigma Square then [00:43:07] is from mu Sigma Square then I guess formula you can prove the [00:43:09] I guess formula you can prove the following you can show that e to the [00:43:10] following you can show that e to the Lambda X [00:43:12] Lambda X 
[00:43:14] — you can compute this exactly: it equals e^{σ²λ²/2}. So X is sub-Gaussian with variance proxy σ². Okay. I think these — bounded random variables and Gaussian random variables — are probably the most important examples of sub-Gaussian random variables. And just a small remark: in the homework we're going to talk about something called sub-exponential random variables, which is a weaker notion than sub-Gaussian. This is precisely to deal with the fact that some random variables are not sub-Gaussian for any variance proxy you choose. Just to give you a rough sense of what that homework problem is about: when you define a sub-Gaussian random variable, you can take the corollary view from before.
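The Gaussian MGF identity just stated can be checked numerically — a sketch of mine, integrating the Gaussian density with a simple midpoint rule on a wide grid (μ, σ, and the λ values below are arbitrary illustrative choices):

```python
import math

# Numerical check (a sketch, not a proof) of the identity for X ~ N(mu, sigma^2):
#   E[exp(lam * (X - mu))] = exp(sigma^2 * lam^2 / 2),
# computed by midpoint-rule integration of the Gaussian density on a wide grid.
def gaussian_mgf_centered(mu, sigma, lam, n=100000, half_width=12.0):
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / n
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h  # midpoint of the i-th cell
        density = norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
        total += math.exp(lam * (x - mu)) * density * h
    return total

for lam in (-1.0, 0.5, 2.0):
    exact = math.exp(0.5 * (1.5 ** 2) * lam ** 2)
    approx = gaussian_mgf_centered(mu=0.7, sigma=1.5, lam=lam)
    assert abs(approx - exact) / exact < 1e-6
```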
[00:44:25] In this alternative view you have t²: you insist that the tail decay is exponential in t², which is a relatively strong requirement, and there are random variables that don't have this fast decay. One typical example: if you square a Gaussian, you get what's called the chi-square distribution, and that one doesn't have this fast tail decay — it's not t² in the exponent, it's t. For these random variables you still want to prove something about concentration, and you can still do it almost the same way as for sub-Gaussian variables, with some minor technical differences — and that's what one of the questions in homework one is about.
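To make the chi-square example concrete (my own sketch; the candidate variance proxies below are arbitrary, so this is an illustration rather than a proof): for Z ∼ N(0, 1), the MGF of Z² has the standard closed form (1 − 2λ)^{−1/2} for λ < 1/2, which blows up as λ → 1/2, while any sub-Gaussian MGF bound stays finite there.

```python
import math

# Sketch: for Z ~ N(0, 1), Z^2 is chi-square with 1 degree of freedom, with
# MGF E[exp(lam * Z^2)] = (1 - 2*lam)^(-1/2) for lam < 1/2 (standard closed
# form; centering only multiplies this by exp(-lam), which doesn't change the
# blow-up). Writing eps = 1 - 2*lam, the MGF is eps^(-1/2), which diverges as
# lam -> 1/2, while a sub-Gaussian bound exp(sigma^2 * lam^2 / 2) is at most
# exp(sigma^2 / 8) for lam <= 1/2 — so no variance proxy can work.
def chi2_mgf_from_eps(eps):
    # MGF of Z^2 at lam = (1 - eps) / 2, i.e. where 1 - 2*lam = eps
    assert eps > 0
    return eps ** -0.5

for sigma2 in (1.0, 10.0, 100.0, 1000.0):  # arbitrary candidate proxies
    sub_gaussian_cap = math.exp(sigma2 * 0.5 ** 2 / 2)  # bound at lam = 1/2
    # taking lam close enough to 1/2 (eps = 1e-300), the chi-square MGF wins
    assert chi2_mgf_from_eps(1e-300) > sub_gaussian_cap
```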
[00:45:28] So — any questions so far? Okay, so now let's prove this theorem about the additivity of sub-Gaussian random variables. Proof of the theorem. Our goal is to show that the sum of the Xᵢ is sub-Gaussian — that's the goal — and we just use the definition. We start with the definition: to prove something is sub-Gaussian, you need to look at the moment generating function. (Sorry — [inaudible] — okay.) So you look at the moment generating function, and here you can see the nice thing about it: because it's an exponential, it decomposes very easily. You can write it as exp(λ(X₁ − E X₁)) times the corresponding factor for X₂, and again, because the Xᵢ are independent, you can switch the expectation with the product, to get
[00:47:03] E[e^{λ(X₁ − E X₁)}] times E[e^{λ(X₂ − E X₂)}]. Okay, so this is using independence. And then you just say: I know that each of these random variables is sub-Gaussian, so I use the definition that each Xᵢ is σᵢ²-sub-Gaussian, and you bound this by e^{λ²σ₁²/2} times e^{λ²σ₂²/2} — this is by definition. And then you get e^{(λ²/2)·Σᵢσᵢ²}, which means that Σᵢ Xᵢ is (Σᵢ σᵢ²)-sub-Gaussian — so this is the variance proxy for the sum of the Xᵢ. And you can see the benefit of using this moment generating function — the exponential — which is that you can factorize the exponential easily. If you don't use the exponential — if you, for example, use the fourth
[00:48:31] power, or the eighth power, you wouldn't have such a nice, simple proof. Any other questions? Okay. So that's the first part of the lecture, which is about sums of independent random variables, and now I'm going to talk about more complex functions of independent random variables. So now I'm going to ask: how does this kind of thing concentrate? In some sense, you want to say that when this function f is close to a summation — in some weak sense — then you still have a very similar type of bound. That's the spirit; what it means to be "close to a summation," we'll see. So here is the theorem — one of the theorems, and actually something we're going to use in future lectures — which is called McDiarmid's inequality.
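The key independence step in the proof above can be sketched in code (the supports and probabilities below are made-up illustrative values): for independent discrete X₁, X₂, the MGF of the sum factorizes into the product of the individual MGFs.

```python
import math
from itertools import product

# Sketch of the factorization step: for independent discrete X1, X2,
#   E[exp(lam * (X1 + X2))] = E[exp(lam * X1)] * E[exp(lam * X2)].
X1 = {-1.0: 0.5, 1.0: 0.5}
X2 = {0.0: 0.2, 2.0: 0.3, -1.0: 0.5}

def mgf(dist, lam):
    # E[exp(lam * X)] for a finite-support distribution {value: probability}
    return sum(p * math.exp(lam * x) for x, p in dist.items())

def mgf_of_sum(d1, d2, lam):
    # joint expectation, using P(X1=x1, X2=x2) = P(X1=x1) * P(X2=x2)
    return sum(p1 * p2 * math.exp(lam * (x1 + x2))
               for (x1, p1), (x2, p2) in product(d1.items(), d2.items()))

for lam in (-2.0, 0.3, 1.5):
    lhs = mgf_of_sum(X1, X2, lam)
    rhs = mgf(X1, lam) * mgf(X2, lam)
    assert abs(lhs - rhs) <= 1e-9 * abs(rhs)
```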
[00:49:57] So there are a bunch of conditions. Suppose you have a function f satisfying the so-called bounded difference condition. What the bounded difference condition means: for every i, for every choice of x₁ up to xₙ, and for every xᵢ′ — by the way, these are lowercase x's, because I haven't introduced any random variables yet; these are just generic numbers — where xᵢ′ will be used as a replacement for xᵢ. You look at these two quantities: one is f applied to x₁, …, xₙ, and the other is f applied to x₁, …, xₙ but with xᵢ replaced by xᵢ′ — so basically you replace one coordinate by something else.
[00:51:22] And you look at what kind of change you can make by doing this, and you assume that the maximum change you can make is cᵢ. So basically this is saying that the function is not very sensitive to changing a single variable — a single coordinate of the input. And if you have this so-called bounded difference condition, then you can say the following: let X₁ up to Xₙ — now they are capital X — be independent random variables; then the probability that f(X₁, …, Xₙ) deviates from its expectation by more than t is at most exp(−2t² / Σᵢ₌₁ⁿ cᵢ²).
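An empirical sanity check of the inequality just stated (my own sketch, not from the lecture; the coordinate count, trial count, and t values are arbitrary choices): take f to be the mean of n values in [0, 1], so each cᵢ = 1/n and the two-sided bound reads P(|f − E f| ≥ t) ≤ 2·exp(−2nt²).

```python
import math
import random

# Empirical sanity check (not a proof) of McDiarmid's inequality for
# f(x_1, ..., x_n) = (x_1 + ... + x_n) / n with coordinates in [0, 1]:
# each c_i = 1/n, so sum c_i^2 = 1/n and the two-sided bound is
#   P(|f - E f| >= t) <= 2 * exp(-2 * n * t^2).
rng = random.Random(0)  # fixed seed so the check is reproducible
n, trials = 50, 20000
expected_f = 0.5  # E f for i.i.d. Uniform[0, 1] coordinates

devs = []
for _ in range(trials):
    f = sum(rng.random() for _ in range(n)) / n
    devs.append(abs(f - expected_f))

for t in (0.05, 0.10, 0.15):
    empirical = sum(d >= t for d in devs) / trials
    bound = 2.0 * math.exp(-2.0 * n * t * t)
    assert empirical <= bound  # the empirical tail sits below McDiarmid's bound
```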
[00:52:38] So in other words — equivalently — you're essentially saying that f(X₁, …, Xₙ) is sub-Gaussian with variance proxy something like Σᵢ cᵢ², maybe up to a big-O, since there's some constant you may lose along the way. This is using the equivalence of the two definitions: the tail bound is the more intuitive definition of sub-Gaussian, and if you convert to the formal MGF definition, you lose a constant. [Student question, roughly:] You said this condition makes f behave like a sum — in what sense does it look like a sum, that replacing xᵢ by xᵢ′ changes f by at most cᵢ? Yeah, that's a very good question — and before answering, I forgot
[00:53:50] to repeat the question; from now on I'll try to repeat the questions. The question was: I mentioned that you want conditions on f which make it similar to a sum — so why is this condition similar to a sum? First, a small clarification: "similar" here is in a very weak sense; you'll see that in some sense these conditions are not very similar to being a sum. They're only similar in the sense that you want to make sure no coordinate strongly influences the final outcome. When you have a sum, if you change one coordinate, you don't influence the final outcome much — and here it's the same thing. So basically, whether it's a sum or not doesn't matter; it's really about whether
[00:54:46] you have a certain kind of Lipschitzness property. So maybe, just briefly, we can also verify that this condition covers the sum, at least — that would probably be useful. Suppose f(x₁, …, xₙ) = Σᵢ xᵢ, and each xᵢ is bounded above by something like bᵢ and below by aᵢ. Now suppose you change one of the xᵢ — how much can you change the final outcome? You can say that you have the bounded difference condition with cᵢ = bᵢ − aᵢ, because that's the biggest change you can make by changing one coordinate xᵢ — that's the maximum range of change for the sum. But you can see that you can imagine many other functions that have this property which don't look like a sum at all.
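The sum example just verified can be checked by brute force on a tiny discrete grid (a sketch of mine; the grids are made-up stand-ins for coordinates bounded in [aᵢ, bᵢ]):

```python
from itertools import product

# Brute-force check that f(x) = sum(x) satisfies the bounded difference
# condition with c_i = b_i - a_i, on a small discrete grid per coordinate.
grids = [(-1.0, 0.0, 2.0), (0.0, 1.0), (-3.0, 3.0)]  # coordinate i ranges over grids[i]
a = [min(g) for g in grids]
b = [max(g) for g in grids]

def f(x):
    return sum(x)

c = []
for i in range(len(grids)):
    worst = 0.0
    for x in product(*grids):
        for z in grids[i]:  # replace coordinate i by z and measure the change
            y = list(x)
            y[i] = z
            worst = max(worst, abs(f(x) - f(tuple(y))))
    c.append(worst)

for i in range(len(grids)):
    assert c[i] <= b[i] - a[i] + 1e-12  # for a sum, c_i is exactly b_i - a_i
```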
[00:55:49] So indeed, more precisely, I think the intuition is that you want this function f to be somewhat Lipschitz — in some sense not super sensitive to individual coordinates. That's the general intuition. [Student question, repeated below:] The question was: why don't you just assume that f is Lipschitz? This is a very good question, and the very short answer is that we don't know how to prove that version — we don't know how to prove that if f is Lipschitz, then you have this result. The longer version is that people have actually been trying — a lot of researchers, especially mathematicians, have worked on this area — and there's a question about what the right definition of Lipschitzness is. You'll see in a moment: I'm going to show two more general versions, and they
[00:56:57] have slightly different definitions of Lipschitzness, or different takes on the intuition of Lipschitzness. And they are somewhat complicated — not as clean as you'd expect — mostly because there are technical challenges in those cases. You'll also see a case where, if the Xᵢ are Gaussian, you have a very clean theorem: literally, as you said, you just assume f is Lipschitz. We'll get there in a moment. [Another student question.] Right — so I guess your question is that here you need this absolute bound, in some sense, to make sure you have the bounded difference condition. You need some quantities to be absolutely bounded — for example, in some cases you need the xᵢ to be absolutely bounded between aᵢ and bᵢ. And this is a little bit different from the
[00:58:03] intuition we had about sub-Gaussian: before, we were saying that if each random variable has a fast tail, then the sum also has a fast tail; but here you need some kind of absolute restriction. This is actually related to the answer I gave before: if you look into all the technical details, it's actually not that easy to deal with a tail that can go to infinity. So there are some technical challenges here which prevent us from having something super clean, I would say. For example, if you know the Xᵢ are Gaussian, we will see that you have a very clean theorem; but if you don't know the Xᵢ are Gaussian, then it's technically very complicated to deal with the tail of each of the Xᵢ. And in some sense you can imagine —
[00:58:57] maybe this is a little bit too advanced, but for example, suppose the tail of Xᵢ is sub-Gaussian — suppose Xᵢ is just Gaussian — and suppose the function f squares Xᵢ somewhere inside. Now Xᵢ becomes Xᵢ², and the tail becomes slower, as I said: when you square it, it becomes chi-square, so the tail becomes slower, and if you take the fourth power it becomes even slower. So you have to somehow balance this: it's not only about the input, it's also about what f does. If f does something bad — for example, squares the Gaussian, or raises it to a higher power — then the tail becomes slower and your concentration becomes worse. So that's the challenge. Okay, so let me proceed with the more general version; then I'm going to talk about the Gaussian version, and then at the end, supposing I have time, I'm
[00:59:53] going to prove this theorem here — this theorem is something we can prove ourselves without doing a lot of hard work — but the theorem I'm going to introduce next has a much more challenging proof. So this is a more general version. I think this is Theorem 3.18 in this book by — I guess if you look at the lecture notes there is a formal reference — this is van Handel; it's a book on probability theory. In this book, basically what happens is they extend this bounded difference condition to something milder, and the definition is: you start with this quantity, Dᵢ⁻ f, defined to be f(x₁, …, xₙ) minus the inf over z of f(x₁, …, x_{i−1}, z, x_{i+1}, …, xₙ). So basically you're saying: you look at x, you change one of the coordinates, and you want to see how much you
[01:01:14] can make it smaller. Because this quantity is always at least zero, you're basically asking how much you can make f smaller by changing one coordinate to z — instead of inf you can just think of it as a min. The difference between this and before is that before, in McDiarmid, you required Dᵢ⁻ f(x) ≤ cᵢ for every x. But here you don't insist on that — you keep x as an argument of this Dᵢ⁻ f. So you define the sensitivity at every point: you don't assume a global sensitivity, you talk about the sensitivity at x. That's one quantity, and then you can also define the sensitivity on the other side, Dᵢ⁺ f, which is just a sup instead. So these are sensitivities at every point, but they are not global sensitivities. And now you can
[01:02:33] define a global quantity, which is the sup over all x — but now, before taking the sup, what's inside the sup is the sum over i of these quantities squared. So basically — maybe let me first write down all the definitions and then interpret them. (This subscript is i minus 1, not minus 1.) And then let me write the conclusion: you get that the probability that f(X₁, …, Xₙ) minus its expectation is larger than t is at most exp(−t² / (4‖Σᵢ (Dᵢ⁻ f)²‖_∞)). So you have a slightly different bound for the upper side and the lower side — which is probably not important in most cases, but for the sake of completeness let's write both of them: the lower tail has the same form with Dᵢ⁺. And X₁ up to Xₙ are independent, of course.
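Since the definitions are scattered across the board work above, here is the statement collected in one place — a reconstruction of my reading of the board, following van Handel's notes, so the constant 4 and which of D⁻/D⁺ pairs with which tail should be checked against the cited reference rather than taken as authoritative:

```latex
\begin{aligned}
D_i^- f(x) &= f(x_1,\dots,x_n) - \inf_z f(x_1,\dots,x_{i-1},z,x_{i+1},\dots,x_n),\\
D_i^+ f(x) &= \sup_z f(x_1,\dots,x_{i-1},z,x_{i+1},\dots,x_n) - f(x_1,\dots,x_n),\\
\Pr\Big[f(X_1,\dots,X_n) - \mathbb{E} f > t\Big]
  &\le \exp\!\Big(-\frac{t^2}{4\,\big\|\sum_{i=1}^n (D_i^- f)^2\big\|_\infty}\Big),\\
\Pr\Big[f(X_1,\dots,X_n) - \mathbb{E} f < -t\Big]
  &\le \exp\!\Big(-\frac{t^2}{4\,\big\|\sum_{i=1}^n (D_i^+ f)^2\big\|_\infty}\Big),
\end{aligned}
```

where ‖g‖_∞ denotes the sup over all inputs x, so the sum over coordinates is taken at each point before the sup.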
[01:04:12] I guess the important thing is: what are this D+ and D-, and how are they different from McDiarmid? I think the difference is that the c_i in McDiarmid is: you take the sup over x_1, ..., x_n (and x_i') of D_i^+ f(x_1, ..., x_n) — this is the c_i, the global sensitivity for the i-th coordinate — and the sum of the c_i^2 is the variance proxy. So in McDiarmid you take the sum over i from 1 to n, and inside you take the sup over x_1, ..., x_n of (D_i^+ f(x_1, ..., x_n))^2. Basically, you look at the global sensitivity for every coordinate and then take the sum over coordinates. And here the difference is that in this D+ or D-, you are first taking the sum of the sensitivities over all coordinates at a fixed point x — you first take the sum — and then you take the sup.
[01:05:26] So it's probably not that easy to find a concrete example that shows the difference between these two, but you can imagine that the order of doing the sup and the sum does matter. It's possible, for example, that you have a point x such that you are very sensitive in only one coordinate, and not very sensitive in the other coordinates; then taking the sum first and the maximum afterwards is more advantageous. And in some sense, I think mathematicians spend a lot of time thinking about how you change this order. The best thing you can hope to do is take the sup at the very, very end — but this one actually still has a small sup somewhere in the middle, because in the definition of D_i^+ f you still have a sup inside.
[01:06:14] So the best thing would be to define the sensitivity at every point — like a gradient — and then take the sup at the very end, which is what I'm going to show for the Gaussian distribution. But this is the best we know for general distributions: you look at your sensitivity at every coordinate, you take the sum of all the sensitivities, and then you take the sup over x — but the sensitivity has to be defined in this particular sense. Does it make some sense? I'm not expecting you to understand all the nuances — I don't even understand exactly all the nuances; I would need to open a book to find the cases where there's a difference. I think there are indeed quite some differences between these two inequalities, but you probably wouldn't easily be able to see them.
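To make the sum-vs-sup order concrete, here is a tiny exhaustive check on a toy function of my own choosing (not from the lecture): on {0,1}^3, which coordinate is sensitive depends on the point, so the refined proxy sup_x sum_i (sensitivity)^2 comes out strictly smaller than McDiarmid's sum_i sup_x (sensitivity)^2.

```python
import itertools

# Toy function on {0,1}^3 (my own example, not from the lecture): which
# coordinate matters depends on the point, so the two proxies differ.
def f(x):
    return x[1] if x[0] == 0 else x[2]

pts = list(itertools.product((0, 1), repeat=3))

def sens(x, i):
    # sensitivity of coordinate i at point x: max change from re-choosing x_i
    return max(abs(f(x) - f(x[:i] + (v,) + x[i + 1:])) for v in (0, 1))

# McDiarmid: per-coordinate sup first, then sum  ->  sum_i (sup_x sens)^2
mcdiarmid = sum(max(sens(x, i) for x in pts) ** 2 for i in range(3))

# Refined proxy: sum at a fixed point first, then sup  ->  sup_x sum_i sens^2
refined = max(sum(sens(x, i) ** 2 for i in range(3)) for x in pts)

print(mcdiarmid, refined)  # → 3 2
```

At every fixed point at most two coordinates are sensitive, so summing first and then taking the sup gives 2 rather than 3 — exactly the "only one coordinate is sensitive at a given point" effect described above.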
[01:07:13] Okay, so now let's answer this question about what happens if all the X_i's are unbounded. What happens if X_1, ..., X_n are unbounded? If they are unbounded — like Gaussian random variables — then even if you take f to be a sum, you wouldn't satisfy the bounded-differences condition, and you wouldn't satisfy this condition here either, in the improved case, because clearly there is a sup here. So even if f is the sum and the X_i are sub-Gaussian, this quantity would be infinity, because there is no absolute bound for any individual random variable. So that's the next question: how do we deal with the case where X_1, ..., X_n are not bounded? There are some existing results along this line. The first result is called the Poincaré inequality, which is one of the very beautiful results —
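A quick numerical illustration of the unboundedness issue (my own sketch, not from the lecture): with Gaussian draws, the largest coordinate magnitude seen keeps growing with the number of draws, so no fixed c_i can bound the per-coordinate difference, even for f equal to the plain sum.

```python
import numpy as np

# With X_i ~ N(0, 1), even f(x) = x_1 + ... + x_n violates bounded
# differences: changing one coordinate can move f by |x_i - x_i'|, which
# has no absolute bound.  The running max of |X| keeps growing with draws.
rng = np.random.default_rng(0)
xs = np.abs(rng.standard_normal(1_000_000))
print(xs[:1_000].max(), xs.max())  # the sup over draws does not level off
```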
[01:08:14] — beautiful not only as a concentration inequality but also for other reasons, not really relevant to this course. So this inequality says the following. Suppose X_1, ..., X_n are i.i.d. Gaussian with mean zero and variance one, and you have some function f; then you can look at the variance of this function. It doesn't prove that f(X) is sub-Gaussian — it only shows a bound on the variance, which is something necessary to have: if you don't have a bounded variance, you probably wouldn't be able to show sub-Gaussianity. The bound is: Var(f(X)) <= E[ ||grad f(X)||^2 ] — exactly as suggested before in the question: the gradient squared, with the expectation taken over the random variable X. So this is, in some sense, the ideal type of right-hand side you would hope for: the concentration of the random variable f(X) is controlled by how sensitive — how Lipschitz — the function f is.
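A Monte Carlo sanity check of the stated variance bound, with an illustrative smooth f of my own choosing (f(x) = sin(x_1 + ... + x_n), not from the lecture):

```python
import numpy as np

# Monte Carlo check of Var(f(X)) <= E[||grad f(X)||^2] for X ~ N(0, I_n),
# with the illustrative choice f(x) = sin(x_1 + ... + x_n), whose gradient
# is cos(x_1 + ... + x_n) * (1, ..., 1), so ||grad f||^2 = n cos^2(sum).
rng = np.random.default_rng(0)
n, m = 5, 200_000
s = rng.standard_normal((m, n)).sum(axis=1)

var_f = np.sin(s).var()                # left-hand side, approx 0.5 here
grad_sq = (n * np.cos(s) ** 2).mean()  # right-hand side, approx 2.5 here
print(var_f, grad_sq)
```

For this f the inequality holds with a wide margin; equality cases are essentially linear functions.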
[01:09:24] So this is idealistic, and basically the best kind of thing you can hope for. But the limitation here is that on the left-hand side you only control the variance; you don't control the tail explicitly. If you want to turn the variance bound into a tail bound, you have to use Chebyshev, and you only get a 1/t^2 bound. You can also deal with other kinds of Gaussian random variables — they don't have to have mean zero and variance one; that's easy. And the strongest thing here is the following stronger theorem, which can deal with the tail. So here, suppose f is L-Lipschitz with respect to the Euclidean distance —
[01:10:35] — which is saying that |f(x) - f(y)| <= L * ||x - y||_2 for every x and y in R^n. In some sense this is saying that the gradient of f is uniformly bounded: ||grad f(x)|| <= L everywhere. You can see this is different from the one above, because here you require the gradient at every point to be at most L, while above you only required the average gradient to be small. So here we make the stronger assumption that the function is globally Lipschitz, and then you can get a stronger bound on the tail. So now let X_1, ..., X_n be i.i.d. Gaussian, and you have the tail bound that we would like to have: P(|f(X) - E[f(X)]| >= t) <= 2 exp(-t^2 / (2 L^2)). So basically f(X) is L-sub-Gaussian.
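As a rough numerical illustration (my own choice of f, not from the lecture, with the sample mean standing in for E[f(X)]): the Euclidean norm is 1-Lipschitz, so its empirical deviation tails should sit below 2 exp(-t^2/2).

```python
import numpy as np

# f(x) = ||x||_2 is 1-Lipschitz in Euclidean distance, so the theorem gives
# P(|f(X) - E f(X)| >= t) <= 2 exp(-t^2 / 2) for X ~ N(0, I_n).
# (The sample mean stands in for E f(X) in this rough check.)
rng = np.random.default_rng(0)
n, m = 10, 100_000
norms = np.linalg.norm(rng.standard_normal((m, n)), axis=1)
dev = np.abs(norms - norms.mean())

tails = {t: float(np.mean(dev >= t)) for t in (1.0, 2.0, 3.0)}
bounds = {t: 2 * np.exp(-t * t / 2) for t in tails}
print(tails, bounds)  # empirical tails sit well below the bounds
```

Note the bound does not depend on n: the fluctuations of ||X|| stay O(1) even though E||X|| grows like sqrt(n).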
[01:11:57] But the L is the absolute bound on the gradient, not the expected gradient. So you can kind of see the flavor of all of these concentration inequalities: it really depends on when you take the sup and when you take the expectation — under different kinds of conditions you get different theorems of different strengths. Any questions? [Student asks, partly inaudible, whether one can also get bounds on higher moments.] Uh, I don't think I know the exact results off the top of my head. Could you get a higher moment from the one below? I think if you want higher moments, you have to assume something stronger — that's my hunch; for example, this one below will give you a higher-moment bound. So I'm not sure whether you can get a higher-moment bound under weaker conditions than this — I don't know. Also, I don't know too much about PDEs, so I could be missing something;
[01:13:24] I don't know everything — this is the only thing I know. Oh — yeah, but indeed it [the Poincaré inequality] has a lot of different applications, not only here. Okay. So — I have 15 minutes; oh, we have 10 minutes. It's a little challenging for me to give the full proof of McDiarmid's inequality in 10 minutes, but I'll try a little bit; if I can't give the full proof, I can give you a sketch. That's the last thing I was planning to do. As for all the inequalities above — these tail bounds for Gaussians — I think they are beyond the scope of this course; we're already doing a lot in the technical part, so for these, even when I use them, I would just invoke a theorem from a book.
[01:14:27] So you don't need to know the proof of McDiarmid's inequality either — I don't think you need to know the proof — but the proof is interesting to some extent, so it's probably worth showing. Let's try that in the next 10 minutes. So, we care about bounding something like this, and we have the bounded-differences condition. The high-level intuition is the following. f(X_1, ..., X_n) could be a very complicated function of X_1, ..., X_n, and somehow you still want to reduce it to a sum in some sense — but the reduction is not straightforward. The way you do it is the following: you define a sequence of random variables. Define Z_0 = E[f(X_1, ..., X_n)].
[01:15:28] Z_0 = E[f(X_1, ..., X_n)] — this is nothing, just a scalar, a constant. And then you define Z_1 to be E[f(X_1, ..., X_n) | X_1]. What does this mean? This is a function of X_1: Z_1 is a function of X_1, where you average out all the other X_i's. And you can also define Z_i, which is E[f(X_1, ..., X_n) | X_1, ..., X_i], the expectation conditioned on the first i random variables. This is a function of X_1, ..., X_i: given X_1, ..., X_i, it becomes a scalar, because all the other randomness got averaged out. So you can see that Z_0 doesn't have any randomness; Z_1 has a little randomness, because it's a function of the random variable X_1; Z_i has more and more randomness; and Z_n is finally what you care about.
[01:16:44] — the fully random case: Z_n = f(X_1, ..., X_n). And the important thing is that you care about Z_n - Z_0 — that is, f(X) minus its expectation — and you can decompose this into a sequence of terms, because luckily it is a telescoping sum: Z_n - Z_0 = sum_{i=1}^n (Z_i - Z_{i-1}). This is what I mean by reduction to a sum: now you have a sum of random variables, and you somehow think of them as independent in some sense. They are definitely not exactly independent, but you are going to reuse the proof that you used for summations — that's what I'm going to show. And if you look at this: Z_1 is a function of X_1, Z_2 is a function of X_1 and X_2, and so forth, and Z_n is a function of X_1, ..., X_n — it depends on all the random variables. Okay, so now let's see what we know about each of these Z_i and Z_i - Z_{i-1}.
[01:17:58] First of all, for every Z_i, if you take the expectation of Z_i, this is E[ E[f(X_1, ..., X_n) | X_1, ..., X_i] ]: inside you have a function of X_1, ..., X_i, and outside you average out the randomness of X_1, ..., X_i again. So this equals E[f(X_1, ..., X_n)] — this is called the law of total expectation: you take the expectation of the conditional expectation and you get the unconditional expectation. So this equals Z_0. And this means that E[Z_i - Z_{i-1}] = 0 — each of the random variables in this decomposition is mean zero. So basically, the intuition is: let's define D_i to be Z_i - Z_{i-1}.
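The construction Z_i = E[f | X_1, ..., X_i] can be checked exhaustively on a tiny example (my own choice, not from the lecture: f = max over three fair bits), verifying Z_n = f, the telescoping identity, and the law-of-total-expectation step with exact arithmetic:

```python
import itertools
from fractions import Fraction

# Toy Doob-martingale check: f(x1, x2, x3) = max(x1, x2, x3), with X_i
# i.i.d. uniform on {0, 1}; conditional expectations by exact enumeration.
def f(x):
    return max(x)

half = Fraction(1, 2)

def Z(prefix):
    # Z_i = E[f(X) | X_1..X_i = prefix]: average out the remaining coords
    k = 3 - len(prefix)
    return sum(Fraction(f(prefix + rest))
               for rest in itertools.product((0, 1), repeat=k)) * half ** k

Z0 = Z(())
for x in itertools.product((0, 1), repeat=3):
    zs = [Z(x[:i]) for i in range(4)]                 # Z_0, Z_1, Z_2, Z_3
    assert zs[3] == f(x)                              # Z_n is f itself
    assert sum(zs[i] - zs[i - 1] for i in range(1, 4)) == f(x) - Z0

# E[Z_i] = Z_0 for every i (law of total expectation)
for i in range(4):
    avg = sum(Z(x[:i]) for x in itertools.product((0, 1), repeat=3)) * half ** 3
    assert avg == Z0
print(Z0)  # → 7/8
```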
[01:19:21] What I'm going to do is, in some sense, bound the moment generating function of each of the D_i, and then — because the final quantity is the sum of the D_i — you can bound the moment generating function of the sum of the D_i. So let's work on each D_i first. I'm going to claim that Z_i - Z_{i-1} is always bounded by c_i, where the c_i is the constant in the bounded-differences condition of McDiarmid's inequality. How do I do that? Let me see whether I can simplify this a little bit, for the sake of time. Okay — let's only prove it for Z_1 - Z_0, in the interest of time. So if you look at Z_1: Z_1 = E[f(X_1, ..., X_n) | X_1].
[01:21:06] I guess you can replace the first coordinate by the sup over all its possible choices: Z_1 <= E[ sup_{x_1'} f(x_1', X_2, ..., X_n) | X_1 ]. After you do this, the quantity inside is not a function of X_1 anymore, so it doesn't matter whether you condition on X_1 or not: you literally just get E[ sup_{x_1'} f(x_1', X_2, ..., X_n) ]. And for the same reason, you also know that Z_1 >= E[ inf_{x_1'} f(x_1', X_2, ..., X_n) ]. So in some sense you have an upper bound and a lower bound for Z_1. I guess these two quantities are not exactly useful for the bound by themselves; what's really useful is this: if you look at Z_1 - Z_0, this is E[f(X_1, ..., X_n) | X_1] - E[f(X_1, ..., X_n)], and you can bound it, using what we did above, by E[ sup_{x_1'} f(x_1', X_2, ..., X_n) ] - E[f(X_1, ..., X_n)] — and then you can put the difference inside one expectation.
[01:23:11] I think it's slightly confusing when you really look at the math, but intuitively what you're saying is: the difference between Z_1 and Z_0 involves only one coordinate, and we know that if you change only that one coordinate, you cannot make much of a difference. That's what we know: for any x_1, x_2, ..., x_n, if you change only x_1, you don't change f by much — that's why Z_1 and Z_0 can't differ by much, because the only thing that's different is X_1. But okay, let me give the formal proof. On the other hand, you can prove the same kind of thing in the other direction: Z_1 - Z_0 is larger than E[ inf_{x_1'} f(x_1', X_2, ..., X_n) ] - E[f(X_1, ..., X_n)]. So basically I'm trying to say that the difference between Z_1 and Z_0 is upper- and lower-bounded by the extremal cases, where you pick x_1' in the worst case.
[01:24:13] And this means that if you define the upper bound and the lower bound — maybe, sorry, let's call the upper one B_1 and the lower one A_1 — then you have an upper bound and a lower bound on Z_1 - Z_0, and you can compute the gap: B_1 - A_1 = E[ sup_{x_1'} f(x_1', X_2, ..., X_n) - inf_{x_1'} f(x_1', X_2, ..., X_n) ], the extremal case. And this is exactly the c_i that we defined: if you change the input in the first coordinate, the maximum change is c_1, so this is <= c_1. So basically this is saying that Z_1 - Z_0 lies between A_1 and B_1, and B_1 - A_1 <= c_1 — the random variable Z_1 - Z_0 is bounded in a small interval. And similarly, we can show that Z_i - Z_{i-1} is bounded between some B_i and A_i (now conditionally on the first i - 1 variables), and B_i - A_i is also at most c_i.
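The interval argument for Z_1 - Z_0 can be verified exactly on the same kind of toy example (again my own choice, not from the lecture: f = max over three fair bits, where c_1 = 1):

```python
import itertools
from fractions import Fraction

# Exact check that Z_1 - Z_0 lies in [A_1, B_1] with B_1 - A_1 <= c_1,
# for f(x1, x2, x3) = max(x1, x2, x3), X_i i.i.d. uniform on {0, 1}.
def f(x):
    return max(x)

eighth = Fraction(1, 8)
pts = list(itertools.product((0, 1), repeat=3))

Ef = sum(Fraction(f(x)) for x in pts) * eighth                 # Z_0
B1 = sum(max(f((v,) + x[1:]) for v in (0, 1)) * eighth for x in pts) - Ef
A1 = sum(min(f((v,) + x[1:]) for v in (0, 1)) * eighth for x in pts) - Ef
c1 = max(abs(f(x) - f((1 - x[0],) + x[1:])) for x in pts)      # bounded diff.

for x0 in (0, 1):
    Z1 = sum(Fraction(f((x0,) + r))
             for r in itertools.product((0, 1), repeat=2)) * Fraction(1, 4)
    assert A1 <= Z1 - Ef <= B1
assert B1 - A1 <= c1
print(A1, B1, c1)  # → -1/8 1/8 1
```

Here the interval [A_1, B_1] has length 1/4, noticeably smaller than the worst-case c_1 = 1 — the interval bound is what Hoeffding's lemma will consume in the moment-generating-function step.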
than c i [01:25:37] and bi minus a i is also less than c i so so so recall that our final goal is D [01:25:40] so so so recall that our final goal is D A minus e0 which is the sum of c i minus [01:25:43] A minus e0 which is the sum of c i minus CM minus one [01:25:45] CM minus one and we have both approved proved that [01:25:48] and we have both approved proved that each of these random variable is [01:25:49] each of these random variable is somewhat bounded [01:25:50] somewhat bounded uh in some small interval and now we can [01:25:53] uh in some small interval and now we can use the moment generating function [01:25:56] use the moment generating function so what you do is you say you take [01:25:58] so what you do is you say you take expectation of [01:26:01] expectation of um [01:26:02] um Lambda [01:26:05] ZN minus zero [01:26:08] ZN minus zero and this is expectation [01:26:10] and this is expectation e of Lambda sum of z n minus c m s one [01:26:15] e of Lambda sum of z n minus c m s one so the first thing we have to do is to [01:26:17] so the first thing we have to do is to defectorize them in some way [01:26:20] defectorize them in some way right so how do we factorize them we [01:26:23] right so how do we factorize them we just use the the conditional [01:26:25] just use the the conditional uh [01:26:27] uh um [01:26:28] um we we kind of like do the chain in some [01:26:30] we we kind of like do the chain in some sense of the chain rule so what you do [01:26:32] sense of the chain rule so what you do is that your first condition down [01:26:35] is that your first condition down uh [01:26:37] uh X1 up to X N minus one so then you have [01:26:40] X1 up to X N minus one so then you have this expectation e [01:26:43] this expectation e Lambda z n minus C A minus one [01:26:46] Lambda z n minus C A minus one conditional X1 up to X N minus one [01:26:51] conditional X1 up to X N minus one and and when condition on it you get [01:26:53] and and when condition on 
[01:26:55] When you condition on X1, …, X(n−1), everything except the last increment is a function of X1, …, X(n−1) only, so it pulls out of the inner conditional expectation — only the factor exp(λ(Zn − Z(n−1))) stays inside. [01:27:28] And because Zn − Z(n−1) is bounded in a strong sense — for every possible choice of X1, …, X(n−1) it lies in an interval of length at most cn, an absolute bound — we know that the inner term satisfies E[exp(λ(Zn − Z(n−1))) | X1, …, X(n−1)] ≤ exp(λ² cn² / 8). [01:28:08] This is because a bounded random variable is sub-Gaussian; you can verify this in various ways — actually this will show up in the homework, it's one of the homework questions: if you have a bounded random variable, it's sub-Gaussian, and you can bound its moment generating function. [01:28:29] So you can replace this last term by the absolute quantity exp(λ² cn² / 8), times the same expression over the remaining n − 1 increments, and then you peel off the next term again and again — you do this iteratively. I guess we're already running out of time, so we got this: E[exp(λ(Zn − Z0))] ≤ exp(λ · Σi ci² / 8). (The constant is actually 8, if you really do it carefully.) [01:29:12] So I'll just sketch the conclusion: this means that f − E[f], which equals Σi (Zi − Z(i−1)), is sub-Gaussian with variance proxy Σi ci² / 4. Okay, so that's the end of the proof.
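The key step above — a random variable supported in an interval of length c is sub-Gaussian, with MGF at most exp(λ²c²/8) (Hoeffding's lemma, the homework question mentioned) — can be checked exactly for the extreme two-point distribution, where the MGF is largest. This check is my illustration, not part of the lecture:

```python
import math

def hoeffding_lemma_gap(lam, c=1.0):
    """Compare E[exp(lam*(X - E[X]))] against exp(lam^2 * c^2 / 8) for the
    extreme case: X uniform on {-c/2, +c/2}, an interval of length c.
    The exact MGF of the centered X is cosh(lam * c / 2)."""
    mgf = 0.5 * math.exp(lam * c / 2) + 0.5 * math.exp(-lam * c / 2)
    bound = math.exp(lam ** 2 * c ** 2 / 8)
    return mgf, bound
```

For example, with lam = 2 and c = 1 the MGF is cosh(1) ≈ 1.543 while the bound is exp(0.5) ≈ 1.649 — the bound holds with equality of the leading quadratic term, which is why the constant 8 cannot be improved.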
[01:29:37] But this proof is optional — it's just that we had more time, so that's why I showed the proof. Okay, any questions? [01:30:08] [Question from audience] Oh, you mean this one — from here to here? Yeah, this is just a cheap step. Maybe technically what I should write, if you want to do it in two steps, is: first you apply the law of total expectation — you condition on X1 up to X(n−1) — and then you find that the other factor is a constant when you condition on X1 up to X(n−1), so that's why you can move it outside the inner expectation. There's nothing deep there. [01:31:04] Okay, sounds good. Cool — I guess see you next Monday. ================================================================================ LECTURE 005
================================================================================ Stanford CS229M - Lecture 5: Rademacher complexity, empirical Rademacher complexity Source: https://www.youtube.com/watch?v=tkJd2B98hII --- Transcript [00:00:05] So I guess — yeah, sorry for the slight delay; I couldn't find water somehow. Anyway, okay, let's get started. [00:00:17] Last time we talked about concentration inequalities, which were preparation for what we need today, and today we're going back to uniform convergence. Recall that our goal was uniform convergence, and we have proved some results. The first thing we had is that the excess risk is bounded via uniform convergence — we basically care about something like the sup over h of the difference |L(h) − L̂(h)|. [00:01:19] We have shown that the excess risk L(ĥ) − L(h*) is bounded by something like |L̂(h*) − L(h*)| plus sup over h in the class H of |L(h) − L̂(h)|, and we have used this to get uniform convergence results. For example, we have shown that for a finite hypothesis class, L(ĥ) − L(h*) is bounded by roughly √(log|H| / n), ignoring other log factors — technically we've shown the sup over h in H, which can be turned into an excess-risk bound. [00:02:27] And for a hypothesis class parametrized by p parameters, we got something like sup over θ in Θ of |L(θ) − L̂(θ)| ≤ Õ(√(p / n)) — this is what we did two lectures ago. [00:03:12] You can think of these quantities — I guess we have discussed this briefly — as complexity measures of the hypothesis class, and this is generally the type of result we're going to get: something that decreases as n goes to infinity, times another factor that measures the hypothesis class. So basically, eventually you'll say that if n is bigger than the complexity of the hypothesis class, then you get a nontrivial error bound. [00:03:48] But these two bounds have a limitation.
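As a quick sanity check on the magnitudes of the two recap bounds — a sketch ignoring constants and log factors, with illustrative numbers that are not from the lecture:

```python
import math

def finite_class_bound(num_hypotheses, n):
    # sup_h |L(h) - Lhat(h)| ~ sqrt(log|H| / n), up to constants/log factors
    return math.sqrt(math.log(num_hypotheses) / n)

def p_param_bound(p, n):
    # ~ sqrt(p / n) for a class with p real parameters, up to log factors
    return math.sqrt(p / n)

# Illustration at modern scale: n = 1e6 examples, p = 1e8 parameters
# gives sqrt(p/n) = 10 -- a vacuous bound, which is exactly the
# limitation discussed next.
```

The finite-class bound behaves much better because |H| enters only logarithmically: even |H| = 2^1000 with n = 10^6 gives a bound around 0.03.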
[00:04:05] The basic limitation of this p-parameter bound is that it requires n to be much bigger than p for the bound to be small. That's not necessarily feasible in many cases, and it's also not really what happens in reality: in reality, n smaller than p is quite often the case. [00:04:31] It's not always the case, but it's pretty common — more often in the modern situation where you have a so-called over-parametrized network. I'll define over-parametrization more carefully in later lectures, but basically, in the modern setting with a deep network, ImageNet has about a million examples, while your parameter count could be something like 10 million, maybe 100 million, sometimes billions. [00:05:00] Of course it's not always like this — sometimes you still have n bigger than p, depending on the situation — but generally people have found it useful to make the network, the p, very large. So it's definitely not the case that you want to require n to be much, much bigger than p. [00:05:18] The reason, in some sense, why this bound doesn't capture what happens in reality is that it's not precise enough. Not precise enough in the sense that the complexity measure is too worst-case: it measures the complexity of all possible models with p parameters, but you are not specializing to some special kind of models among all the models with p parameters. For example — especially in the more classical language — you cannot distinguish a sparse parameter from a dense parameter; you cannot distinguish, say, a parameter class where θ has bounded L1 norm from a hypothesis class where θ has bounded L2 norm. [00:06:14] In either of these cases only the parameter count p shows up in your bound, and not the control of the norm of the parameters. So that's why we are looking for something more precise — something that need not depend on p, but depends on some more accurate characterization of the complexity.
[00:06:47] So today and the next few lectures, in some sense, our goal is to prove something like L(θ̂) − L̂(θ̂) ≤ √(Comp(Θ) / n), where this complexity measure Comp can be more fine-grained than just the single number p. And this complexity measure could possibly also depend on the distribution — Comp may depend even on the distribution P, where P is the distribution of your data. So maybe for some distribution P the complexity is smaller, and for some other distribution P the complexity is higher; we're trying to capture the intrinsic difficulty of the learning problem. [00:07:45] But of course, this is somewhat subjective, because it depends a little bit on what you believe is happening in real life: if you believe the real parameter is sparse, then you probably should have a complexity measure that captures the L1 norm of the parameters; if you believe the ground-truth parameter has other properties, then you probably should use a different complexity measure. [00:08:17] So this is the general goal, and the practical way of thinking about it is that you can think of the right-hand side as something that motivates your regularization. The practical implication is that you can use this Comp(θ) as a regularizer: if you just optimize your model, you're going to find some parameter θ, and especially if you don't have enough data, you may have many global minima in the search space. But if you know that a certain complexity measure makes the bound better, then you can actively find models with small complexity — you add this complexity measure, multiplied by λ, to your training loss to get the regularized loss, so that you are more likely to find the small-complexity model, which generalizes better. [00:09:25] Okay, so I guess that's the basic idea. [00:09:35] So what we're going to do today: the first part is we're going to talk about — recall from a week ago — uniform convergence. That's our tool: you want to prove that L(h) − L̂(h) is small for all possible h. In the first part of the lecture we're going to bound the expectation of this sup, as a kind of weaker goal, and in the second part of the lecture, if we have time, we're going to bound it with high probability, without the expectation in front. And the randomness here comes from the data — the training data.
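The regularized-loss idea from a moment ago can be sketched in a few lines. Here the complexity measure is taken to be the L1 norm — one of the lecture's examples of a belief-dependent choice — and the λ and toy parameter vectors are my illustration, not from the lecture:

```python
def l1_norm(theta):
    return sum(abs(t) for t in theta)

def regularized_loss(train_loss, theta, lam):
    """Training loss plus lam times a complexity measure of theta;
    here complexity = L1 norm, matching a 'sparse ground truth' belief."""
    return train_loss + lam * l1_norm(theta)

# Two parameter vectors with identical training loss: the regularizer
# breaks the tie toward the sparser, lower-complexity one, which the
# bound says should generalize better.
dense = [0.5] * 8                        # L1 norm 4.0
sparse = [2.0, 0, 0, 0, 0, 0, 0, 0]      # L1 norm 2.0
```

Swapping in a squared L2 norm for `l1_norm` gives the usual weight decay instead; which one is "right" depends on what you believe about the ground-truth parameter, as discussed above.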
[00:10:31] Because L̂ depends on the training data, and the training data are drawn i.i.d. at random, the quantity inside the expectation is a random variable that depends on the randomness of the training data, and you take the expectation of this random variable — that's the goal. [00:10:39] We're going to upper-bound it by some other quantities that we think are more intrinsic and convenient for us to use. [00:10:50] I guess I need to start with some definitions. [00:10:56] This definition is called Rademacher complexity, which is the main object we're going to focus on in this lecture. The definition: let F be a family of real-valued functions. So far, in this definition, F is just an abstract family of functions, and we're going to define a complexity for this family F. Later we'll say which functions we actually care about — we care about the family of the losses — but for now F is just an abstract family of functions, and we define a complexity measure for it. [00:11:45] Say this family of functions maps some input space, call it Z, to the real numbers, and let P be a distribution over this input space Z. Then the so-called average Rademacher complexity of F — often you don't necessarily have to say "average" Rademacher complexity, but technically that's the name — is defined as follows. [00:12:29] This is R_n(F), where n indicates how many examples you have — how many training examples, how many empirical examples. R_n(F) is defined by: you first draw some examples z_1, …, z_n — you can think of these as training examples — i.i.d. from the distribution P, and then you draw some so-called Rademacher random variables. Recall that Rademacher random variables are binary, plus-one/minus-one uniform: you draw σ_1, …, σ_n i.i.d. uniformly from {−1, +1}. [00:13:12] And then you look at this quantity: the sup over the function class F of the average (1/n) Σ_{i=1}^n σ_i f(z_i) — and R_n(F) is the expectation of that sup. So this sounds like a pretty complicated definition, but let me try to interpret it a little bit. [00:13:42] First, just think about what's inside the sup. This is the correlation — the 1/n is just a normalization, which is not important — between the outputs of f, namely f(z_1), …, f(z_n), and the random variables σ_1, …, σ_n. Of course, if you just look at this, the correlation should typically be very close to zero, because f shouldn't correlate with independent random signs.
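The definition above can be sketched as a Monte Carlo estimate for a class where the sup has a closed form — linear functions with L2-bounded weights. This example class is my illustration, not one the lecture has introduced yet; for it, sup over ||w||₂ ≤ B of (1/n) Σ σ_i ⟨w, z_i⟩ equals (B/n)·||Σ σ_i z_i||₂ (Cauchy–Schwarz, with equality at w aligned to the sum):

```python
import math
import random

def empirical_rademacher_linear(zs, B, n_draws=2000, seed=0):
    """Estimate E_sigma[ sup_{||w||_2 <= B} (1/n) sum_i sigma_i <w, z_i> ]
    on fixed points z_1..z_n, using the closed form
    sup = (B/n) * || sum_i sigma_i * z_i ||_2."""
    rng = random.Random(seed)
    n, d = len(zs), len(zs[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        s = [sum(sigma[i] * zs[i][j] for i in range(n)) for j in range(d)]
        total += (B / n) * math.sqrt(sum(x * x for x in s))
    return total / n_draws
```

For orthonormal z's the estimate is exactly B/√n regardless of the σ draws, matching the 1/√n decay the bounds above aim for, with B playing the role of the fine-grained, norm-based complexity.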
variables right but there is a soup [00:14:33] there is a soup right so you are first drawing the sigma [00:14:35] right so you are first drawing the sigma 1 up to Sigma n and then you take a soup [00:14:37] 1 up to Sigma n and then you take a soup over f so basically you are saying that [00:14:39] over f so basically you are saying that what's the maximal so basically this [00:14:41] what's the maximal so basically this whole thing [00:14:43] whole thing is the maximum correlation [00:14:46] is the maximum correlation between [00:14:48] between F the output of F and sigma 1 up to [00:14:51] F the output of F and sigma 1 up to Sigma n after you draw Sigma X right so [00:14:53] Sigma n after you draw Sigma X right so you can you first just take my eye and [00:14:54] you can you first just take my eye and then you try to find something that [00:14:56] then you try to find something that correlates with Sigma ice but you try to [00:14:58] correlates with Sigma ice but you try to find the app such that it can also put [00:15:00] find the app such that it can also put something that looks like the random [00:15:01] something that looks like the random things that you have to [00:15:03] things that you have to you have done [00:15:04] you have done so so in some sense if you have a high [00:15:08] so so in some sense if you have a high complexity [00:15:12] means that [00:15:14] means that for most or for almost all for most of [00:15:17] for most or for almost all for most of binary patterns [00:15:19] binary patterns where binary patterns just means that [00:15:21] where binary patterns just means that you have the sigma 1 up to Sigma n there [00:15:24] you have the sigma 1 up to Sigma n there exists f [00:15:25] exists f in this hypo's class such that [00:15:28] in this hypo's class such that the output on this family [00:15:32] the output on this family is [00:15:34] is similar to [00:15:36] similar to or similar or correlated [00:15:40] with [00:15:44] a random 
[00:15:44] For any random pattern you draw, you can find, post hoc, a function f in this class such that its output on z_1 up to z_n looks like the random pattern you have drawn. So in some sense this measures how diverse the outputs of this family of functions can be. If the family F can map z_1 up to z_n to any possible pattern, then the Rademacher complexity is the largest: for example, if every binary pattern can be output by some function in F on the z_i's, then you get the maximal Rademacher complexity, intuitively. [00:16:37] Any questions so far? [00:16:47] The question is: is this necessarily a non-increasing function of n? I think it should be, but I don't think it's trivial to see why it's non-increasing.
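To make the definition concrete, here is a small numerical sketch (not from the lecture; the threshold class, the sample points, and the function name `empirical_rademacher` are all invented for illustration): a Monte Carlo estimate of the empirical Rademacher complexity E_σ[sup_{f∈F} (1/n) Σ σ_i f(z_i)] for a finite function class, where the sup is an exact max over the class.

```python
import numpy as np

def empirical_rademacher(outputs, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ].

    outputs: array of shape (num_functions, n) whose rows are the output
             vectors (f(z_1), ..., f(z_n)) for each f in the finite class F.
    """
    rng = np.random.default_rng(seed)
    _, n = outputs.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # sigma_i i.i.d. uniform on {-1, +1}
        total += np.max(outputs @ sigma) / n     # exact sup over the finite class
    return total / num_draws

# A tiny illustrative class: threshold classifiers on 5 fixed sample points.
z = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
thresholds = np.linspace(0.0, 1.0, 11)
outputs = np.array([np.where(z > t, 1.0, -1.0) for t in thresholds])

print(round(empirical_rademacher(outputs), 3))
```

Since every f here outputs values in {−1, +1}, the estimate always lands in (0, 1]; rescaling the class rescales the estimate, matching the scale-sensitivity remark below.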
[00:17:06] At least off the top of my head I don't see a super simple argument, but I think you can prove it without too much effort. Roughly speaking, because you take the sup, you can prove it by swapping the sup with the expectation over one of the examples, say the last one, and then you roughly recover the definition of the (n−1)-sample version of the Rademacher complexity. But maybe I shouldn't do it on the fly, just in case I miss something and get stuck; roughly speaking, I think that should work. [00:17:51] Any other questions? By the way, I never get any questions from Zoom, so you should feel free to speak up: just unmute yourself and ask. Sometimes I'm not even sure whether the Zoom is working.
[00:18:12] That's a great question: f is not required to map to plus one or minus one, and it's true that this quantity can be unbounded. It is actually sensitive to the scale of f: if you scale f by a factor of two, then you get two times the Rademacher complexity, and this is actually somewhat useful in certain cases, which we will probably talk about later. [00:18:46] Okay, cool. Now let's see why we care about this Rademacher complexity. The reason is the following theorem. Suppose you do this hypothetical experiment: you draw n examples from the distribution p, and then you look at this quantity: the average of f(z_i) for i from 1 to n, minus the expectation of f(z).
[00:19:33] This is the kind of quantity we dealt with in the last lecture, on concentration: how much you deviate from your mean. But you take a sup here, because you sometimes care about the maximum possible deviation, post hoc, after you draw the examples. The theorem says that this quantity, in expectation, is bounded by two times the Rademacher complexity R_n(F). [00:19:56] I guess to appreciate what the theorem is really doing, it's time to say exactly what kind of F we care about. So far F is an abstract thing, but now let's instantiate it. Take F to be the family of functions that maps z, taken to be a pair (x, y) of input and output, to the loss ℓ((x, y), h), for any h in the hypothesis class. Basically, this is the family of losses.
[00:20:42] Every model is a function, and given a model h, you get a loss function defined by that model: basically, the composition of the model with the two-argument loss ℓ. Together you get a map from a data point to the loss on that data point, but you can vary which model you care about, so you get a family of losses. In some sense it's a slight extension of the family of models, but here it's about the losses. [00:21:23] Now take F to be this family, and you can see that the left-hand side of the theorem is exactly what we are trying to bound, just because f(z_i) is the loss ℓ((x_i, y_i), h).
[00:21:50] So the empirical average (1/n) Σ f(z_i) is just (1/n) Σ ℓ((x_i, y_i), h), which is the empirical loss L̂(h) of the hypothesis h; and the expectation of f(z) is the expectation of the loss ℓ((x, y), h), where (x, y) is drawn from the distribution p, which is the population loss L(h). So the left-hand side of the theorem is really just the sup over h of L̂(h) − L(h), with the expectation taken over the randomness of the data. That's the weaker version of uniform convergence that we outlined at the beginning of the lecture, and you can bound it by the Rademacher complexity of this function class F, the family of losses.
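Written out (my reconstruction of the board content from the definitions in the lecture; R_n here averages over both the data and the σ's, as stated later):

```latex
\mathbb{E}_{z_1,\dots,z_n \sim P}\!\left[\, \sup_{f \in F}\,
  \left( \frac{1}{n}\sum_{i=1}^{n} f(z_i) \;-\; \mathbb{E}_{z \sim P}[f(z)] \right) \right]
\;\le\; 2\,R_n(F),
\qquad
F = \bigl\{\, z=(x,y) \,\mapsto\, \ell((x,y),h) \;:\; h \in H \,\bigr\}.

% With this choice of F, the left-hand side equals
\mathbb{E}\!\left[\, \sup_{h \in H}\, \bigl(\hat{L}(h) - L(h)\bigr) \,\right],
\quad
\hat{L}(h) = \frac{1}{n}\sum_{i=1}^{n} \ell((x_i,y_i),h),
\quad
L(h) = \mathbb{E}_{(x,y)\sim P}\bigl[\ell((x,y),h)\bigr].
```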
[00:23:19] So basically the theorem is saying that the generalization error, technically the expectation of the generalization error, is at most the Rademacher complexity of F. [00:23:44] There was a question: there's no absolute value here? Yes, there's no absolute value; that's a great question. It becomes a little trickier if you add the absolute value: first, you need a slightly different proof, and second, you get a different constant, probably four instead of two. The cleanest way is not to use the absolute value in this theorem, and to apply it at the outer layer instead. Actually, technically you don't even need the absolute value anywhere, because eventually you only care about one side of the bound when you do the generalization argument.
[00:24:22] So technically we don't need the absolute value anywhere. [00:24:36] Now, if you really think about R_n(F) in this context, for this particular F, what does it mean? It means: how well can the family of losses on the data correlate with a random pattern? This still sounds not super intuitive, so let's simplify further in a special case. [00:25:33] Suppose you have binary classification: say your label space Y is {+1, −1}, and ℓ is the zero-one loss, so ℓ((x, y), h) is the indicator that h(x) ≠ y: if they are not equal you have loss one, otherwise you have loss zero.
[00:26:13] In this case we can interpret this a little more. First, we rewrite the indicator in the form (1/2)(1 − y·h(x)), assuming h(x) is also in {+1, −1}. What I'm doing here is instantiating the bound in a special case so that you can interpret the Rademacher complexity more intuitively, and this identity is also useful by itself. When both h(x) and y are in {+1, −1}, the indicator that they are different can be written this way: if y and h(x) are different, then y·h(x) = −1 and the whole expression is one; and if y and h(x) are the same, then y·h(x) = 1 and the expression is zero.
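This rewriting of the zero-one loss can be checked exhaustively over the four sign combinations (a quick sanity check, not from the lecture):

```python
# Verify: for y, h(x) in {+1, -1}, the indicator 1{h(x) != y}
# equals (1 - y * h(x)) / 2.
for y in (+1, -1):
    for hx in (+1, -1):
        indicator = 1 if hx != y else 0
        linear_form = (1 - y * hx) / 2
        assert indicator == linear_form, (y, hx)
print("identity holds for all four sign combinations")
```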
[00:27:19] You can just verify it. The reason we do this is that it makes the loss linear in y and h(x). Then look at the Rademacher complexity: R_n(F) is the expectation of the sup over h of (1/n) Σ σ_i ℓ((x_i, y_i), h). In the definition there are two expectations, but here I merge the two into one, so the randomness comes from both the data and the Rademacher patterns. [00:28:11] Now plug in the formula (1/2)(1 − y_i h(x_i)) and do a very simple rearrangement: you get the sup over h of −(1/n) Σ (1/2) σ_i y_i h(x_i), plus (1/n) Σ (1/2) σ_i. This second quantity sits inside the sup, but it's actually a constant that doesn't depend on h, so you can pull it outside of the sup.
[00:29:04] Because the sum of the σ_i's is a constant with respect to h, you can switch the expectation with the sum and get the expectation of the sup of the first term, plus the expectation of (1/(2n)) Σ σ_i. [00:29:39] This second term is zero, because the expectation of a Rademacher variable is zero, so we're only left with the first quantity. And if you look at the first quantity, you realize that −y_i σ_i has the same distribution as σ_i, no matter what y_i is: whether y_i is +1 or −1, it has the exact same distribution, because you are only randomly flipping the sign.
[00:30:32] So that means you can replace −y_i σ_i by σ_i itself without changing the expectation. Since I saw some confusion, the easiest way to check this is to define σ'_i to be −y_i σ_i; then you get the sup over h of (1/n) Σ σ'_i h(x_i). But σ'_i is still uniform over {+1, −1} and independent across i, so σ'_i has the same distribution as σ_i, and you can write the expression with σ_i again. [00:31:20] Okay, so what we have achieved here seems to be a strictly simpler quantity than before. Why? Because this is basically the Rademacher complexity of the hypothesis class H.
[00:31:38] Before, we were talking about the Rademacher complexity of the family of losses; now we are talking about exactly the Rademacher complexity of the hypothesis class H. [00:31:51] I think I'm missing something; I'm missing a half here. Where did the half go? Yeah, I think I lost the half, sorry. Oh, I have the half in my notes; I just forgot to copy it. So this is one half times the Rademacher complexity of H. What we've achieved is that the Rademacher complexity of F, in this special case of binary classification with zero-one loss, is equal to one half times the Rademacher complexity of the hypothesis class. That's a slightly simpler way of thinking about it, because what this is basically saying is: how well can H memorize random labels?
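Putting the steps together (my reconstruction of the board derivation, with the recovered factor of one half):

```latex
R_n(F)
= \mathbb{E}\!\left[\sup_{h\in H}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i\cdot\tfrac12\bigl(1 - y_i\,h(x_i)\bigr)\right]
= \mathbb{E}\!\left[\sup_{h\in H}\ \frac{1}{n}\sum_{i=1}^{n}\tfrac12\bigl(-\sigma_i y_i\bigr)\,h(x_i)\right]
  + \mathbb{E}\!\left[\frac{1}{2n}\sum_{i=1}^{n}\sigma_i\right]
% the second expectation vanishes since E[sigma_i] = 0
= \frac12\,\mathbb{E}\!\left[\sup_{h\in H}\ \frac{1}{n}\sum_{i=1}^{n}\sigma_i'\,h(x_i)\right]
= \frac12\,R_n(H),
\qquad \sigma_i' := -y_i\,\sigma_i \overset{d}{=} \sigma_i .
```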
[00:32:49] You can think of σ_1 up to σ_n as random labels, and R_n(H) is big when there exists an h in H such that h(x_i) = σ_i for all i; this is the best situation, with the strongest correlation. So basically, if you can memorize all the random labels with some hypothesis in your hypothesis class, your Rademacher complexity is the biggest, and that gives you the worst generalization bound. And vice versa: if you cannot memorize, you get a better generalization bound. [00:33:57] I see, okay, that's a good question; let me repeat it. The question is: σ'_i is −y_i σ_i (there's a minus there, but it doesn't matter), and y_i itself is a random variable, so can we still claim that σ'_i has the same distribution as σ_i? That's indeed a good question.
[00:34:20] Technically, I think what you should do is the following. If you are really careful, there are two sources of randomness: one from the (x_i, y_i)'s and one from the σ_i's. So you first condition on the (x_i, y_i)'s, and then you look only at the randomness of the σ_i's. After conditioning, it is absolutely clear: for any fixed, deterministic choice of y_i, σ_i and σ'_i have the same distribution. So you make the replacement inside the conditional expectation, y_i disappears from the formula, and then you don't have to worry about the outer expectation. [00:35:20] Cool, sounds good.
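The memorization picture can be sketched numerically (illustration only, not from the lecture; the class construction and the function name `empirical_rademacher_H` are my own): if H realizes every sign pattern on the n points, the empirical Rademacher complexity of H is exactly 1, while a tiny class stays far below that.

```python
import itertools
import numpy as np

def empirical_rademacher_H(predictions, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ]
    for a finite hypothesis class given by its prediction vectors in {-1,+1}."""
    rng = np.random.default_rng(seed)
    _, n = predictions.shape
    draws = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # for each sigma draw, pick the best-correlated hypothesis (exact sup)
    return float(np.mean(np.max(draws @ predictions.T, axis=1)) / n)

n = 8
# A class realizing every sign pattern on the n points: it memorizes any
# "random labels" sigma, so sup_h (1/n) sum_i sigma_i h(x_i) = 1 for every draw.
all_patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
# A tiny class with only the two constant hypotheses: much lower complexity.
constants = np.array([[1.0] * n, [-1.0] * n])

print(empirical_rademacher_H(all_patterns))  # exactly 1.0
print(empirical_rademacher_H(constants))
```

The fully-memorizing class hits the maximal value 1, matching the "worst generalization bound" discussion; the two-constant class scores roughly E|Σσ_i|/n, well below 1.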
[00:35:28] Okay, so the take-home message here is that the Rademacher complexity of F is similar to the Rademacher complexity of the model class, and the Rademacher complexity of the model class is basically saying how well you can memorize random labels. But there is a small caveat: this relationship is not always true. It is exactly true for binary classification with zero-one loss, but it's not true for, for example, some other loss functions. I think the intuition is still largely correct, but you cannot take it literally or apply it religiously in every situation, and in some cases there could be confusion, because there are cases where the two are mismatched, especially if your loss function does something different, for example changing a binary label into a real number.
[00:36:27] Or the loss function has other properties, for example it is nonlinear, say the exponential loss. Actually, in some extreme cases in the past, some papers have misinterpreted this. I guess I'm just giving a warning: don't apply this every time without thinking about it. The intuition is roughly true, but it's not exactly true at all times. There will be a place where I mention this again, in some of the later lectures. [00:37:19] By the way, what we're going to do next is prove the theorem.
for what we will do next lecture. So in this lecture we are dealing with this abstract measure, the Rademacher complexity, right, and you may wonder, probably some of you are wondering, why the Rademacher complexity is something meaningful, something that is useful, right? So we don't answer that today; we're going to answer that in the next few lectures. [00:37:50] So today we are just introducing this Rademacher complexity and saying it bounds the uniform convergence, and the Rademacher complexity is something intuitive, I hope you found, right? It's talking about how well you can memorize labels, right? So it at least makes sense. And in the next few lectures we are going to instantiate this to more concrete models, where you can bound the Rademacher complexity by something more concrete, in the next two lectures. [00:38:21] Oh, did somebody ask
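[Editor's note: the "how well you can memorize labels" intuition can be checked numerically. Below is a minimal sketch, not from the lecture; the finite hypothesis class, sample size, and all constants are illustrative choices. It Monte Carlo estimates the empirical Rademacher complexity E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ] for a small finite class:]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite hypothesis class: each hypothesis is just its vector of
# predictions on the n fixed sample points, with values in [-1, 1].
n = 50                                    # the course reserves n for sample size
F = rng.uniform(-1.0, 1.0, size=(8, n))   # 8 hypotheses, one per row

def empirical_rademacher(F, num_draws=20000, rng=rng):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ]."""
    n = F.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher signs
        total += np.max(F @ sigma) / n           # sup over the finite class
    return total / num_draws

R_class = empirical_rademacher(F)        # richer class: noticeably positive
R_single = empirical_rademacher(F[:1])   # a single hypothesis: roughly zero
```

A class with several hypotheses can correlate better with random sign patterns, i.e. "memorize" them, so its complexity is larger; a singleton class cannot adapt to the signs at all, so its complexity is near zero.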
a question? I didn't hear. [00:38:29] Uh, yeah, I think a couple of people chimed in. You answered my question in the meantime, but someone else might have a question still. [00:38:38] Yeah, sorry, I forgot to turn my volume on. Yeah, please, please ask questions. Now it's working, okay, thank you. [00:38:51] Oh, actually there is a question: what is the connection between the Rademacher complexity and the degrees of freedom? Right, I think, I assume that by the degrees of freedom you mean the number of parameters, right? So I guess that's kind of what we motivated in the beginning: using this Rademacher complexity we will be able to prove more precise bounds [than ones that just count] parameters. [00:39:23] So probably so far you haven't seen that; I don't expect you to see that yet. But in the next lecture we're going to see you can prove better bounds that depend on something
more fine-grained than the number of parameters. I hope that answers the question. Please feel free to just unmute yourself and ask any follow-ups. [00:39:45] Okay, I guess a conceptual question: how do you generally think about the distinction between the hypothesis, uh, the family of hypotheses, versus the family of losses over them? To me, because they have the same cardinality, right, they seem like a direct map between one and the other. How do you distinguish, I guess in your mind, between those two? How do you think about them? [00:40:07] Yeah, that's a great question. So in my mind they are very similar, except that, um, I think this will be a little more explicit in the next lecture, or maybe two lectures later. Except that when you talk about the models, the models oftentimes output a real number. So for example, if you think about logistic regression, the model
outputs the logit, which could be anywhere on the real line, and then you turn that into a probability, and then use that probability to compute a loss. And the loss becomes something, first of all, non-negative. And oftentimes the loss is, you know, reasonably between zero and one. The logistic loss is not between zero and one, but I think the most interesting regime is when it's somewhat small, right, so between zero and one. And if you care about the classification loss, then it's literally between zero and one. [00:41:01] So the loss function sometimes has a scale, in some sense: it's something of order one. But your model could sometimes be outputting some bigger numbers. [00:41:15] So there is a conversion there, which will be made more explicit in future lectures. But beyond that, I typically don't
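[Editor's note: the point about scales can be made concrete. A small sketch, not from the lecture; the ±1 label convention and the specific logit values are illustrative assumptions:]

```python
import numpy as np

def logistic_loss(logit, y):
    """Logistic loss for a label y in {-1, +1}: log(1 + exp(-y * logit)).
    logaddexp keeps this numerically stable for large logits."""
    return np.logaddexp(0.0, -y * logit)

def zero_one_loss(logit, y):
    """Classification loss: 1 if sign(logit) disagrees with y, else 0."""
    return float(np.sign(logit) != y)

# The model's output (the logit) can be anywhere on the real line,
# but the losses live on different scales:
confident_wrong = logistic_loss(-10.0, +1)   # roughly 10: not bounded by 1
confident_right = logistic_loss(+10.0, +1)   # tiny: the "interesting" small regime
borderline = logistic_loss(0.0, +1)          # log 2, of order one
```

The zero-one loss is literally in {0, 1}, while the logistic loss is nonnegative but unbounded for confidently wrong predictions, which is exactly the scale mismatch the lecturer describes.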
distinguish them very much. [00:41:29] Yeah, I thought it was interesting that in your example, at least for binary classification, the complexity of the loss family was half of the complexity of the model family. Is it common that the complexity goes down when you compose with a loss function? [00:41:51] Um, I think it's common that they are related; we will see that in many cases they can be related. But I wouldn't read too much into that constant half, because the half does depend on how you define your labels. For example, if your labels are 0/1, I think you wouldn't see the half. So there is some small artifact there, right, so the constant doesn't really matter. [00:42:21] Okay, cool. Okay, let's continue. So we're going to prove this, and the proof is called the symmetrization technique. And this is a technique that can be used in
many other cases, not necessarily in this course, but in other areas of probability, let's say. [00:42:43] So, the symmetrization technique. I think it probably comes from that kind of probability literature in the first place. So the technique is: let's write down what we care about. [00:43:05] What we care about is this sup. Let me not take the expectation for now, just so that it's a little bit cleaner; we will take the expectation in a bit. [00:43:16] So this is not symmetric, in some sense, because you have this subtraction here, and these two terms don't look the same. That's what I mean by not symmetric. So there's a way to make them somehow more symmetric. What you do is: for now, let's say we fix Z1 up to Zn, and we let Z1' up to Zn' be a different draw,
another draw from the distribution P, i.i.d. So you draw a fresh sequence of i.i.d. copies. [00:44:02] And then what you can do is you can rewrite this second term, the expectation E[f], using the Zi'. Just because, by definition, all the Zi' have the same distribution P, the expectation of f is really the same as the expectation of the average: E[f] = E[(1/n) sum_i f(Zi')], because each of these terms has expectation equal to E[f], and you average them, so you get E[f]. [00:44:42] Right? And you see that this already makes it a little bit more symmetric, or at least on the surface it looks more symmetric, because this is a sum of things and that is a sum of things. Of course it's still a little bit different, because the expectation is in front of this one, but there's no expectation in front of
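[Editor's note: the ghost-sample identity E[f] = E[(1/n) sum_i f(Zi')] is easy to sanity-check numerically. A minimal sketch; the choices f(z) = z^2 and P = Uniform[0, 1], for which E[f(Z)] = 1/3, are mine and not from the lecture:]

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices: f(z) = z**2, P = Uniform[0, 1], so E[f(Z)] = 1/3.
f = lambda z: z ** 2
true_mean = 1.0 / 3.0

# Draw many independent ghost samples Z'_1..Z'_n and average f over each.
# Each f(Z'_i) has expectation E[f], so the expectation of the ghost-sample
# average is E[f] as well, which is the rewriting step used above.
n, trials = 25, 100000
ghost = rng.uniform(0.0, 1.0, size=(trials, n))
estimate = f(ghost).mean(axis=1).mean()
```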
the first sum. [00:45:05] So what we can do is, of course, one thing you can do is put the expectation in front of both terms, which is not really doing anything, because for now the Zi are constants and the Zi' are random, so in some sense you are just putting some constants inside an expectation. [00:45:29] And now, what happens is that you can switch the expectation with the sup. Maybe let's answer the question first... oh, sorry, yeah. It's this one? Yeah, cool, thanks. [00:45:47] So now we'll make it more symmetric: we'll switch the expectation with the sup. I'm claiming that if you switch them, you get an inequality. [00:46:21] Right, so why is this true? This is just a very generic inequality, which claims that you can switch a sup with an expectation and get an inequality. So, generically, the claim is: suppose you have a function g that takes
in two variables, and suppose you take the expectation first, over the randomness of the second variable, and then you take the sup over the first variable. Suppose this is the quantity you have. Then you can bound this by first taking the sup and then taking the expectation: sup_u E_v[g(u, v)] <= E_v[sup_u g(u, v)]. But when you do the math, you are going from the right-hand side to the left-hand side. [00:47:12] And why is this true? Because you can have an intermediate step: you take the sup over u of the expectation over v, and you bound g(u, v) termwise by the sup over u', let's call it, of g(u', v). [00:47:33] Right, so this inequality is very simple: it's just because this term is termwise bounded by the sup. And once you do this step, then you see that this whole thing doesn't depend on u
anymore, right? So maybe I should have another step. Actually, I'm claiming that this is just equal to this, because this term doesn't depend on u; you already got rid of u. Right, so the sup over u can just be dropped, and then the green term equals the term below, just because you changed the variable name from u to u', which is nothing. Right, so that's why it's an equality. [00:48:20] So in general it's probably useful to just know this as a fact: you can switch the sup with the expectation and get an inequality. [00:48:30] Sometimes I don't remember in which direction the inequality goes, of course, so that's why you still want to somewhat know how to prove it, so that in case you get confused about which direction it is, you can still recover it. [00:48:42] Um, okay, cool. So that's how this works. And now, if you take the expectation over Z
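[Editor's note: the generic fact sup_u E_v[g(u, v)] <= E_v[sup_u g(u, v)] can be seen on a finite table. A toy sketch; the random table g and the uniform distribution over columns are my illustrative choices:]

```python
import numpy as np

rng = np.random.default_rng(2)

# Represent g as a table: rows index u (the sup variable), columns index v.
# Take v uniform over the columns, so E_v is just a mean over axis 1.
g = rng.normal(size=(6, 10))

sup_of_mean = np.max(g.mean(axis=1))   # sup_u E_v[g(u, v)]
mean_of_sup = np.mean(g.max(axis=0))   # E_v[sup_u g(u, v)]
```

The inequality holds for every such table, not just on average: within each column, g(u, v) is termwise bounded by the column's sup, which is exactly the intermediate step in the lecture's proof.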
again. We already conditioned on Z, but now let's take the expectation over Z. Then you see that this is very symmetric. So what we got is: the expectation over Z1 up to Zn [00:49:14] is bounded by, now, two expectations here, one over the Zi's and the other over the Zi''s, and then you have the sup. [00:49:33] Let's put it into a single sum, by the way. [00:49:54] So now it becomes a little more symmetric, and I'll do one more thing to make it even more symmetric. [00:50:02] So, this one is magic, in the sense that it is actually a mean-zero random variable. No, not just mean zero: actually, in terms of distribution, it is symmetric, in the following sense. f(Zi) minus f(Zi') has the same distribution [00:50:29] as f(Zi') minus f(Zi), because these two things are just renamings of each other, in some sense. So they have the same
distribution. Or, in other words, this has the same distribution as sigma_i (f(Zi) minus f(Zi')), for any sigma_i that is binary. [00:50:54] Right: whether it's minus one or plus one, it's the same thing; the minus sign just flips the order. [00:51:01] So that means you can, for free, introduce this random variable sigma_i and not change anything. So that means, if you introduce these Rademacher random variables [00:51:17] and you take the expectation over the sigma_i's, [00:51:28] you multiply this f(Zi) minus f(Zi') by sigma_i, and this is still an equality. Actually, here, even if you choose any fixed sigma_i this is an equality; of course, if you choose random sigma_i's, on average it's still an equality. [00:51:50] So, technically, you first claim that for any sigma_i it's an equality; the first step is that this is an equality for every sigma_i. And then you say that even if you take another expectation
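[Editor's note: the sign-symmetry being used, that f(Z) - f(Z') has the same law as sigma (f(Z) - f(Z')) for independent random signs, can be checked by simulation. A sketch; the choices f(z) = exp(z) and P = N(0, 1) are mine, not the lecture's:]

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative choices: f(z) = exp(z), P = standard normal.
f = lambda z: np.exp(z)
m = 400000
d = f(rng.normal(size=m)) - f(rng.normal(size=m))   # samples of f(Z) - f(Z')
sigma = rng.choice([-1.0, 1.0], size=m)             # independent Rademacher signs

# Swapping Z and Z' negates the difference, so d is symmetric about 0 and
# multiplying by an independent sign leaves its distribution unchanged.
mean_d = d.mean()                # ~ 0 by symmetry
mean_sd = (sigma * d).mean()     # same law as d, so also ~ 0
frac_pos = (d > 0).mean()        # ~ 1/2
```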
over the sigma_i's, [00:52:08] that's still true. [00:52:14] It's still true, and then you can switch the expectations however you want. Okay, and I'm going to switch them, just because it's a little bit more convenient for me to do that. [00:52:37] Okay. [00:52:42] So now what I'm going to do is I'm going to break this into two sups. So I'm going to have expectations here, over all the randomness; this is just a simplification of notation. So I'm claiming that this is less than the sup of the first term plus the sup of the second term. [00:53:25] And here what we are doing is essentially exactly the same thing as the switch of the expectation and the sup, but here we only have two terms, so it's a swap of a sum and a sup. So here we are doing something like a sup of two terms, something like a function u of f plus another function phi of f, and then you can say that this is
less than the sup over f of u(f), plus the sup over f of phi(f). [00:53:59] I guess you can prove this almost the same way as we did with the expectation; you just need one step in the middle. I will leave this as an exercise for you. [00:54:12] So now you've probably seen that we are getting closer and closer to the definition of the Rademacher complexity. The only thing is that we have two terms, and the Rademacher complexity has only one of them. So now we can change this: we can swap the expectation with the sum. So you get the expectation of the sup [00:54:34] of sigma_i f(Zi), plus the expectation of the sup [00:54:42] of minus sigma_i f(Zi'), and here the randomness is Z1 up to Zn and sigma_1 up to sigma_n. So the first term is exactly the Rademacher complexity, and I'm going to claim that the second term is also exactly the Rademacher complexity, because here
my randomness is Z1' up to Zn' and sigma_1 up to sigma_n. But again, because minus sigma_i f(Zi') has the same distribution [00:55:15] as sigma_i f(Zi), right, because minus sigma_i has the same distribution as sigma_i, and Zi' has the same distribution as Zi, the second term is equal to the first term. So basically this is just equal to two times this: [00:55:41] two times the Rademacher complexity. [00:55:55] Any questions? [00:56:14] That's a good question, and that's exactly what I'm going to remark on. The question was, if I phrase it slightly differently: what have we really done here? Did we do anything powerful, or did we do something trivial? Because the left-hand side has a sup, and the right-hand side still has a sup, right? So did we do something really useful, or did we just do a bunch of algebra? So I'm going to claim that we did do something useful, and the
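[Editor's note: the symmetrization bound just derived, E[ sup_f ((1/n) sum_i f(Zi) - E f) ] <= 2 R_n(F), can be observed in a toy simulation. A sketch; the class f_w(z) = cos(w z) for w = 1..8 and P = Uniform[0, 2*pi], for which every E[f_w] = 0, are my illustrative choices:]

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy class: f_w(z) = cos(w * z), w = 1..8, with P = Uniform[0, 2*pi],
# so each population mean E[f_w] is exactly 0.
ws = np.arange(1, 9)
n, trials = 30, 4000

lhs_total, rad_total = 0.0, 0.0
for _ in range(trials):
    z = rng.uniform(0.0, 2.0 * np.pi, size=n)
    vals = np.cos(np.outer(ws, z))           # |F| x n matrix of f_w(z_i)
    lhs_total += np.max(vals.mean(axis=1))   # sup_f [(1/n) sum_i f(z_i) - E f]
    sigma = rng.choice([-1.0, 1.0], size=n)
    rad_total += np.max(vals @ sigma) / n    # sup_f (1/n) sum_i sigma_i f(z_i)

expected_sup_gap = lhs_total / trials   # ~ E[ sup_f (empirical mean - E f) ]
rademacher = rad_total / trials         # ~ R_n(F)
```

In this run the left-hand side sits comfortably below twice the Rademacher complexity, as the theorem predicts.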
reason is that, um, the left-hand side, you know, something like a sup, is what we care about, right? The difference between the empirical mean and the population mean. [00:56:59] And on the right-hand side, [00:57:03] roughly speaking, the most important thing is this thing. [00:57:12] So what have we achieved here? [00:57:17] So, one: we have achieved that we removed [00:57:24] the E[f]. [00:57:28] Right, so we got rid of the E[f]. And it's probably, you know, not super clear why we should appreciate this fact that we got rid of the E[f] in the first step. But I can say that this E[f] is, you know, somewhat annoying, because you don't have good control over it. Right? So when you look at this, in some sense this quantity doesn't depend on the shift; for example, this quantity doesn't depend on E[f], right, so if you shift it a little bit, it wouldn't change. Actually, we're
going to claim that this quantity on the right-hand side is translation invariant. So in some sense you removed the part that is sensitive to translation. [00:58:07] So maybe let me just claim: [00:58:12] the right-hand side is translation invariant, or maybe just the Rademacher complexity is. [00:58:24] I'm going to claim this, and prove it in a moment; let me see whether I planned to do this in today's lecture. [00:58:37] I think I didn't plan to do it in today's lecture, but this is the claim. So in some sense you remove the translation-sensitive part, you remove the E[f], right, which is useful in many cases. And, [00:58:53] second: you sometimes introduce more randomness, [00:59:00] sigma_1 up to sigma_n. So while introducing this randomness is useful, it's probably still unclear right now why. But eventually, what we're going to do is... So currently, what we really have is: you have the expectation of this, right, and you also have an
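[Editor's note: the translation-invariance claim has a short proof worth recording. A sketch for the empirical Rademacher complexity R_S, shifting every function in the class F by a constant c; the notation follows the definition in the lecture:]

```latex
\begin{aligned}
R_S(\mathcal{F}+c)
&= \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\sigma_i\,\big(f(z_i)+c\big)\Big] \\
&= \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\sigma_i\,f(z_i)
   \;+\; \frac{c}{n}\sum_{i=1}^{n}\sigma_i\Big]
   \qquad \text{(the shift term does not depend on $f$)} \\
&= \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\sigma_i\,f(z_i)\Big]
   \;+\; \frac{c}{n}\sum_{i=1}^{n}\mathbb{E}[\sigma_i]
 = R_S(\mathcal{F}),
\end{aligned}
```

since each Rademacher sign has mean zero, E[sigma_i] = 0.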
expectation over this, and here the randomness is Z_1 up to Z_n and sigma_1 up to sigma_n. [00:59:24] So we use additional randomness, and this will allow us to drop the randomness from Z_1 up to Z_n. This will be something we'll see, I guess, probably in the next lecture. [00:59:47] So eventually you don't have to take the expectation over Z_1 up to Z_n; you can claim, with high probability over the draws of Z_1 up to Z_n, a bound where the only randomness comes from the sigma_i's. [01:00:06] I guess you probably don't see exactly what I mean yet, but if eventually you only care about the randomness of sigma_1 up to sigma_n, that seems to be a benefit, because this is a much simpler
[01:00:22] randomness, right? Sigma_1 up to sigma_n have a very simple distribution: they are just Rademacher random variables, so they are much less complicated than the distribution of Z_1 up to Z_n, which is something you don't know. You just assume there is a distribution P, but you don't really know any other properties about it. [01:00:43] So I think that's the second benefit. [01:00:46] But of course the limitation is that we still have the sup, which is still a problem. But you probably shouldn't expect that you can remove the sup at this level, when you have an abstract function class; you probably shouldn't expect you can remove the sup completely. It should be at the next level, where you remove the sup, when you have a concrete hypothesis class. [01:01:26] Cool. So the next part is another useful
property, or kind of useful thing to know, about the Rademacher complexity, which is that the Rademacher complexity can depend on the distribution P. [01:01:43] It still can depend on the distribution P, even though our goal is to use the new randomness, to deal with the simpler randomness. [01:01:52] Why is this the case? This is just because, in this definition of the Rademacher complexity, you do have to draw Z_1 up to Z_n from the distribution P. [01:02:04] So here is a good example where you can see that, where P is a point mass. [01:02:17] So let's say Z is equal to z_0 almost surely; so however you draw, you always just draw a single point. And in this case you can actually have a good Rademacher complexity for any bounded family of functions. [01:02:29] So suppose, let's say the values are between minus one and one, and suppose this is the only constraint on the family F. So basically you care
about F, [01:02:44] or maybe more technically, let's say F is the family of functions f such that |f(z_0)| is bounded by one. So we just have a bounded family of functions; you don't even have any parametric form. [01:02:57] Still, you can prove that the Rademacher complexity of this family is small. [01:03:06] So you can say that if you look at the sup: [01:03:16] E_sigma sup over f in F of (1/n) sum_i sigma_i f(z_i). Because f(z_i) is always the same, f(z_0), this literally equals E_sigma sup over f of f(z_0) times (1/n) sum_i sigma_i. [01:03:31] And because, you know, f(z_0) is just a constant, it doesn't depend on... wait, sorry, my bad, I'm wrong with that, let's see. [01:03:43] So f(z_0) still depends on f, right? But f(z_0) is bounded between minus one and one.
[01:03:53] So that means this is less than or equal to: if you just bound this f(z_0) by one, you get E_sigma |(1/n) sum_i sigma_i|. [01:04:05] And then you use Cauchy-Schwarz: the expectation of this random variable is smaller than the expectation of the square of the random variable, to the power of one half. [01:04:31] And then, after these derivations (we're going to see this computation several times), you get: (1/n^2 times E[sum over i not equal to j of sigma_i sigma_j + sum of sigma_i^2])^(1/2). I'm just expanding the square, right, so you get the 1/n^2. [01:04:58] Sigma_i sigma_j for i not equal to j has mean zero, so you get (1/n^2 times sum over i from 1 to n of E[sigma_i^2])^(1/2). Each of these is one; you
take the sum, we get n, so you get (n/n^2)^(1/2) = 1/sqrt(n). [01:05:17] So in some sense this is kind of interesting, right? For a very, very large family of functions, without even a parametric form, you can still have a good Rademacher complexity, and the reason is that the distribution is so simple. [01:05:31] So in some sense this is an indicator that the Rademacher complexity can capture something about the distribution. If the distribution is simple, then the Rademacher complexity can capture that and tell you that it's very easy to generalize. [01:05:48] So basically, any family on a very simple distribution should be considered as very simple, even though sometimes this family F is one where you have basically no assumptions in some sense: there is no parametric
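The point-mass computation above can be checked numerically. This is a minimal sketch, assuming the family {f : |f(z_0)| <= 1}: under a point mass, the empirical Rademacher complexity reduces to E|(1/n) sum_i sigma_i|, which Cauchy-Schwarz bounds by 1/sqrt(n).

```python
import numpy as np

# Monte Carlo check of the point-mass example from the lecture: under a
# point mass at z0, for the family {f : |f(z0)| <= 1}, the empirical
# Rademacher complexity is E|(1/n) * sum_i sigma_i| <= 1/sqrt(n).
rng = np.random.default_rng(0)

def point_mass_rademacher(n, trials=20000):
    # sigma_i are i.i.d. Rademacher variables (+1 or -1 with probability 1/2)
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))
    # sup over |f(z0)| <= 1 of f(z0) * (1/n) sum_i sigma_i = |(1/n) sum_i sigma_i|
    return np.abs(sigma.mean(axis=1)).mean()

estimates = {n: point_mass_rademacher(n) for n in (10, 100, 1000)}
for n, est in estimates.items():
    print(f"n={n:4d}  estimate={est:.4f}  bound 1/sqrt(n)={n ** -0.5:.4f}")
```

The estimate shrinks at the 1/sqrt(n) rate and always sits below the Cauchy-Schwarz bound, matching the derivation on the board.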
[01:06:05] form, it's a very large family of functions; but with respect to the simple distribution it should be considered as simple, and this is what the Rademacher complexity can tell you. So that's saying that the Rademacher complexity can take into account the distribution P. [01:06:20] But how much it can take into account the distribution P, that's a question mark, right? In many of the analyses you don't have this property: you don't really use much about the distribution P in many of the concrete bounds for the Rademacher complexity. But in principle it can capture something about P. [01:06:48] I have 15 minutes; I think there is time for me to do the next part. [01:07:01] Let's see... yes, I think I have time to do this. Okay, so the next part, if there are no questions. [01:07:30] But, I mean, is it possible... [01:07:38] So your question is whether, for example, when the
features X, [01:07:45] the coordinates of X, have correlations, or maybe you have independence. Independence is probably more like a simplifying thing, right? So can you get a better bound from the Rademacher complexity? [01:08:04] I think, to answer this question, we need to zoom in to concrete settings. For linear models, I guess, you would see what bounds you get if you compare two extreme cases, where in one case all the coordinates are correlated, and in the other case... actually it's unclear, right? Because if all the coordinates are correlated, you probably should have a better bound: in a very, very extreme case, all the coordinates are the same, and then you effectively have a one-dimensional problem, so you should have a better bound. [01:08:44] So it does depend on the particular situation, I think.
[01:08:52] Right, so yeah, it's interesting: it's not clear that independence really means simpler. Independence could mean that it's more complicated, just because with an independent input distribution you have a diverse set of data; it might be harder to generalize in some cases. [01:09:11] For example, in this point-mass case, you have a very nice family of data, and that means you can generalize more easily, because you can memorize that z_0. So independence might make it harder. [01:09:30] Okay, so in the next 10 to 15 minutes, let me try to define the so-called empirical Rademacher complexity. [01:09:41] And the goal here is to remove the expectation in front of the sup. So currently the averaged version has two expectations: one is over the randomness of Z_1 up to Z_n, and there's another expectation over
[01:09:57] the randomness that we created, over sigma_1 up to sigma_n, and you have this sup. [01:10:13] And we're going to claim that this is basically similar to the same quantity without the outer expectation, with high probability. [01:10:27] So, with high probability, where the probability is over the randomness of Z_1 up to Z_n: you still have to draw Z_1 up to Z_n, but for most choices of Z_1 up to Z_n, these two things are similar. [01:10:46] Right, so the version without the expectation is a random variable that depends on Z_1 up to Z_n, while the averaged version is just... I probably shouldn't call it a constant, it's a deterministic number. [01:11:00] And I'm claiming that the second one, the random right-hand side, is concentrating around the first one, with high probability. [01:11:12] And if you can do this, then that's what I kind of alluded to
before. [01:11:16] So now, this quantity is defined to be the empirical Rademacher complexity. [01:11:26] I guess, let me have a notation for that. [01:11:41] I think in the lecture notes there is a formal definition, but here, just for the sake of time, let's define this to be R_S(F), where S is the set {z_1, ..., z_n}, and this is called the empirical Rademacher complexity. [01:12:01] And you can see that the original Rademacher complexity, the averaged version, is the expectation of the empirical Rademacher complexity, where you take the expectation over the set S. [01:12:16] So these two things only differ by a single expectation. And so, if you can do this, then you have a high-probability bound; you don't have to average over Z_1 up to Z_n. And also you can do the same thing for the left-hand side, for the
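The definition just given can be sketched in code. This is a minimal illustration, assuming a made-up finite class of threshold functions and a made-up sample S (neither is from the lecture); the expectation over sigma is estimated by Monte Carlo.

```python
import numpy as np

# Sketch of the empirical Rademacher complexity just defined,
#   R_S(F) = E_sigma [ sup_{f in F} (1/n) sum_i sigma_i f(z_i) ],
# estimated by Monte Carlo over sigma for a finite class F.
rng = np.random.default_rng(0)

def empirical_rademacher(values, trials=10000):
    """values[k, i] = f_k(z_i) for a finite class {f_1, ..., f_K} on S."""
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))
    # For each draw of sigma, the sup over a finite class is just a max.
    correlations = sigma @ values.T / n          # shape (trials, K)
    return correlations.max(axis=1).mean()

# Fixed sample S = {z_1, ..., z_n} and threshold functions f_t(z) = 1[z >= t].
z = np.linspace(0.0, 1.0, 50)
values = np.array([(z >= t).astype(float) for t in np.linspace(0.0, 1.0, 11)])
r_s = empirical_rademacher(values)
print(f"estimated R_S(F) = {r_s:.4f}")
```

Note that R_S(F) is a deterministic function of the sample S; only the sigma_i are averaged over, which is exactly the point of the definition.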
uniform convergence statement. [01:12:36] Recall that before we only proved a bound on the expectation over Z_1 up to Z_n of the sup over f of (1/n) sum_i f(z_i) minus E[f]; we proved that this is less than the Rademacher complexity. [01:12:57] We will also show that this sup is approximately equal to its expectation, with high probability. I guess I should say: the latter one is a random variable, right, that depends on Z_1 up to Z_n, and it is approximately equal to the expectation with high probability. [01:13:19] So if you have both of these, then you basically remove the expectation from your equation, and you get a high-probability bound. [01:13:31] Does that make sense? Any questions? [01:13:36] So basically, eventually we're going to prove this, so let me state the formal theorem. [01:13:43] We can prove that, [01:13:55] suppose all the f's are bounded, then, with probability at least one minus delta, we have: the sup, and here I don't have an
expectation; this is random, [01:14:13] where the randomness is over Z_1 up to Z_n. The sup of this difference is less than two times the empirical Rademacher complexity R_S(F), plus an additional term, which involves the square root of log(2/delta) over 2n. [01:14:51] So you pay an additional small term, which is on the order of 1/sqrt(n) times something logarithmic in the failure probability delta. [01:15:02] But basically, by paying this, you get a high-probability bound instead of the averaged version. [01:15:23] I think the proof here is actually relatively straightforward; it's basically just applying McDiarmid's inequality. [01:15:35] But maybe let me do that in the next lecture; I think it takes probably 10 minutes. [01:15:43] Maybe let me start with a remark. [01:15:47] So, remark one: I guess typically this sqrt(log(2/delta)/n) term is typically much smaller than either
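A standard written form of this theorem, reconstructed with the constants used in common textbook statements (e.g. for functions taking values in [0, 1]; the exact constants on the board may differ): with probability at least 1 - delta over the draw of S = (Z_1, ..., Z_n),

```latex
\sup_{f \in \mathcal{F}} \left( \mathbb{E}[f(Z)] - \frac{1}{n}\sum_{i=1}^n f(Z_i) \right)
\;\le\; 2\,R_S(\mathcal{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.
```

Both the sup and R_S here are deterministic functions of S; all remaining randomness is in the draw of S itself.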
the Rademacher complexity, either the empirical one or the population one. [01:16:06] And the reason is that these two things will be something like the square root of something over n, where the something depends on the complexity of F; it's something that is not negligible. But here you have the square root of a logarithmic term over n, and that's pretty much the smallest thing you can think of, right? A logarithm is kind of like a constant. [01:16:29] So your complexity of F wouldn't be on the order of the logarithm of anything; it should be something bigger than that. So that's why, typically, this additional term is negligible, and that's why, basically, you can think of it as: you didn't lose anything by doing the empirical version. [01:16:48] And it's interesting that what you lose here, at this level, doesn't depend on the complexity of F.
[01:16:58] Right, so basically, this first term depends on the complexity of F, but what you lose between the expected empirical and population versions doesn't depend on the complexity of F. [01:17:14] And maybe, I think, this is a perfect time for a second remark. Remark two: R(F) and R_S(F) are both translation invariant. [01:17:36] So what does that mean? That means: suppose you have F', which is a translation of F, which means that this is the family of functions f' with f'(z) = f(z) + c_0, for a universal constant c_0. [01:18:02] So for every function f in capital F, you've got the function f' in capital F', which is just a translation: you just add some c_0 to it. [01:18:12] Then they have the same empirical Rademacher complexity. [01:18:29] And in some sense we have seen this derivation somewhere before, in one of the derivations, but let me just make it more explicit. [01:18:38] So the Rademacher complexity of
this is [01:18:42] the the rather more complex of this is you look at the expectation of Sigma [01:18:46] you look at the expectation of Sigma and you take the soup off [01:18:51] sum of Sigma i f I Prime [01:18:55] sum of Sigma i f I Prime F Prime [01:18:58] F Prime CI [01:19:03] and [01:19:05] and you plug in a definition [01:19:14] plus zero [01:19:17] but [01:19:21] right so now you can [01:19:25] right so now you can put the part about c0 I think we have [01:19:27] put the part about c0 I think we have seen the same technique before because [01:19:30] seen the same technique before because c0 is not a function of little f so you [01:19:33] c0 is not a function of little f so you can put it out so you get [01:19:42] Plus [01:19:43] Plus one over n times sum of Sigma I times c0 [01:19:48] one over n times sum of Sigma I times c0 and then you can survive expectations [01:19:50] and then you can survive expectations with the sum so you get expectation [01:19:52] with the sum so you get expectation signal [01:19:53] signal the soup [01:19:55] the soup plus expectation of y Over N times sum [01:19:58] plus expectation of y Over N times sum of Sigma i c 0. [01:20:01] of Sigma i c 0. 
[01:20:04] And this becomes zero, because sigma_i is a binary, or rather a Rademacher, random variable. So then this is R_S(F). [01:20:13] So in some sense this is a property of the Rademacher complexity which is somewhat interesting, right? You don't care about translation, but you do care about scale: if you scale everything by a half or by two, then you would change the Rademacher complexity, but it wouldn't change when you shift things. So it's about the relative differences between the functions; it's not about their absolute size. [01:20:41] So, for example, if the functions in F always take values between one thousand and one thousand and one, that's not very different from taking values between zero and one. [01:20:50] Okay, I think this is a natural stopping point for today. Any questions? [01:21:08] Uh, first of all, is it always the case that... right,
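The invariance derivation above can be checked numerically. This is a minimal sketch with an arbitrary made-up function class (not from the lecture): shifting every function by a constant c_0 leaves the empirical Rademacher complexity unchanged, while rescaling changes it.

```python
import numpy as np

# Numerical check of the translation-invariance claim for R_S.
rng = np.random.default_rng(0)

def empirical_rademacher(values, sigma):
    """values[k, i] = f_k(z_i); sigma holds +/-1 draws, one row per trial.

    Pairing each sigma with -sigma makes the (1/n) sum_i sigma_i * c0
    term cancel exactly, mirroring E[sigma_i] = 0 in the derivation."""
    s = np.concatenate([sigma, -sigma])
    n = values.shape[1]
    return (s @ values.T / n).max(axis=1).mean()

n = 30
values = rng.normal(size=(5, n))                   # five functions on n points
sigma = rng.choice([-1.0, 1.0], size=(20000, n))

r = empirical_rademacher(values, sigma)
r_shifted = empirical_rademacher(values + 1000.0, sigma)  # f'(z) = f(z) + c0
r_scaled = empirical_rademacher(2.0 * values, sigma)      # scaling is NOT invariant
print(r, r_shifted, r_scaled)
```

Shifting by c_0 = 1000 leaves the estimate unchanged, while doubling the functions doubles it, matching the remark that only the relative differences between functions matter, not their absolute size.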
[01:21:08] [Student question.] Uh, first of all, it's not always the case that... right, that's a good question. So I claimed this rate, you know, without any justification: why should the Rademacher complexity scale like 1/√n? So I should say it's not actually exactly true. For most of the cases, actually for all the cases we are going to see in these lectures, it's 1/√n, but in some cases the dependency on n could be a little bit different. So, yeah, sorry, I was not quite clear. And I'm not sure whether that question is still in the homework; I think there used to be a homework question where you get other dependencies on n. I think I probably removed that question for this year, just because it's not that relevant to the overall goal, but there could be other dependencies. [01:22:04] For some reason,
it's mostly the case that it's 1/√n. I think the reason is that even if you look at a single example, a single function, and you don't take the sup, right, you do the following: you fix your function, then you draw your data, and you look at how different the empirical average is from the population one. That's always of order 1/√n, without any doubt. So that's why you can never do better than 1/√n, but you can be worse than 1/√n. I'm not sure whether that makes sense. So if you look at the concentration at a single f, you fix the function, you draw the random variables z_1 up to z_n, and you still have fluctuations of order 1/√n, so you cannot beat that; but it could be worse than that.
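A small simulation (my own, not the lecturer's) illustrating this single-function concentration: fix one bounded function, draw n samples, and watch the average |empirical mean − population mean| shrink at the 1/√n rate. The toy choice f(z) = z with z uniform on {−1, +1} is an arbitrary example.

```python
import random

def mean_abs_deviation(n, trials=3000, seed=0):
    # Average |empirical mean - population mean| for one fixed function:
    # f(z) = z with z uniform on {-1, +1}, so the population mean is 0.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # n independent +/-1 signs: 2 * (number of one-bits) - n
        s = 2 * bin(rng.getrandbits(n)).count("1") - n
        total += abs(s) / n
    return total / trials

d_small = mean_abs_deviation(100)
d_large = mean_abs_deviation(6400)
ratio = d_small / d_large
# multiplying n by 64 shrinks the deviation by roughly 8 = sqrt(64)
```

This is exactly the "fix f, draw the data" experiment from the answer above: the fluctuation for a single function is of order 1/√n, and no smaller.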
[01:23:06] From the definition of the Rademacher complexity, I think you can still see that to some extent. Because if you look at the sum Σ_i σ_i f(z_i), maybe let's just write it here, this is still a sum of n terms, right? And if you don't take the sup, this term would be something of order √n. That's because of concentration: you have a sum of n terms, each of them of order one, so the sum is of order √n, and then you divide by n and you get 1/√n. [01:23:45] We can talk more offline, maybe. Okay, sounds good. I guess I'll see you on Monday or Wednesday.

================================================================================
LECTURE 006
================================================================================
Stanford CS229M - Lecture 6: Margin theory and Rademacher complexity for linear models
Source: https://www.youtube.com/watch?v=echF7IWE05c
---
Transcript

[00:00:05] Uh, hello everyone. So I guess in this lecture, what we're going to do is that we're
going to bound the Rademacher complexity by some concrete formula, [00:00:32] for concrete models. [00:00:35] And by concrete models I really just mean linear models, for this lecture; in a few lectures we're going to talk about neural networks. And just as a review, to connect to past lectures: we have proved that the generalization error, the excess risk, is upper bounded by the Rademacher complexity. That's what we did last time. And in this lecture and the next few lectures, we're going to talk about how to upper bound the Rademacher complexity for concrete models, like linear models or neural networks. [00:01:12] And we are also going to deal with the classification 0-1 loss. There is something to do there, because it's a binary loss, it's not continuous, so we have to deal
with that using some special technique. [00:01:32] So that's the overview for this lecture. Let's first set up the basics of classification. We're going to deal with binary classification, [00:01:53] with labels in {−1, +1}, and a classifier h which maps the input space X to the real numbers R. So here we think of h as the function that maps the input to a real number, the logit for example, and when you make a prediction you take the sign [00:02:16] of the output of the classifier: you take sign(h(x)), and this gives you the predicted label. If h outputs a positive number you output +1, and otherwise you output −1. [00:02:30] And H is the family of such h's; that's our notation. And the loss function [00:02:43] on an example (x, y) is equal to the indicator that y is not equal to sign(h(x)). That's our setup.
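A minimal sketch of this setup in code (the toy score function is hypothetical): a real-valued classifier, the sign prediction rule, and the 0-1 loss 1{y ≠ sign(h(x))}.

```python
def sign(t):
    # Prediction rule: map a real-valued score to a label in {-1, +1}.
    # (Sending a score of exactly 0 to +1 is an arbitrary tie-breaking choice.)
    return 1 if t >= 0 else -1

def zero_one_loss(h, x, y):
    # 0-1 loss: 1 if the predicted label disagrees with the true label y.
    return 1 if y != sign(h(x)) else 0

# hypothetical 1-D scorer h(x) = 2x - 1; the prediction is sign(h(x))
h = lambda x: 2.0 * x - 1.0
losses = [zero_one_loss(h, x, y) for x, y in [(1.0, 1), (0.0, 1), (0.0, -1)]]
# → [0, 1, 0]
```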
[00:03:01] So I guess the first thing is that I'm going to very briefly mention the finite hypothesis class case, just as a very quick note. We have already analyzed finite hypothesis classes, so it's probably useful to know that you can recover the same bounds for a finite hypothesis class using this machinery of Rademacher complexity; that's a reasonable requirement if you think that Rademacher complexity is a powerful tool. So there is indeed such a theorem, which I'm not going to prove today, because the way to prove it is actually more related to something more advanced later. [00:03:52] So I'm just going to state the theorem. It says: if F satisfies, for every f in F, (1/n) Σ_i f(z_i)² ≤ M², [00:04:14] this is the
condition that is a little, kind of, not super intuitive, but what this is really saying is that, you know, this is a weaker version [00:04:28] of just assuming f(z) is bounded, |f(z)| ≤ M. Right? Because if |f(z)| ≤ M, then of course the average of the squares is at most M². So I'm stating the weaker version of the condition for generality. And if you have this, then the empirical Rademacher complexity on these examples z_1 up to z_n, so S = (z_1, …, z_n), is bounded in terms of the size of this hypothesis class: it depends on the logarithm of the size of the hypothesis class, and on M², the range of this function class F, and you also divide by n and take
the square root. So this is, you know, essentially on the order of √(M² log|F| / n), up to constants. [00:05:29] And if you apply this to the finite hypothesis classes that we talked about, for example if you apply it to the binary loss functions, you get what we had before; it's almost exactly the same bound eventually. [00:05:46] And we are not going to prove this today; we are going to prove it in a future lecture, because the technique is more related to something we're going to use later. So this is just saying that we can recover what we had, but it's more interesting when you apply the Rademacher complexity to a continuous function class.
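To make the stated theorem concrete, here is a small numerical check (my own construction; the constant 2 under the square root follows the usual Massart-style statement and may differ from the exact constants in the lecture notes). Each function in the hypothetical finite class F is given by its values on n = 6 points, so the condition (1/n) Σ_i f(z_i)² ≤ M² holds with M = 1, and the exact empirical Rademacher complexity sits below √(2 M² log|F| / n).

```python
import itertools
import math

def emp_rademacher(fvals):
    # Exact E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ] by enumerating
    # all 2^n sign vectors (feasible because n is tiny).
    n = len(fvals[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, f)) / n for f in fvals)
    return total / 2 ** n

# hypothetical finite class of +/-1 valued functions on n = 6 points
F = [[1, -1, 1, -1, 1, -1],
     [1, 1, -1, -1, 1, 1],
     [-1, 1, 1, 1, -1, -1]]
n = len(F[0])
M = max(math.sqrt(sum(v * v for v in f) / n) for f in F)  # = 1 here
bound = math.sqrt(2 * M * M * math.log(len(F)) / n)       # Massart-style bound
rad = emp_rademacher(F)
# rad comes out below the bound, as the theorem guarantees
```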
And we have also talked about the limitation of finite hypothesis classes: even if you apply this with some kind of discretization of continuous models, you're going to lose a parameter p in your bound. So if you do this plus some discretization, then what you'll likely get is something like √(p/n), where p is the dimensionality of the model, the number of parameters in the model. And that wouldn't be super impressive, you know, given the brute-force discretization arguments we have already done before. [00:06:53] So today what we're going to do is upper bound the Rademacher complexity in a different way, not using these kinds of tools, and the way we do it is actually more algebraic and analytical, as
you'll see. [00:07:09] So before doing that, we are going to first deal with the loss function. If you look at the loss, this is ℓ_01((x, y), h) = 1{y ≠ sign(h(x))}. [00:07:26] So the tricky thing is that there is a sign here. If you don't have the sign, and h(x) outputs something binary (recall that we have done this in one of the previous lectures), [00:07:44] so in a previous lecture we have shown that if h(x) outputs something binary, plus one or minus one, then you can show that the Rademacher complexity of the loss class is basically of the same order as the Rademacher complexity [00:08:04] of the hypothesis class H. [00:08:06] That's what we did last time. But now we are using a slightly different definition of the h: the h is
the function that outputs a real number, the one before the sign function. So then this kind of reduction doesn't work anymore. [00:08:26] Of course, you can still apply the same thing to sign ∘ h, but then you're going to get the Rademacher complexity of sign ∘ h, which, again, means you didn't solve the problem, you just moved the problem somewhere else. [00:08:39] So we're going to first deal with this issue, and I think people sometimes call this margin theory. We're going to introduce a bunch of tools to deal with this sign issue; in some sense you have to convert the real number to a binary number in some effective way. And then we're going to bound the Rademacher complexity of linear models
using analytical tools. So that's the plan. [00:09:13] Okay, so I guess the intuition is that the scale, in some sense, matters when you do classification, implicitly, even though at the end of the day the scale doesn't matter. The motivating example is the following. Suppose you have a classification task; I'm using crosses for the positive data, [00:09:42] and circles for the negative data. And if you think about different classifiers, for example this classifier and this one: [00:09:55] for these two classifiers, just intuitively (I'm not claiming anything mathematically rigorous), the pink one probably should generalize worse than the blue one. Because you only see these eight examples, and maybe if
you draw a new test example, maybe it lands here, and then the pink one would make a mistake on this test example, while the blue one seems less likely to make mistakes on test examples. [00:10:29] So intuitively, the blue one seems to have somewhat better generalization, just because it separates the two clusters more clearly and more confidently. So in some sense you can think of this h(x) itself as the confidence: it's a real number, and the bigger it is, the more confident you are about this example. And this matters to some extent, and this is what we are going to ask: how do you reason about this, and make it matter, in some
sense in your analysis [00:11:07] your analysis um [00:11:08] um so so here is the [00:11:11] so so here is the um the more formal approach towards this [00:11:13] um the more formal approach towards this so let's [00:11:15] so let's um firstly Define okay so I guess [00:11:18] um firstly Define okay so I guess let's first assume [00:11:20] let's first assume this is a assumption throughout this [00:11:23] this is a assumption throughout this lecture so we assume that we classify [00:11:25] lecture so we assume that we classify all the examples correctly so assume the [00:11:27] all the examples correctly so assume the tuning [00:11:29] tuning ever [00:11:31] ever is zero [00:11:33] is zero so Perfection perfect classification [00:11:39] fortuning data [00:11:43] and and you can see that this is in some [00:11:45] and and you can see that this is in some sense you know reasonable especially [00:11:47] sense you know reasonable especially giving them the more than kind of like [00:11:49] giving them the more than kind of like on success of the large Network right so [00:11:52] on success of the large Network right so typically you can make the chain viral [00:11:53] typically you can make the chain viral very small [00:11:55] very small um and this was actually a reasonable [00:11:56] um and this was actually a reasonable assumption even before deep learning [00:11:58] assumption even before deep learning came into play like before deep learning [00:12:02] came into play like before deep learning what people did was that you add more [00:12:03] what people did was that you add more and more features and to your [00:12:05] and more features and to your dimensionality of the features becomes [00:12:07] dimensionality of the features becomes Higher and Higher and at some point in [00:12:09] Higher and Higher and at some point in the dimensionality of the features [00:12:10] the dimensionality of the features becomes bigger than the number of [00:12:12] becomes bigger 
than the number of examples, and then you can always fit the training data with zero error. [00:12:18] So formally, what this means is that for every training example, y_i is always equal to sign(h(x_i)); that's what I mean by training error zero. And for a hypothesis with zero training error, you can define the so-called margin. [00:12:37] Technically, I think this margin, at least if you don't do any modification of it, should probably only be defined for a zero-error classifier. And this is the so-called unnormalized margin. [00:12:54] So the margin of an example (x, y) is really just y times h_θ(x). [00:13:05] You multiply h_θ(x) by y just because you want to make it a positive number: if y is positive, if you have the positive class, you want h(x) to be big, and if y is negative you want h(x) to be
small. So in some sense the margin is kind of an informal version of confidence. It's not a probability, of course; it's between 0 and infinity. And this is always nonnegative if you fit the classifier exactly on the training data: it's nonnegative exactly when you are correct on this data point, when y is equal to sign(h_θ(x)). [00:13:53] Okay, so this is the definition of the margin of a single example, and then you can define the margin on the data set, the margin of the classifier on the data set. This is defined to be the minimum margin over all examples: you take the minimum over i of y_i times h_θ(x_i). [00:14:15] Of course, this margin is a function of the classifier: if you change the classifier, you have different margins. In some sense, the blue one there
has a bigger margin than the pink one, because the pink one has some example with a very small margin, while for the blue one all the examples have big margins, so when you take the minimum over all the examples you still have a relatively big margin. [00:14:43] But, I guess, here I'm defining an unnormalized margin. If you look at the unnormalized margin, it's not exactly the distance from the example to the hyperplane; you would have to normalize it so that it becomes the distance to the hyperplane. But I think in this course we don't need to actually define a normalized margin per se.
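Here is a small sketch of these margin definitions for a linear model h_θ(x) = ⟨θ, x⟩ (the particular θ and data points are made up): the unnormalized margin of an example, the margin of the data set as the minimum over examples, and the normalization by ‖θ‖₂ that turns it into a distance to the hyperplane.

```python
import math

def unnorm_margin(theta, x, y):
    # Unnormalized margin y * h_theta(x) for the linear model
    # h_theta(x) = <theta, x>; positive iff the example is classified correctly.
    return y * sum(t * xi for t, xi in zip(theta, x))

def dataset_margin(theta, data):
    # Margin of the classifier on the data set: the minimum example margin.
    return min(unnorm_margin(theta, x, y) for x, y in data)

theta = [2.0, -1.0]
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]
margins = [unnorm_margin(theta, x, y) for x, y in data]  # → [2.0, 1.0, 1.0]
gamma_min = dataset_margin(theta, data)                   # → 1.0
# dividing by ||theta||_2 gives the geometric (distance) margin
geom_margin = gamma_min / math.sqrt(sum(t * t for t in theta))
```

Note that every margin here is positive, consistent with the zero-training-error assumption under which the margin is defined.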
distance between example to the linear separate to the [00:15:21] example to the linear separate to the hyper plane [00:15:22] hyper plane um so and and the middle margin would be [00:15:25] um so and and the middle margin would be the minimum distance of the of all the [00:15:27] the minimum distance of the of all the examples to the uh to the hyperplate [00:15:32] examples to the uh to the hyperplate so and our goal would be something like [00:15:36] so and our goal would be something like you you basically you're going to bond [00:15:38] you you basically you're going to bond the generation Our older brother [00:15:40] the generation Our older brother I guess we bought the generalization I [00:15:42] I guess we bought the generalization I rebelled by the market complexity [00:15:44] rebelled by the market complexity that's what we did in the past but the [00:15:47] that's what we did in the past but the rather Market complexity and then this [00:15:49] rather Market complexity and then this by some function [00:15:51] by some function of the parameter [00:15:54] and uh the the pirating Norm [00:15:59] and uh the the pirating Norm so the normal Theta and some function of [00:16:01] so the normal Theta and some function of the margin that's what we'll eventually [00:16:04] the margin that's what we'll eventually get out of this lecture [00:16:07] get out of this lecture so [00:16:09] so um and while we need to Define these [00:16:10] um and while we need to Define these margins the reason is that uh this party [00:16:13] margins the reason is that uh this party comes from a technique to deal with the [00:16:15] comes from a technique to deal with the loss function [00:16:17] loss function um so so [00:16:19] um so so we're going to introduce a surrogate [00:16:21] we're going to introduce a surrogate loss [00:16:24] function [00:16:27] uh that's texting the margin that takes [00:16:30] uh that's texting the margin that takes the margin into account [00:16:37] I 
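For a linear model h(x) = ⟨theta, x⟩, the unnormalized and normalized margins can be sketched as follows (a minimal illustration; the toy data and theta here are made up):

```python
import numpy as np

def margins(theta, X, y):
    """Unnormalized margins y_i * <theta, x_i> for a linear model h(x) = <theta, x>."""
    return y * (X @ theta)

def min_distance(theta, X, y):
    """Normalized minimum margin min_i y_i <theta, x_i> / ||theta||_2:
    the distance from the closest example to the hyperplane."""
    return margins(theta, X, y).min() / np.linalg.norm(theta)

# Toy data: theta points along the first coordinate axis.
theta = np.array([2.0, 0.0])
X = np.array([[1.0, 3.0], [-2.0, 1.0]])
y = np.array([1.0, -1.0])

print(margins(theta, X, y))        # unnormalized margins: [2. 4.]
print(min_distance(theta, X, y))   # 1.0 -- unchanged if theta is rescaled
```

Doubling theta doubles the unnormalized margins but leaves the normalized one fixed, which is why the unnormalized margin alone is not a distance.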
[00:16:38] I guess, you know, at the end of the day all of this will be put together. Intuitively, the reason we want to do this is that we somehow believe that a larger margin drives better generalization, so you probably want to have a bound that depends on the margin, and you want to have a loss function that also depends on the margin. So far, if you look at the zero-one loss, it doesn't depend on the margin, right? How large h(x) is doesn't change your zero-one loss on an example; as long as the sign doesn't change, you don't really care. So we want something that depends on the margin. [00:17:13] This loss function is called the ramp loss, and I think sometimes it's also called just the margin loss. [00:17:21] This loss function has a parameter gamma, and gamma is kind of like the target margin in some sense, or a reference margin — you can think of it like that. [00:17:33] So this is a loss function that takes in a single number t, and it outputs — maybe let me draw it first, and then write down the technical equation. [00:17:51] The function looks like this — here this is gamma, and here this is one. When t is larger than gamma, you make it zero; when t is less than zero, you make it one, which corresponds to the flat area on the left-hand side of the origin; and when you are between 0 and gamma, you linearly interpolate, and the way to linearly interpolate is 1 minus t over gamma. So this is the linear region:

l_gamma(t) = 1 if t <= 0;  1 - t/gamma if 0 <= t <= gamma;  0 if t >= gamma.

[00:18:40] And why are we interested in this? The reason is that this is, in some sense, an extension of the zero-one loss.
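The piecewise definition above can be written in a couple of lines (a sketch; `np.clip` gives all three regimes at once):

```python
import numpy as np

def ramp_loss(t, gamma):
    """Ramp (margin) loss l_gamma(t): 1 for t <= 0, 0 for t >= gamma,
    and the linear interpolation 1 - t/gamma in between."""
    return np.clip(1.0 - np.asarray(t, dtype=float) / gamma, 0.0, 1.0)

print(ramp_loss(-1.0, 2.0))  # 1.0  (wrong sign: full loss)
print(ramp_loss(1.0, 2.0))   # 0.5  (correct, but margin below gamma)
print(ramp_loss(3.0, 2.0))   # 0.0  (margin at least gamma)
```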
[00:18:54] So maybe let me first define notation, with a bit of abuse of notation. You can also write l_gamma((x, y), h) — the margin loss applied to the classifier h — which is defined to be l_gamma(y · h(x)). That's the definition. [00:19:21] These two l_gamma's have different meanings on the left-hand side and the right-hand side, as you can see; the right-hand one is the one we just defined. Basically, when you talk about loss functions, a loss takes in two arguments, y and y-hat, but for binary classification the only thing that matters is the product of them — that's why you only care about y times h(x). [00:19:48] In this notation, note that the ideal loss function, the zero-one loss l_01((x, y), h), equals the indicator that y · h(x) is less than or equal to zero — you pay loss one exactly when y and h(x) disagree in sign. This is just a different way to write the zero-one loss function. [00:20:46] (I have to remark that I had a typo here on the board, so that I can fix it for the future.) [00:20:54] Okay, so this is the binary classification loss, and the other one is the so-called ramp loss, and you can see the difference: the indicator function just looks like a step — the indicator that t is less than or equal to zero — and what we do is we extend it, we make this indicator function more continuous. That's basically what we are doing. [00:21:18] And from this you can see that l_gamma(y · h(x)) is always bigger than the indicator that y · h(x) <= 0, just because the function above is pointwise bigger than the function below.
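The pointwise dominance l_gamma(t) >= 1{t <= 0} can be checked numerically on a grid (a sketch; the ramp loss is redefined here to keep the snippet self-contained):

```python
import numpy as np

def ramp_loss(t, gamma):
    return np.clip(1.0 - np.asarray(t, dtype=float) / gamma, 0.0, 1.0)

def zero_one_loss(t):
    """0-1 loss written in terms of the margin t = y * h(x): 1{t <= 0}."""
    return (np.asarray(t) <= 0).astype(float)

# Check dominance on a fine grid, for several values of gamma.
t = np.linspace(-3.0, 3.0, 601)
dominates = all(np.all(ramp_loss(t, g) >= zero_one_loss(t)) for g in [0.5, 1.0, 2.0])
print(dominates)  # True: the ramp loss upper bounds the 0-1 loss pointwise
```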
[00:21:40] Right, and that means that the zero-one loss on any example is always less than the ramp loss on that example. And that means that if you look at the test loss — you take the expectation over (x, y) drawn from P — this is the final thing you really care about, the generalization error, the test error; this is the fundamental thing you care about, and you can at least upper bound it by the population error under the ramp loss, the population loss under the ramp loss. [00:22:30] So by doing this you make the loss bigger, and then we're going to bound this. Basically, what we're eventually going to do is bound the test loss under the ramp loss, which is an upper bound on the binary loss. That's our goal: to upper bound this. [00:23:02] Okay, so how do we upper bound this? [00:23:10] I think — at least when I read this for the first time from a book — I was kind of unclear why you want to do this continuation, making the loss continuous; it will come in a moment. One of the reasons you want to make it Lipschitz is so that you can somehow get rid of the loss. But before doing that, let's first clear up the high-level plan, and then look at the low-level detail about how to deal with the loss. [00:23:39] The high-level plan is just that you let L-hat_gamma(h) be the empirical loss corresponding to the ramp loss,

L_hat_gamma(h) = (1/n) * sum_{i=1}^{n} l_gamma(y_i h(x_i)),

[00:23:55] and you can also define — I think this is a function of h — the population loss, which as I said is

L_gamma(h) = E_{(x,y) ~ P}[ l_gamma(y h(x)) ].
[00:24:16] And then if you use the Rademacher complexity machinery we have developed, you get that the population loss minus the empirical loss is bounded by two times the empirical Rademacher complexity plus three times a square-root term:

L_gamma(h) - L_hat_gamma(h) <= 2 R_S(F) + 3 sqrt( log(2/delta) / (2n) ).

[00:24:43] This is what we did in the previous lecture: the generalization error can be bounded by the empirical Rademacher complexity, where F is this family of losses defined by composing the ramp loss with the hypothesis class,

F = { (x, y) -> l_gamma(y h(x)) : h in H }.

[00:25:04] So this is saying that, eventually, bounding this Rademacher complexity will basically be the goal next — we're going to do this more carefully after we have the Rademacher complexity bound, but roughly speaking, once you have the Rademacher complexity, you have an upper bound on the population ramp loss.
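The R_S term in this bound can be estimated by Monte Carlo when the hypothesis class is small and finite — a sketch, with a made-up class represented by its matrix of predictions on the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(outputs, n_draws=2000):
    """Monte Carlo estimate of R_S = E_sigma[ sup_h (1/n) sum_i sigma_i h(z_i) ].
    `outputs` has one row per hypothesis: (h(z_1), ..., h(z_n))."""
    n = outputs.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    return (sigma @ outputs.T / n).max(axis=1).mean()    # mean over draws of sup_h

# A singleton class has Rademacher complexity ~ 0 (since E[sigma_i] = 0);
# adding the negated hypothesis lets the sup pick the favorable sign.
h = np.ones((1, 50))
r_single = empirical_rademacher(h)
r_pair = empirical_rademacher(np.vstack([h, -h]))
print(r_single, r_pair)   # roughly 0, and a strictly larger value
```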
[00:25:31] And the population ramp loss upper bounds the population binary loss. [00:25:49] (Student: where is the sup?) Uh, sure — yes, but without the sup it's also true, I guess; with high probability this holds for the average as well. It's a technicality. [00:26:18] Okay, so now let's talk about the Rademacher complexity — and this is really why we care about the ramp loss: the Rademacher complexity of F relates to the Rademacher complexity of H in a pretty nice way. Here is the lemma that relates them; it's called Talagrand's lemma. [00:26:53] It says the following: suppose you have a function phi, a one-dimensional function, and it's a kappa-Lipschitz function. [00:27:08] All right, so we have kind of defined a Lipschitz function before.
[00:27:13] This really means that for any two numbers x and y, |phi(x) - phi(y)| <= kappa |x - y| — and here it's just the absolute value, because everything is one-dimensional. [00:27:25] Okay, and once you have this, you can look at the composition of this one-dimensional function with any hypothesis class. So phi composed with H is defined to be the class you get by applying phi on top of each h:

phi ∘ H = { z -> phi(h(z)) : h in H }.

That's the mapping. Here phi will be the loss function, basically, but the lemma is abstract: you can compose any function phi with the hypothesis class to get phi ∘ H. [00:28:07] And then what you get is that the Rademacher complexity of the composed hypothesis class is bounded by kappa times R_S(H):

R_S(phi ∘ H) <= kappa * R_S(H).

[00:28:22] So this is basically saying that if you compose anything on top of an existing hypothesis class, and what you compose with — the phi function — is Lipschitz, then you only blow up the complexity by a factor of kappa, the Lipschitz constant. [00:28:37] And with this you can probably see why we care about relaxing the binary loss: the indicator function is not Lipschitz, but if you use the ramp function it will be Lipschitz, and that's what we do next. [00:28:52] By the way, this lemma does not have a very simple proof; we're not going to prove it in the lecture. In my own opinion it's pretty novel and deep. I used to be able to prove it — I proved it once myself — but I think all the existing proofs I know are somewhat mysterious to me.
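The contraction inequality can be sanity-checked numerically on a small random class, composing with the 1-Lipschitz function phi(t) = clip(t, -1, 1) (a sketch; the class here is just a made-up random prediction matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

def rademacher_mc(outputs, sigma):
    """E_sigma[ sup_h (1/n) sum_i sigma_i h(z_i) ], averaged over given sign draws."""
    n = outputs.shape[1]
    return (sigma @ outputs.T / n).max(axis=1).mean()

H = rng.normal(size=(20, 40))       # 20 hypotheses evaluated on 40 points
phi_H = np.clip(H, -1.0, 1.0)       # phi(t) = clip(t, -1, 1) is 1-Lipschitz (kappa = 1)
sigma = rng.choice([-1.0, 1.0], size=(4000, 40))

r_H = rademacher_mc(H, sigma)
r_phi_H = rademacher_mc(phi_H, sigma)
print(r_phi_H <= 1.0 * r_H)   # True here: R_S(phi . H) <= kappa * R_S(H)
```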
[00:29:28] But the high-level intuition is reasonable: you have a hypothesis class, and you compose it with something that doesn't really introduce that much additional fluctuation, so you don't make the hypothesis class much more complicated. [00:29:49] But if you look at exactly what this formula is saying — you take the sup over h in H — the left-hand side is something like

E_sigma [ sup_{h in H} (1/n) sum_i sigma_i phi(h(z_i)) ],

and you want to show this is upper bounded by

kappa * E_sigma [ sup_{h in H} (1/n) sum_i sigma_i h(z_i) ].

That's the goal; that's what this lemma is saying. [00:30:30] And looking at it, you can kind of imagine why this is difficult to prove: you cannot really exchange the order of the expectation with the sup — if you do that, you make the inequality too loose — and somehow there's a phi sitting in the middle of this expression that's very hard to peel off. Anyway, this is just my personal comment about this lemma; it seems pretty deep to me. [00:30:58] Okay, anyway, we're going to use this, and I think it's probably somewhat obvious how we use it: we're going to take this phi function to be the ramp loss, l_gamma(t). [00:31:17] Because the ramp loss — let's go back here — is a Lipschitz function, and the Lipschitz constant depends on gamma: here the slope is zero, here it's completely flat, so the Lipschitzness just depends on the slope of the linear region, and the slope there is 1 over gamma, because this is gamma and this is one. So the Lipschitz constant of the ramp loss is 1/gamma.
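That slope claim can be checked numerically: the largest difference quotient of the ramp loss over a fine grid should be the slope of the linear region, 1/gamma (a sketch; the ramp loss is redefined to stay self-contained):

```python
import numpy as np

def ramp_loss(t, gamma):
    return np.clip(1.0 - np.asarray(t, dtype=float) / gamma, 0.0, 1.0)

gamma = 0.5
t = np.linspace(-2.0, 2.0, 2001)
# Difference quotients |l(t_{k+1}) - l(t_k)| / (t_{k+1} - t_k) over the grid.
quotients = np.abs(np.diff(ramp_loss(t, gamma))) / np.diff(t)
print(quotients.max())   # ~2.0, i.e. 1/gamma: the Lipschitz constant of the ramp loss
```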
[00:31:49] So kappa equals 1 over gamma. [00:31:58] And then — I guess let's take H' to be the class that maps (x, y) to y · h(x), where h is in H. H' is still not exactly the same as H, because it's y multiplied with h. And then you take F to be phi composed with H'. [00:32:32] Then by Talagrand's lemma, what you have is that the Rademacher complexity of F — which is what we care about — is less than 1/gamma, the Lipschitzness, times R_S(H'):

R_S(F) <= (1/gamma) * R_S(H').

[00:32:55] Okay, so we kind of got rid of the effect of the loss function by using Talagrand's lemma. And then you can relate H' to H much more easily, because what's the difference between H' and H? The only difference is that you have a sign flip, and Rademacher complexity is not very sensitive to sign flips. [00:33:18] This is just because — I guess we have done this before, at least implicitly in some other proof — the Rademacher complexity of H' is, by definition,

R_S(H') = E_sigma [ sup_{h in H} (1/n) sum_i sigma_i y_i h(x_i) ],

and now if you look at this, sigma_i y_i has the same distribution as sigma_i — either way you flip it, it's a uniformly random sign — so this equals

E_sigma [ sup_{h in H} (1/n) sum_i sigma_i h(x_i) ];

you can basically get rid of the y_i, and the right-hand side is the Rademacher complexity of H. [00:34:29] Okay, so with all of this, combining these two things, what we get is that R_S(F) <= (1/gamma) * R_S(H), and you can see the interesting thing: first of all the loss is gone, and second, y is also gone — you don't have any y's on the right-hand side anymore. So at the end, basically, the only thing that matters now is h(x).
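The sign-flip step — sigma_i y_i having the same distribution as sigma_i, so that R_S(H') = R_S(H) — can be seen numerically: absorbing y into the outputs gives exactly the same statistic as flipping the signs of sigma by y (a sketch with made-up outputs and labels):

```python
import numpy as np

rng = np.random.default_rng(2)

def rademacher_mc(outputs, sigma):
    n = outputs.shape[1]
    return (sigma @ outputs.T / n).max(axis=1).mean()

H = rng.normal(size=(15, 30))            # h(x_i) values for 15 hypotheses
y = rng.choice([-1.0, 1.0], size=30)     # fixed labels in {-1, +1}
sigma = rng.choice([-1.0, 1.0], size=(5000, 30))

# sum_i sigma_i (y_i h(x_i)) == sum_i (sigma_i y_i) h(x_i), and sigma * y is
# again a vector of uniform random signs, so the two computations agree.
r_Hprime = rademacher_mc(y * H, sigma)
r_flipped = rademacher_mc(H, sigma * y)
print(np.isclose(r_Hprime, r_flipped))   # True
```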
[00:35:04] Okay, so with this we can put all of these things together to finally get the bound on the binary test error. Recall that we assume perfect classification: we assume that y_i · h(x_i) > 0 for every i — this assumes a perfect fit on the training data. [00:35:35] And then you can take gamma_min to be

gamma_min = min_i  y_i · h(x_i).

[00:35:51] So this gamma_min is the empirical minimum margin for this data set. [00:36:06] Now — let me see; I think I have a typo here, sorry — let's just call this gamma; we use this gamma_min to define the gamma in the ramp loss. Then if you look at L-hat_gamma(h), what is this going to be? I claim it's going to be zero, because you have l_gamma(y_i h(x_i)), and y_i h(x_i) is always at least gamma, and recall that in this ramp loss, if you are bigger than gamma then you are zero. So basically every example has zero loss under the ramp loss. [00:36:55] For the training examples, the binary loss and the ramp loss are no different, because they are both zero. [00:37:04] And therefore you have the following sequence of inequalities. First you bound the zero-one loss of h by the ramp loss — this is because the ramp loss is always larger than the zero-one loss:

L_01(h) <= L_gamma(h).

And then you say that this is smaller than L-hat_gamma(h) plus the Rademacher complexity term — let's do it a little bit slowly: this Rademacher complexity is of F —

<= L_hat_gamma(h) + 2 R_S(F) + 3 sqrt( log(2/delta) / (2n) ),

and then you use the inequality between F and H, so you get R_S(H) over gamma:

<= L_hat_gamma(h) + (2/gamma) R_S(H) + 3 sqrt( log(2/delta) / (2n) ),

and then the L-hat_gamma(h) term is zero, as we claimed.
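Collecting the steps, the whole chain can be written out as follows (a summary sketch, using the Rademacher-bound constants quoted earlier in the lecture):

```latex
L_{01}(h)
  \;\le\; L_\gamma(h)
  \;\le\; \hat{L}_\gamma(h) + 2 R_S(\mathcal{F}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}
  \;\le\; \underbrace{\hat{L}_\gamma(h)}_{=\,0}
          + \frac{2\,R_S(\mathcal{H})}{\gamma} + 3\sqrt{\frac{\log(2/\delta)}{2n}}
  \;=\; O\!\left(\frac{R_S(\mathcal{H})}{\gamma} + \sqrt{\frac{\log(2/\delta)}{n}}\right).
```

The first inequality is the pointwise dominance of the ramp loss over the 0-1 loss, the second is the uniform-convergence bound, and the third is Talagrand's lemma plus the sign-flip argument.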
this is the empirical run Plus [00:38:16] the empirical run Plus for the [00:38:18] for the um for the training data so this becomes [00:38:20] um for the training data so this becomes zero so you got just a big O of RS of H [00:38:25] zero so you got just a big O of RS of H over [00:38:28] over um gamma right so comma [00:38:35] so [00:38:37] so um [00:38:39] there is a [00:38:40] there is a then the caveat with this with this [00:38:43] then the caveat with this with this inequality I'm not sure whether any of [00:38:45] inequality I'm not sure whether any of you have noticed that [00:38:47] you have noticed that um but if you have noticed that maybe [00:38:49] um but if you have noticed that maybe hold down for a second let's first uh [00:38:51] hold down for a second let's first uh some water interpret [00:38:53] some water interpret uh [00:38:56] uh okay maybe let's just explain the salary [00:38:58] okay maybe let's just explain the salary so so the problem so there was the [00:39:00] so so the problem so there was the caveat here there's actually there's [00:39:01] caveat here there's actually there's actually a mistake in some sense like no [00:39:03] actually a mistake in some sense like no not a serious one but there is a issue [00:39:06] not a serious one but there is a issue with this you know uh with this [00:39:08] with this you know uh with this derivation [00:39:10] derivation the reason is that [00:39:13] the reason is that what is the definition of gamma [00:39:16] what is the definition of gamma right so here the definition of gamma [00:39:18] right so here the definition of gamma depends on the data [00:39:19] depends on the data and then you just mess up all the the [00:39:22] and then you just mess up all the the independence all the [00:39:24] independence all the like like her so if gamma like when you [00:39:27] like like her so if gamma like when you do when we do all of these things right [00:39:29] do when we do all of these things right 
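As a quick sanity check of this step — purely illustrative, with made-up data, a made-up classifier h(x) = x, and a helper ramp_loss that is not from the lecture — the empirical ramp loss at gamma = gamma_min is indeed zero whenever the data is perfectly classified:

```python
import random

def ramp_loss(margin, gamma):
    # Ramp loss: 1 if margin <= 0, 0 if margin >= gamma, linear in between.
    if margin <= 0:
        return 1.0
    if margin >= gamma:
        return 0.0
    return 1.0 - margin / gamma

random.seed(0)
# Toy 1-D data that the classifier h(x) = x fits perfectly: y = sign(x).
xs = [random.uniform(-1, 1) for _ in range(50)]
xs = [x for x in xs if abs(x) > 1e-3]        # avoid zero margins
ys = [1.0 if x > 0 else -1.0 for x in xs]

margins = [y * x for x, y in zip(xs, ys)]    # y_i * h(x_i), all positive here
gamma_min = min(margins)                     # empirical minimum margin

# Every margin is >= gamma_min, so every ramp loss at gamma_min is zero.
emp_ramp = sum(ramp_loss(m, gamma_min) for m in margins) / len(margins)
print(gamma_min > 0, emp_ramp)               # True 0.0
```
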
[00:39:32] When we did all of this before, gamma was a constant: you fix gamma first, then you draw your data points, then you compute the Rademacher complexity, and so forth. But here we take gamma to be something that depends on the data, and that breaks the whole Rademacher complexity machinery, because in that machinery you cannot let your loss function — your function class — depend on the data. [00:40:00] In the theory, the function class F cannot depend on the data. With uniform convergence, the final classifier h-hat can depend on the data — that is exactly the benefit of uniform convergence — but the function class F cannot. [00:40:22] So that's a small caveat, and not a very big deal, but if you choose gamma to be something that depends on the data, then your function class depends on the data, and you break the argument. [00:40:38] I'm not going to deal with this very formally; for full mathematical rigor you of course cannot do this, but it is relatively easy to fix. The fix is to take another union bound over the choice of gamma. For now we chose gamma to be the minimum margin, which depends on the data; what you should do instead is prove the bound for every gamma. If you can prove the chain of inequalities up to the last step uniformly over every gamma, then in the last step you can plug in whichever gamma you want, because at that point you are already done with the Rademacher complexity. [00:41:44] And the way to do it is actually relatively easy. Roughly speaking, you treat gamma as a single number, and uniform convergence over a single scalar parameter is always easy — here it is even easier because we do not care about multiplicative constants. Suppose you have a bound B on the largest possible gamma. Then you discretize (0, B] into buckets: one bucket is [B/2, B], the next is [B/4, B/2], and so on, and you prove the bound for every endpoint of this discretization. Within a bucket nothing really changes: the only difference [00:42:34] between two numbers in the same bucket is a factor of two, so at most you lose a factor of two. [00:42:40] So you only have to show the bound for the boundary points of the buckets — and how many are there? Only about log B points, and you can union-bound over all of them. Technically, if you are careful, you can even get a log log B dependency. [00:43:02] Anyway, that is the rough idea of the last uniform convergence step. Because it is relatively easy, most papers do not actually carry it out, just for simplicity — of course they then state the theorem in a slightly different way so that it is still correct; they just skip this
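The bucketing step can be sketched in a few lines (the grid and the numbers are illustrative, not from the lecture): round gamma down to the nearest grid point B / 2^k; the result is within a factor of two of gamma, so plugging it into a bound of the form R_S(H)/gamma costs at most a factor of two, and only about log2(B/gamma) grid points ever need to be union-bounded over.

```python
import math

def bucket_boundary(gamma, B):
    # Round gamma down to the nearest grid point B / 2^k of the geometric
    # buckets [B/2, B], [B/4, B/2], ...  Using the grid point in place of
    # gamma inside R_S(H)/gamma costs at most a factor of 2.
    assert 0 < gamma <= B
    k = math.ceil(math.log2(B / gamma))
    return B / 2 ** k

B = 8.0
for gamma in (0.3, 1.0, 5.7, 8.0):
    g = bucket_boundary(gamma, B)
    assert g <= gamma <= 2 * g   # within a factor of two of the true gamma
```
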
very last step. [00:43:34] That's also what I'm going to do — I'm not going to prove a fully rigorous theorem with you. But if you really wanted to, the theorem statement would look like the following: with probability at least 1 - delta, for every gamma in (0, gamma_max] and every h, L_gamma(h) <= L-hat_gamma(h) + O(R_S(H)/gamma) + O(sqrt(log(1/delta)/n)) + something like O(sqrt(log(gamma_max/gamma)/n)). I guess gamma_max should be larger than one, just to be safe. [00:44:39] And there is a corollary: for the hypothesis you care about, L_01(h) <= O(R_S(H)/gamma + sqrt(log(1/delta)/n)). [00:45:07] Here this gamma is the empirical one — maybe I should call it gamma_min; I think I somehow have slightly inconsistent notation here, sorry. So gamma_min = min_i y_i * h(x_i). [00:45:52] [Student question, partly inaudible.] I think the question is why you don't take gamma_max to be really small. First of all, it is not clear you can always prove that the final gamma you get on the empirical data is small — and actually you want gamma to be big: you want to fit the data with a bigger margin so that your generalization bound is smaller. So you do want gamma somewhat large; at least that is the interesting regime. The
very-small-gamma regime is probably not the most interesting one, because then the right-hand side of your bound is very big. Actually — oh sorry, my bad, there is a third term in this bound as well; let me fix that first. [00:46:55] But suppose your gamma really is very small: then you probably do not even need the third term, because the first term is already very big, and it is what governs your generalization bound. So you do care about somewhat large gamma. [00:47:10] But there is still the question of what happens if all the scales are very small. I think there are some small things here — for example, if everything, all the numbers, are extremely small, you can probably make this one term tighter in some way. [00:48:10] [Student: "This one?"] Oh, this is log gamma_max — and the same thing here. [00:48:28] I do not recommend spending too much time on these small technicalities; the most important thing is the first term, so let's turn to the interpretation, which matters more. [00:48:41] Okay, so the first term is R_S(H)/gamma, where gamma is gamma_min, the empirical minimum margin of the entire data set: gamma_min = min_i y_i * h(x_i). This says that if you are very confident on all the training examples, you get better generalization — the bound is smaller. Note that gamma_min is a minimum, so as long as even one training example has an h-value very close to zero, your generalization bound gets quite a bit worse. You want all the examples to be far away from zero — by a constant, in some sense. On the other hand, you want the numerator, the Rademacher complexity, to be as small as possible: you want your classifier class to be less complex. [00:50:03] There is another thing to check here: the scaling matches. We have talked about the fact that Rademacher complexity depends on the [00:50:14] scale of your functions: if you multiply all your functions by one half, the Rademacher complexity is reduced by one half. And you can see that this bound makes sense because you cannot cheat that way. Suppose you take H' = { h/100 : h in H }, every function divided by, say, 100. Then the Rademacher complexity of H' is indeed divided by 100, but gamma_min is also divided by 100.
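To see the scale invariance numerically, here is a toy sketch (all data, seeds, and the classifier w are invented). It uses the closed form derived later in the lecture for the linear class: for ||w||_2 <= B, the sup of <w, v> is B * ||v||_2, so the empirical Rademacher complexity is (B/n) * E_sigma ||sum_i sigma_i x_i||_2. Dividing the class (i.e., B) and hence all margins by 100 leaves the ratio R-hat_S(H) / gamma_min unchanged:

```python
import math
import random

def rad_linear(xs, B, trials=2000, seed=1):
    # Monte Carlo estimate of the empirical Rademacher complexity of
    # {x -> <w, x> : ||w||_2 <= B}; for each draw of signs, the sup over w
    # equals (B/n) * ||sum_i sigma_i x_i||_2 by Cauchy-Schwarz.
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        s = [0.0] * d
        for x in xs:
            sgn = 1.0 if rng.random() < 0.5 else -1.0
            for j in range(d):
                s[j] += sgn * x[j]
        total += B / n * math.sqrt(sum(v * v for v in s))
    return total / trials

rng = random.Random(0)
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
w = [0.5, -1.0, 2.0]                                  # an arbitrary classifier
margins = [abs(sum(a * b for a, b in zip(w, x))) for x in xs]  # y_i h(x_i) with y = sign(h)
gamma_min = min(margins)

c = 0.01                                              # shrink everything by 100x
ratio = rad_linear(xs, B=1.0) / gamma_min
ratio_scaled = rad_linear(xs, B=1.0 * c) / (gamma_min * c)
assert abs(ratio - ratio_scaled) / ratio < 1e-9       # ratio unchanged by rescaling
```
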
[00:51:02] So that's why you cannot beat these bounds by a trivial rescaling of your hypothesis class. It also shows that something like this factor has to show up here: if you only had R_S(H), without the 1/gamma, the bound could not be right, because it would not be invariant to scaling. [00:51:35] Okay, so this basically concludes our treatment of the loss. The take-home message is that you care about two quantities: one is the margin, and the other is the Rademacher complexity. Now let's bound the Rademacher complexity for linear models. [00:52:08] The plan is: I'll do linear models today; next lecture will be a first, rough overview of deep learning; and after that we'll come back and talk about the Rademacher complexity of non-linear models in general. That's the high-level plan. [00:52:31] For linear models, here is a theorem. Suppose you have the hypothesis class H of functions mapping x to w^T x, where w is the parameter and we require ||w||_2 <= B. Also assume the data distribution satisfies E[ ||x||_2^2 ] <= C^2 — the expected squared L2 norm is bounded by C^2. Under these two assumptions you can bound the Rademacher complexity: the empirical Rademacher complexity satisfies R-hat_S(H) <= (B/n) * sqrt(sum_i ||x_i||_2^2). [00:53:38] I guess from this form the scaling is not immediately apparent, but the averaged version is easier to interpret: the average Rademacher complexity is bounded by B * C / sqrt(n). [00:53:55] So first, you get the 1/sqrt(n) dependency, which is very typical for Rademacher complexity bounds. Second, in the numerator you get B, the bound on the L2 norm of the parameter, and C, which measures how large the data is — the norm of the data points. Both of these have to come into play because, again, Rademacher complexity is sensitive to scale, so all the scaling factors must be there; otherwise you could cheat. For example, if C did not appear, the bound could not be true, because you could scale x arbitrarily to make the Rademacher complexity arbitrarily big. So you have to get all the scalings right.
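Here is a quick Monte Carlo check of the empirical bound on invented Gaussian data (everything in the snippet is illustrative). It estimates R-hat_S(H) = (B/n) * E_sigma ||sum_i sigma_i x_i||_2, using the closed form for the sup that the proof below derives, and compares it against (B/n) * sqrt(sum_i ||x_i||^2):

```python
import math
import random

rng = random.Random(0)
n, d, B = 40, 5, 2.0
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Monte Carlo estimate of (B/n) * E_sigma || sum_i sigma_i x_i ||_2,
# using sup_{||w||_2 <= B} <w, v> = B * ||v||_2.
trials, acc = 5000, 0.0
for _ in range(trials):
    s = [0.0] * d
    for x in xs:
        sgn = 1.0 if rng.random() < 0.5 else -1.0
        for j in range(d):
            s[j] += sgn * x[j]
    acc += math.sqrt(sum(v * v for v in s))
rad_hat = B / n * acc / trials

# The theorem's bound: (B/n) * sqrt(sum_i ||x_i||^2).
bound = B / n * math.sqrt(sum(sum(v * v for v in x) for x in xs))
assert rad_hat <= bound   # holds by Jensen's inequality E[V] <= sqrt(E[V^2])
```
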
[00:54:54] So that's the first result we're going to show about linear models; we will have some other theorems about linear models under other constraints, and then we'll compare them with each other and with the previous bound. But let's first prove the theorem. [00:55:07] The proof also demonstrates how you generally bound a Rademacher complexity with this kind of somewhat analytical approach. We start from the definition of the empirical Rademacher complexity: you draw the sigmas and look at the sup of the sum, R-hat_S(H) = E_sigma[ sup (1/n) sum_i sigma_i w^T x_i ], where we write w^T x_i because that is the model output, and the sup is over w with the constraint ||w||_2 <= B. [00:55:52] Now let's do some derivations. First we want to solve this sup, and to do that I want to understand what this expression is. We realize it is actually a linear function of w: you can write it as the inner product of w with sum_i sigma_i x_i, just by pulling the linearity out front. [00:56:33] And now, what is the sup of this? This is easy, because I guess you know what sup <w, v> is for some vector v, where the sup is over ||w||_2 <= B: basically you want the w with maximum correlation with v, subject to a norm constraint. There are multiple ways to do this; for example, you can use Cauchy-Schwarz: <w, v> <= ||w||_2 * ||v||_2 <= B * ||v||_2, and this can be attained [00:57:28] by choosing w appropriately. So the answer is that the sup equals B times the L2 norm of v; to attain equality you choose w in the same direction as v, so that the Cauchy-Schwarz inequality is tight, and you get exactly that number. I think this is one of the Homework 0 questions — it's a warm-up. [00:58:00] And if you apply this here, you get B times the norm of the vector, where the v corresponds to sum_i sigma_i x_i. So you get rid of the sup — that is the big step, because the sup is the hard thing to deal with. Now you have the norm of a random variable, and this random variable is a sum of random variables.
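A tiny numerical illustration of this sup (the vectors are chosen arbitrarily): the maximizer is w = B * v / ||v||_2, which attains B * ||v||_2, and no direction in the ball does better:

```python
import math
import random

def random_search_sup(v, B, trials=10000, seed=0):
    # Try random directions w on the sphere ||w||_2 = B and record the best
    # inner product <w, v>; by Cauchy-Schwarz none can exceed B * ||v||_2.
    rng = random.Random(seed)
    best = -float("inf")
    for _ in range(trials):
        w = [rng.gauss(0, 1) for _ in v]
        norm = math.sqrt(sum(c * c for c in w))
        best = max(best, sum(B * a / norm * b for a, b in zip(w, v)))
    return best

v, B = [3.0, -4.0], 2.0
closed_form = B * math.sqrt(sum(c * c for c in v))     # B * ||v||_2 = 10
w_star = [B * c / 5.0 for c in v]                      # maximizer: B * v / ||v||_2
attained = sum(a * b for a, b in zip(w_star, v))
assert abs(attained - closed_form) < 1e-9              # equality is attained
assert random_search_sup(v, B) <= closed_form + 1e-9   # and never exceeded
```
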
[00:58:29] Note that here we are talking about the empirical Rademacher complexity, so the only randomness comes from σ, not from x. But still this is a random variable; it's a random mixing of these x_i's. And how do we deal with this? We are going to use Cauchy-Schwarz again. I guess for preparation, let's first move the B and the 1/n out front, so you get ||Σ_i σ_i x_i||_2 inside, and you say that this is at most the square root of the expectation of the square:

E[ ||Σ_i σ_i x_i||_2 ] ≤ √( E[ ||Σ_i σ_i x_i||_2² ] ).

This is just because E[V] ≤ √(E[V²]) for any random variable V. The nice thing about the square, and I think we have seen this kind of calculation more than once in other cases as well, is that you can expand it. So you can just expand what's inside this expectation, and you get

E[ ||Σ_i σ_i x_i||² ] = E[ Σ_i σ_i² ||x_i||² + Σ_{i≠j} σ_i σ_j ⟨x_i, x_j⟩ ].

This is just the expansion. And then another thing: in the second sum you have i not equal to j, and because i is not equal to j, the expectation of σ_i σ_j is zero. They are independent variables, so this is equal to E[σ_i] · E[σ_j], which is equal to zero. So that term is gone, and what we have is (B/n) times the square root of E[ Σ_i σ_i² ||x_i||² ]. And σ_i² is actually 1, because σ_i is a Rademacher random variable, so you get Σ_i ||x_i||_2². And the ||x_i||'s come out of the expectation, because the expectation is over σ; it's always over σ, and x_i is not a function of σ, so effectively there is no expectation left.
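The vanishing of the cross terms can be checked exactly by enumerating all sign patterns (again my own illustration, assuming NumPy): averaging ||Σ_i σ_i x_i||² over all 2^n choices of σ gives exactly Σ_i ||x_i||².

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4
X = rng.standard_normal((n, d))  # rows are the fixed vectors x_1, ..., x_n

# Average ||sum_i sigma_i x_i||^2 over all 2^n sign patterns. The cross terms
# <x_i, x_j> for i != j cancel because the average of sigma_i * sigma_j is
# zero, leaving only the diagonal terms sigma_i^2 * ||x_i||^2 = ||x_i||^2.
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    total += np.linalg.norm(np.array(signs) @ X) ** 2
lhs = total / 2 ** n

rhs = np.sum(np.linalg.norm(X, axis=1) ** 2)
assert np.isclose(lhs, rhs)
```

With n = 6 the enumeration is only 64 patterns, so the identity is verified exactly rather than by sampling.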
[01:01:15] So we just have this, and this is our desired bound; our bound was exactly this:

R_S(H) ≤ (B/n) · √( Σ_i ||x_i||_2² ).

So here it appears that you only get a 1/n, but actually in this sum you also have something of order n; the sum grows as n goes to infinity. So when you balance them, you actually get a 1/√n dependency, which you will see more explicitly if you do the average. If you average over X (recall that the average Rademacher complexity is the average of the empirical Rademacher complexity over the randomness of the dataset), then you get (B/n) times the expectation, over the randomness of S, where S is the set of the x_i's, of this square root. So now you get into exactly the situation where you have a square root inside an expectation, which is not very convenient, so you raise it to a higher power, using the same E[V] ≤ √(E[V²]) trick again:

(B/n) · E_S[ √( Σ_i ||x^(i)||_2² ) ] ≤ (B/n) · √( E_S[ Σ_i ||x^(i)||_2² ] ).

Sorry, I should use x superscript (i) for the data points here. Now each of the x^(i)'s has the same distribution, and we also assume, I think this is our assumption about C, that E[ ||x^(i)||_2² ] = C². So the expectation of the sum is n·C², and you take the square root: you get (B/n) · √(n C²) = B·C/√n. Okay, sounds good? Any questions?

[01:04:08] Yes, so this is a great question. The question is: what if you don't use Cauchy-Schwarz, but use the triangle inequality instead? I think that's actually a very good question; let's actually try to do it a little bit. So, by the way, where do you want to use it? For example, we are here.
[01:04:46] Sorry, which one? Yeah, from here to here, basically. Yes, so that's a very good question. So if you don't do this, if you say, let me use a different color to indicate this, if you do the triangle inequality, you're going to bound it by (B/n) · E[ Σ_i ||x_i||_2 ]. So then, let's say you also take the expectation over the x_i's, or maybe we don't even have to do that; let's say this is (B/n) · Σ_i E[ ||x_i||_2 ]. And you can see what happens here: you have n terms, and each of these terms is on some constant scale, let's say. So basically the sum will be on the order of n, and then you cancel it with the 1/n: you just get B. So basically, at the end of the day, you don't have any dependency on n anymore. So that just doesn't work, because we do want a dependency,
[01:05:55] something like 1/√n, something that goes to zero as n goes to infinity. And the reason why this is a loose inequality: if you look at this, it is a sum of things that can cancel each other, because the σ_i's flip signs. And if you do the triangle inequality on top, you are basically assuming all of these vectors point in the same direction. But even with the flips, it's possible that all the x_i's point in exactly the same direction; let's say that's the case. Still, with the sign flips they cancel each other: one of them goes in this direction, another is in the opposite direction, so you have the cancellation. And that's why Cauchy-Schwarz, the inequality we used, is tighter here; that's exactly the point.
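This cancellation is easy to see numerically (a sketch of my own, not from the lecture; assumes NumPy). With unit-norm x_i, the triangle-inequality route gives (1/n) Σ_i ||x_i|| = 1, constant in n, while the quantity we actually care about, (1/n) E_σ ||Σ_i σ_i x_i||_2, decays like 1/√n, matching the theorem's B·C/√n rate with B = C = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

for n in [25, 100, 400]:
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # force ||x_i||_2 = 1

    # Triangle inequality: (1/n) * sum_i ||x_i||_2 = 1, no decay in n.
    triangle_bound = np.sum(np.linalg.norm(X, axis=1)) / n

    # With cancellation: Monte Carlo estimate of (1/n) E_sigma ||sum_i sigma_i x_i||_2.
    sigmas = rng.choice([-1.0, 1.0], size=(5_000, n))
    with_cancel = np.mean(np.linalg.norm(sigmas @ X, axis=1)) / n

    assert np.isclose(triangle_bound, 1.0)
    # Jensen guarantees (1/n) * sqrt(sum_i ||x_i||^2) = 1/sqrt(n) as an upper bound.
    assert with_cancel <= 1.0 / np.sqrt(n)
```

As n grows, `with_cancel` shrinks like 1/√n while `triangle_bound` stays pinned at 1.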
Right, because you have to use the cancellation between the σ's and the x's; if you don't use it, or actually if you don't use it strongly enough in some sense, you wouldn't end up with a good bound eventually. Right, exactly, exactly. Okay, cool.

[01:07:24] The next thing will be another theorem, and this theorem will still deal with linear models, but with a different norm measurement on the parameter, and you will see a different bound. This is one of the things I used to motivate Rademacher complexity: I was saying that you can get more precise dependencies on which norms you want to use, for example. So suppose we have a different H. It's still the linear model, but the constraint on the parameter now becomes the L1 norm:

H = { x ↦ w^T x : ||w||_1 ≤ B }.

So this is L1.
[01:08:16] And we'll assume that now the infinity norm of x_i is at most C, ||x_i||_∞ ≤ C, for all i. And also let's specify the dimension: x_i ∈ R^d. Actually, this is an interesting point: in the previous theorem we didn't even specify the dimension of x, because it doesn't show up in the bound. It has to be a vector in some dimension, of course, but it doesn't matter what the dimensionality is; you can actually apply that theorem even to infinite-dimensional vectors, as long as the norm of x is bounded by C. But the next theorem will depend on the dimension, and the dimension is d. And then: the empirical Rademacher complexity is at most

R_S(H) ≤ B · C · √( 2 log(2d) / n ).

Right. And you can see now the one-norm starts to matter. So let's say you ignore the log factor: basically it's still something like B times C, but with a different measurement. The definitions of B and C are different now: B is the one-norm bound on w, and C is the infinity-norm bound on x. We'll compare these two theorems after we prove this one.

[01:09:59] Let me see how much time I have. I think I do have time to prove it and then compare. So, the proof. It won't be complete, in the sense that I have to invoke a lemma, which will actually be proved by you in the homework. But let's do most of the stuff. The definition is the same thing:

R_S(H) = E_σ[ sup_{||w||_1 ≤ B} (1/n) Σ_i σ_i w^T x_i ].

Again, you can view this as w times some v, where v is (1/n) Σ_i σ_i x_i; what I'm writing here is (1/n) Σ_i σ_i w^T x_i. So we are doing the same decomposition, but now you are taking the sup over the one-norm ball.
[01:11:10] And you know that if you take the sup over the one-norm ball, sup over ||w||_1 ≤ B of w^T v, this is also relatively easy to prove: it is equal to B times the infinity norm of v, B · ||v||_∞. So that's how we eliminate w: we just get the infinity norm of v. Let me see what's going on here; okay, I see. Right, so this is going to be equal to (B/n) · E_σ[ ||Σ_i σ_i x_i||_∞ ]. However, now we've got a problem: we have this infinity norm. And how do we proceed? You can, for example, use the triangle inequality, but then again we don't use the cancellation; even if you swap the sum with the infinity norm, you don't have the cancellation. So how do we deal with this?
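The l1 analogue of the earlier Cauchy-Schwarz step can also be checked numerically (my own sketch, not from the lecture; assumes NumPy): the sup of ⟨w, v⟩ over ||w||_1 ≤ B is B · ||v||_∞, attained by spending the whole l1 budget on the coordinate where |v_i| is largest, with a matching sign.

```python
import numpy as np

rng = np.random.default_rng(4)
B = 1.5
v = rng.standard_normal(6)

# Put the entire l1 budget B on the largest-|v_i| coordinate, sign-matched,
# so that <w*, v> = B * ||v||_inf.
i = np.argmax(np.abs(v))
w_star = np.zeros_like(v)
w_star[i] = B * np.sign(v[i])
sup_value = B * np.max(np.abs(v))
assert np.isclose(w_star @ v, sup_value)

# Random feasible points (||w||_1 <= B) never beat the claimed supremum.
for _ in range(10_000):
    w = rng.standard_normal(6)
    w *= B * rng.random() / np.sum(np.abs(w))  # rescale so ||w||_1 <= B
    assert w @ v <= sup_value + 1e-9
```

Note that the maximizer here is a (scaled) signed basis vector, which is exactly the vertex picture developed next.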
[01:13:01] In some sense the infinity norm is something different, where you cannot use the same analytical tool. So what I'm going to do is a somewhat different approach. So from here, from this inequality, this is a digression in some sense: what you do is look at what this sup really means. This is equal to (B/n) times the sup over w such that... sorry, I think I got the wrong version of the notes; that's why I was a little bit surprised by them, because I think in the newer version it shouldn't be like this. But anyway. So the first thing you can do is normalize it to one: you write w = B · w̄, so that ||w̄||_1 ≤ 1, and you get w^T v = B · (w̄^T v).
That's easy. [01:14:26] And then what you can say is: if you think about what maximizes this inner product among all the L1-norm-bounded vectors, actually the only thing you care about is the vertices. So the sup is literally attained with w̄ in {±e_1, ±e_2, and so forth, up to ±e_d}. This is my claim, and the reason for this claim is just that, you know, if you look at Σ_i w̄_i v_i, where i indexes the coordinates, basically you know that

sup_{||w̄||_1 ≤ 1} w̄^T v = ||v||_∞,

and what you care about are the extremal points: in which case do you achieve this equality? And it turns out that the way you achieve this equality is that you basically want to take w̄_i to be 1,
[01:15:48] for the i such that |v_i| is the largest, that is, |v_i| = max over j in [d] of |v_j|. I'm not sure whether this... this probably requires a little bit of thinking offline. But at least you can verify that in this case, if you choose w̄_i to be 1 for this i, and w̄_j to be 0 for all the other j's, then what you get is that Σ_j w̄_j v_j equals just v_i, and |v_i| equals ||v||_∞, because v_i is the largest one. So if you choose this w̄_i, with all the other w̄_j's zero, you get the infinity norm of v; that's the |v_i| in absolute value. Right, and also, if v_i has the other sign, if it's negative, you can flip w̄_i to be either +1 or −1. Does that make sense?
it's it's relatively [01:17:22] uh yeah it's not it's it's relatively easy so it's just a [01:17:24] easy so it's just a uh it will probably requires a little [01:17:26] uh it will probably requires a little bit of thinking offline as well so so [01:17:28] bit of thinking offline as well so so but the basically is that you know when [01:17:30] but the basically is that you know when you do this kind of like a maximization [01:17:32] you do this kind of like a maximization over the Simplex you always cut the [01:17:34] over the Simplex you always cut the vertex right your max the extreme point [01:17:36] vertex right your max the extreme point is always the vertex that's another way [01:17:37] is always the vertex that's another way to think about it and the vertex are [01:17:39] to think about it and the vertex are those you know uh natural bases [01:17:43] those you know uh natural bases and then we basically got into a so what [01:17:46] and then we basically got into a so what happens is that this is a final hypothe [01:17:48] happens is that this is a final hypothe class [01:17:49] class no [01:17:54] what does that mean so basically you can [01:17:56] what does that mean so basically you can think of your hypothesis class H bar [01:18:00] think of your hypothesis class H bar is something that X maps to W bar [01:18:03] is something that X maps to W bar transpose X where W bar is only [01:18:06] transpose X where W bar is only inside this family of [01:18:09] inside this family of plus minus E1 up to plus minus E [01:18:15] right so so you don't have all the [01:18:18] right so so you don't have all the um you don't have all the linear [01:18:21] um you don't have all the linear classifiers [01:18:22] classifiers anymore you just have 2D linear [01:18:25] anymore you just have 2D linear classifiers now [01:18:26] classifiers now right so so basically this thing is just [01:18:29] right so so basically this thing is just equals to the [01:18:32] equals to the 
[01:18:35] So basically, if you put the B outside, you get B times E_σ[ max over w̄ in {±e_1, ..., ±e_d} of (1/n) Σ_i σ_i w̄^T x_i ], and this is just equal to B times the Rademacher complexity of this hypothesis class H̄. Okay. And we have a claim, which we made in one of the very early parts of the lecture, for finite hypothesis classes; you can bound it. Let's go back to the earlier notes: this is the lemma we saw, that the Rademacher complexity is bounded in terms of the log of the hypothesis class size, something like

R_S(F) ≤ √( 2 M² log|F| / n ),

where M² is the largest possible average squared value you can output from the hypothesis class. So let's compute what M² is here. The size of the class is in particular 2d, so log|F| = log(2d). And what is the corresponding M?
[01:19:57] So we can say that for every w̄ in {±e_1, ..., ±e_d}, you look at |w̄^T x_i|: this is bounded by the L1 norm of w̄ times the L-infinity norm of x_i, which is bounded by 1 · C, where C is the infinity-norm bound on the x_i's. And that means, for the thing we have to verify in the lemma, we have to verify that the average of the squares of the outputs, (1/n) Σ_i (w̄^T x_i)², is less than M². Because each of the terms is at most C², we can just verify that this sum of squares is at most (1/n) · n·C², which equals C². So basically the corresponding M² will just be C². And that's why

R_S(H̄) ≤ √( 2 C² log|H̄| / n ) = √( 2 C² log(2d) / n ) = C · √( 2 log(2d) / n ).
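A quick Monte Carlo sanity check of this lemma application (my own sketch, not from the lecture; assumes NumPy, and uses uniform random vectors as stand-in data with ||x_i||_∞ ≤ 1): the empirical Rademacher complexity of the 2d-element class H̄, which equals (1/n) E_σ ||Σ_i σ_i x_i||_∞, should sit below C·√(2 log(2d)/n).

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 400, 50
X = rng.uniform(-1.0, 1.0, size=(n, d))  # rows x_i with ||x_i||_inf <= 1
C = np.max(np.abs(X))                    # the infinity-norm bound on the data

# R_S(H_bar) = (1/n) E_sigma || sum_i sigma_i x_i ||_inf for the class
# {+/- e_1, ..., +/- e_d}; estimate the expectation by Monte Carlo over sigma.
sigmas = rng.choice([-1.0, 1.0], size=(20_000, n))
rad_bar = np.mean(np.max(np.abs(sigmas @ X), axis=1)) / n

# The finite-class (Massart) lemma with |H_bar| = 2d and M^2 = C^2.
bound = C * np.sqrt(2 * np.log(2 * d) / n)
assert rad_bar <= bound
```

The estimate typically lands well below the bound, since the lemma's union-bound step is not tight for this data distribution.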
recall that we have a b here which [01:21:26] now recall that we have a b here which we got so uh so RS H is less than b [01:21:31] we got so uh so RS H is less than b times r h bar which is [01:21:35] times r h bar which is uh equals to Which is less than [01:21:40] uh equals to Which is less than B times C Times Square Root 2 log B over [01:21:44] B times C Times Square Root 2 log B over square root [01:21:55] okay any questions [01:21:58] okay any questions foreign [01:22:01] so yeah I think we're about time so I [01:22:05] so yeah I think we're about time so I guess next lecture at the beginning I [01:22:07] guess next lecture at the beginning I would discuss [01:22:08] would discuss how do you compare or how do you [01:22:10] how do you compare or how do you interpret these two theorems right so so [01:22:13] interpret these two theorems right so so these two theorems have their strength [01:22:14] these two theorems have their strength on different cases depending on what [01:22:16] on different cases depending on what kind of like W's you are what kind of [01:22:19] kind of like W's you are what kind of data you have and what kind of WS you [01:22:21] data you have and what kind of WS you can fit from the data uh I'll do that in [01:22:24] can fit from the data uh I'll do that in the next lecture [01:22:29] okay sounds good [01:22:31] okay sounds good um I guess that's all for today see you [01:22:34] um I guess that's all for today see you next Monday ================================================================================ LECTURE 007 ================================================================================ Stanford CS229M - Lecture 7: Challenges in DL theory, generalization bounds for neural nets Source: https://www.youtube.com/watch?v=kVkMRDZ5fcU --- Transcript [00:00:05] okay I guess let's get started [00:00:08] okay I guess let's get started so uh in this lecture what we're going [00:00:11] so uh in this lecture what we're going to do is 
that at the beginning we're going to talk about deep learning, especially some of the challenges in deep learning theory. [00:00:23] Then, in the next probably five to ten lectures, we're going to discuss different aspects of deep learning; you'll see things like optimization, generalization, and so forth. So basically, in deep learning theory there are different aspects: for example optimization, which we'll spend probably two lectures on later, and generalization is another question, which we'll probably talk about for more than three lectures. And at the end of the course we're going to talk about some other topics. So in some sense you can view this as an outline for the next five weeks. [00:01:06] So, to talk about deep learning theory, I think it's probably useful to somewhat
kind of summarize the classical machine learning theory, which I actually didn't really talk about that much, even as a bird's-eye view, at the beginning of the course, because I felt that too much information at the beginning would probably be a little too much. But now I'm going to give a kind of higher-level view of what classical machine learning theory does, in terms of its different aspects, or different topics. So, in the more classical version of the theory, there are several things. One is called approximation theory. [00:01:57] Another keyword here is expressivity, or representational power; if you see these kinds of terms, you know that they are
all about the same thing. What they are really about is that, basically, you want to bound L(θ*), the loss of the best model in your family. So far, until this week, we always talked about the excess risk: we compared with the best model in the class, and we said that if you can get to the best model in the class then you are done. But actually you're not done, because maybe you are using the wrong hypothesis class, so the best hypothesis in the hypothesis class is probably not great. Approximation theory is basically trying to deal with this: you are trying to understand whether your hypothesis class is powerful enough to express the functions you care about. [00:03:08] So, for example, to sketch a case: suppose you have some data like this
and, right, something like this: some positive data, and some negative data here. You know that if you use a linear model, then even the best linear model is not going to do great, because if you find the best linear model, you would probably end up with something like this. So in this case you can say that L(θ*) wouldn't be great if you choose your capital Θ to be the linear family. And then you can study what hypothesis class can contain a good classifier, even when you have access to population data, and so forth. So in some sense this is trying to understand how well your hypothesis class H approximates the ground-truth label function. [00:04:00] So that's one type of question.
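To make the linear-model failure above concrete, here is a small numerical sketch (my own illustration, not from the lecture): on XOR-style data the best linear predictor is stuck at squared loss 0.25, while adding a single product feature drives the best-in-class loss, i.e. L(θ*), to zero.

```python
import numpy as np

# XOR-style data: the canonical example a linear model cannot express.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def best_fit_mse(features, y):
    """Training MSE of the best least-squares linear fit (with bias term)."""
    A = np.hstack([np.ones((len(features), 1)), features])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ theta - y) ** 2)

# The best linear model predicts 0.5 everywhere: L(theta*) stays at 0.25.
print(best_fit_mse(X, y))          # ≈ 0.25

# Enrich the class with one product feature x1*x2: now L(theta*) = 0.
X_aug = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
print(best_fit_mse(X_aug, y))      # ≈ 0.0
```

The gap in L(θ*) here is purely an expressivity question: no amount of data fixes it; only a richer hypothesis class does.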
Another type of question, which we discussed already, is the statistical aspect. [00:04:15] Sometimes people call this generalization theory. This is about the excess risk, as we discussed in the last several weeks: you are trying to bound from above the difference between your learned hypothesis θ̂ and the best hypothesis θ*. And what we have done was something like this: you bound the excess risk by (L(θ̂) − L̂(θ̂)) + (L̂(θ*) − L(θ*)). [00:05:07] People have called the first term the generalization error. The generalization error is the difference between the population loss and the empirical loss at the learned parameter; basically, the difference between training loss and test loss at the learned parameters. If θ̂ here is the ERM, then this is talking about ERM.
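Written out in the course's notation (my reconstruction of the board; L is the population loss, L̂ the empirical loss, θ̂ the ERM, θ* the best in the class):

```latex
L(\hat\theta) - L(\theta^*)
  = \big(L(\hat\theta) - \hat L(\hat\theta)\big)
  + \big(\hat L(\hat\theta) - \hat L(\theta^*)\big)
  + \big(\hat L(\theta^*) - L(\theta^*)\big)
  \le \underbrace{\big(L(\hat\theta) - \hat L(\hat\theta)\big)}_{\text{generalization error}}
  + \big(\hat L(\theta^*) - L(\theta^*)\big),
```

where the dropped middle term is at most 0 because the ERM minimizes L̂.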
But maybe in other cases you are using some other algorithm to find θ̂, and then you bound the generalization error for that. And the second term, as we argued, is always small no matter what hypothesis class you use: basically, as long as your loss function is bounded, this term is always something like 1/√n, so that's why we don't care about that term much. [00:05:58] Okay. So what we have done was to prove this kind of generalization bound: you prove something like L(θ̂) − L̂(θ̂) is bounded by some complexity measure of the hypothesis class over √n. And the principle here is that if your hypothesis class is of low complexity, then you have better generalization error.
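One concrete instance of this shape, from the uniform-convergence results of the earlier lectures (stated schematically; constants depend on the exact setup): for the loss class F = {(x, y) ↦ ℓ(h(x), y) : h ∈ H} bounded in [0, B], with probability at least 1 − δ,

```latex
L(\hat\theta) - \hat L(\hat\theta)
  \le \sup_{h \in \mathcal{H}} \big( L(h) - \hat L(h) \big)
  \le 2\, R_n(\mathcal{F}) + B \sqrt{\frac{\log(2/\delta)}{2n}},
```

where R_n(F) is the Rademacher complexity, playing the role of the "complexity" in the numerator.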
So simple hypotheses can generalize better. I think people sometimes also call this Occam's razor, which is a kind of philosophical principle dating back to William of Ockham, somewhere around the 14th century. The principle is that simple, or parsimonious, explanations generalize better to other situations. And you can see, even from these two things, that there is some kind of conflict, or trade-off, between approximation theory and generalization theory: if you use a very, very simple hypothesis class, then your L(θ*) may not be good enough. For example, for the data I drew here, if you use a linear model, then your L(θ*) is not great, but your generalization error could be very good, because your model is linear and simple. So there's some trade-off.
And I think people also sometimes call this the bias-variance trade-off. The variance mostly corresponds to generalization theory: it's the statistical error introduced by learning from finite data, which is why you have to pay something that depends on how many examples you have; that's the variance. And the bias, or the expressivity, is the quantity that depends on the fundamental power of your hypothesis class; it's not something that depends on how many examples you have. So the bias-variance trade-off is essentially the same thing here, but the exact definitions of bias and variance really only apply to square loss and linear models; that's why we don't use those exact terms here, but the principles are somewhat related. [00:08:30] And
you can also extend this generalization theory a little bit by considering the regularized loss. In some sense you can consider this as an application, or implication, of the generalization theory. Say you use a regularized loss, something like L̂_reg(θ) = L̂(θ) + λ · R(θ), where R(θ) is a regularizer that captures the complexity of the hypothesis. Then you can hope to have a claim like this: a statistical claim of the following form — of course this depends on exactly which regularizer you use, which models, and so forth, but the form of the claim is — if θ̂_λ is the global minimizer of L̂_reg, then you have a generalization bound.
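A minimal concrete instance of such a regularized loss (my own sketch, not from the lecture) is ridge regression, where R(θ) = ‖θ‖² and the global minimizer of L̂(θ) + λR(θ) is available in closed form:

```python
import numpy as np

def ridge(X, y, lam):
    """Global minimizer of the regularized loss
    L_reg(theta) = (1/n) * ||X @ theta - y||^2 + lam * ||theta||^2."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam I) theta = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1., -2., 0., 0., 3.]) + 0.1 * rng.normal(size=50)

theta_hat = ridge(X, y, lam=1e-3)
# Larger lambda penalizes complexity harder and shrinks theta toward zero.
assert np.linalg.norm(ridge(X, y, lam=10.0)) < np.linalg.norm(theta_hat)
```

Here a small value of the regularized objective certifies both a small training loss and a small complexity R(θ̂_λ), which is exactly what the statistical claim feeds on.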
You can bound either the excess risk or the generalization error; they are pretty much related, as we have discussed. [00:09:57] So they are bounded by something. This is the type of result you would typically get from this kind of statistical, or generalization, theory. The reason is that if you optimize this regularized loss and you indeed find a very small regularized loss, that means your regularizer R(θ), the complexity, is small, and it also means that your training error is small. And if both of these are small, then you can show that your excess risk is small, because this model will generalize to the population, to the test case.
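Schematically, the form of the statistical claim looks like this (my paraphrase; the exact dependence on R depends on the regularizer and the model class):

```latex
\text{If } \hat\theta_\lambda = \operatorname*{arg\,min}_\theta \;
    \hat L(\theta) + \lambda R(\theta),
\text{ then w.h.p. } \quad
L(\hat\theta_\lambda) - \hat L(\hat\theta_\lambda)
  \;\le\; \frac{\text{complexity}\big(R(\hat\theta_\lambda)\big)}{\sqrt{n}} ,
```

so a small regularized loss certifies a small complexity and a small training loss at once, and the two together bound the excess risk.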
[00:10:46] Any questions so far? So then there's a third aspect, which is called optimization. The question here is about how to numerically find θ̂, where θ̂ could be the argmin of the training loss, or maybe θ̂_λ, the argmin of the regularized loss. [00:11:36] And this is, at least in the classical way of thinking about it, a purely separate question: you can forget about where your data comes from, you can forget about why you care about minimizing the training loss; you just say, I'm given the training loss, and minimizing it is my job. And typically the approach is something like: if the loss function is convex, you use convex optimization; or maybe you use gradient descent for non-convex functions, and so forth, or maybe stochastic gradient descent. There are many different approaches. [00:12:17] And when you measure the
success, the way you measure it is that you care about how well you can approximate the minimizer. You can never find the exact minimizer using a numerical approach, so you always have some small error compared to the minimizer of the empirical loss, and you can measure that error in different ways: maybe in terms of the sub-optimality, that is, how different your solution is from the best minimizer in terms of the loss function value, or you can compare other kinds of quantities. [00:12:55] So in some sense, from this kind of summary, you can think of the statistical part as pretty much independent from the optimization part.
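A toy illustration of measuring success by sub-optimality (my own sketch): run gradient descent on a convex quadratic and look at L̂(θ_t) − L̂(θ̂), which shrinks geometrically but never reaches exactly zero.

```python
import numpy as np

# A convex quadratic standing in for the empirical loss:
# L_hat(theta) = 0.5 * theta^T A theta - b^T theta.
A = np.array([[3.0, 1.0], [1.0, 2.0]])        # positive definite
b = np.array([1.0, -1.0])
L_hat = lambda th: 0.5 * th @ A @ th - b @ th
theta_min = np.linalg.solve(A, b)             # exact minimizer, for reference

theta = np.zeros(2)
eta = 0.25                                    # step size < 2 / lambda_max(A)
for _ in range(30):
    theta -= eta * (A @ theta - b)            # gradient step on L_hat

suboptimality = L_hat(theta) - L_hat(theta_min)
print(suboptimality)   # tiny positive number: close to, but not exactly, optimal
```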
Of course, there are also interesting interfaces between them. For example, when you write down a regularizer, you can ask the question: what regularizer can simultaneously have good statistical performance but also be easy to optimize? By easy to optimize, I mean you can optimize it fast, or maybe optimize it in a certain time, maybe O(d) time or O(d²) time, and so forth. So there are still interactions between the different parts, but if you just need a high-level understanding, you can think of them as roughly separate parts. The interactions are more in the lower-level details: if you ask how to achieve the best statistical efficiency, or the best computational and statistical efficiency together, then you have to talk about the interactions. But at a high level you don't have to think about them
simultaneously; you can think of them roughly separately. [00:14:12] [In response to a question:] Sorry — oh, no, no, sorry, my bad. These are just two things written there; my writing is bad. And these two quantities are basically similar: you care about the excess risk, which is the most important thing, but it is almost the same as the generalization error, and in fact you bound the generalization error in order to bound the excess risk. [00:14:42] Okay, so any questions so far? So these are the standard ways of thinking about these questions. But what happens in deep learning is that, as you'll see, things become more complicated, and for fundamental reasons. I think the first thing is that for deep learning there are probably two things that change, at least on the surface. One thing that
changes is that you go from a linear model to a non-linear model, and this directly affects the optimization, because when you have a non-linear model, the loss becomes non-convex. [00:15:24] But this alone wouldn't change the structure of the view fundamentally, because it just makes the optimization question harder. At least, this is what I thought at the beginning, maybe more than five years ago, when I started doing deep learning theory right after deep learning took off. At the very first, I thought that the only difference is that now your optimization question becomes harder, and then the question is just how you optimize better. But then, probably about three or four years ago, people
realized that there's also another fundamental difference, from the statistical perspective, which is that empirically you essentially always use these so-called over-parameterized models. Maybe it's not precise to say that you always use over-parameterized models, but generally over-parameterized models are better; more parameters are almost always better, so more parameters generally help. [00:16:34] And it can help even to the extent that your parameters outnumber the data points: it still helps when d is larger than n. [00:16:46] And it even helps when you already have zero training error. So this is a plot that I got from a paper, by Neyshabur, Tomioka, and Srebro in 2015, and this is what they found.
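A quick numerical check of why zero training error is the norm in this regime (my own sketch, not the paper's experiment): once d > n, a generic linear model can interpolate any labels, even random ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # over-parameterized: d > n
X = rng.normal(size=(n, d))            # generic design matrix (rank n)
y_random = rng.normal(size=n)          # arbitrary labels, no signal at all

# Least squares finds an exact interpolant: zero training error.
theta_hat, *_ = np.linalg.lstsq(X, y_random, rcond=None)
print(np.max(np.abs(X @ theta_hat - y_random)))   # ≈ 0
```

So in the over-parameterized regime, training error by itself carries essentially no information about the model's quality, which is part of what makes the classical picture break down.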
Of course, this is only a very small dataset, but roughly speaking the same phenomenon also holds for larger datasets. You can see here that the black curve is the training error, and the x-axis is how many hidden units you have, how large the network is. Hidden units means the number of neurons in your network; of course, if you have more neurons you have more parameters, and actually the number of parameters is quadratic in the number of neurons in this fully connected case. This is a very simple fully connected network on MNIST, and you can see that once you have more than 64 hidden neurons you can fit MNIST perfectly, with zero percent training error — I think literally zero, or maybe not exactly, maybe 0.1% error, something like that. [00:18:09] And if you look at a typical textbook,
Now, if you look at a typical textbook, what you would do is predict that the test error will go up after a certain point, because you're overfitting: you are using too complex a model and you are overfitting to the data. That's the purple curve, which is what you would probably learn from some of the classical textbooks. And it actually does happen in some classical settings; it just doesn't happen often in neural networks, probably never happens in real networks. [00:18:43] What really happens is the red curve: the generalization error will actually continue to improve as you have more and more neurons, even though you can memorize everything. So if you compare 64 versus 4K, basically these are just two neural networks; both of them have fit the training data with 100% accuracy, but one of them has better test accuracy than the other. [00:19:13] So this is kind of a big mystery from a theoretical point of view, especially if you believe in the classical trade-off between bias and variance, or the trade-off between expressivity and generalization. So this is a big open question. And let me briefly discuss, again, what the impact is on each of these concepts, because you even have to rethink some of these concepts; some of them become entangled, or intertwined, in deep learning. So first of all, for approximation theory, I think things don't really change that much, at least compared to the other parts.
[00:20:10] So for approximation theory, generally, we know that large models are expressive. There's actually something called the universal approximation theorem; I'm not sure if you've heard of it. In some sense it's saying that if you have a neural network that is wide enough, then you can approximate any function. Of course that's, in some sense, a misleading way to say it, because what does "wide enough" mean? If you need an exponential number of neurons, that's indeed wide enough, but it's not really implementable. Empirically you don't even need that many neurons to be expressive; I think you just need a normal number of neurons. But anyway, the gist is that we do believe, regardless of whether this universal approximation theorem is exactly answering the question, [00:21:11] that neural networks are very powerful. So we generally believe that the error of the best model in this family, especially if you use a wide enough network, is generally small. And at least what you can show is that the minimum of the training loss is really small, because if you have a neural network with more than n neurons, where n is the number of examples, you can provably memorize all the training examples. [00:21:57] At least you can find one network that memorizes all the examples. The network may not generalize, but this already means that your minimum training loss is very small; it's probably zero. [00:22:15] Okay, so basically, for approximation theory, we generally believe that the models are very expressive.
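The memorization claim can be sketched concretely: with n hidden ReLU units and random first-layer weights, the n×n matrix of hidden activations is generically invertible, so output weights exist that fit any n labels exactly. Here I solve a linear system instead of running gradient descent, just to exhibit one interpolating network; the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5                      # n training examples in d dimensions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)            # arbitrary labels, even pure noise

# One hidden layer with n ReLU units and random first-layer weights.
W = rng.normal(size=(d, n))
H = np.maximum(X @ W, 0.0)        # hidden activations, shape (n, n)

# H is generically invertible, so output weights a with H @ a = y exist:
# the network x -> relu(x @ W) @ a fits all n training points exactly.
a = np.linalg.solve(H, y)
residual = np.max(np.abs(H @ a - y))
print(residual)                   # numerically ~0: training loss is essentially zero
```

Note this only shows existence of an interpolating network; it says nothing about whether that network generalizes, which is exactly the lecture's point.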
[00:22:23] And then there's the generalization part, which becomes quite complicated. So there's another piece of information about what practical networks do, which is that in practice, people don't use very strong regularization; only weak regularizations are used. And this is a somewhat important thing to note, because recall that even in the classical setting, you can have a setting where you have a lot of parameters but a strong regularization to compensate; that's allowed in the classical setting. For example, if you use sparse linear regression, where you have a lot of features, [00:23:24] the dimension is very high, but you regularize the sparsity of your linear model, then... wait, speaking of sparse linear models, I think I forgot to do something that we left over from last time, about a comparison; but okay, anyway, I think I should have it, but let's continue with this. [00:23:48] So what I was saying is that even in the classical case, you are allowed to have d bigger than n, the dimension can be bigger than n, as long as you use regularization, because if you use regularization, you implicitly restrict the complexity. For example, if you say the sparsity of your model is s, and s is less than n, then that's okay. However, in deep learning, in practice, we only use very weak regularization.
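As an illustration of that classical regime (this is my own sketch, not an example from the lecture): sparse linear regression with d > n, fit with an L1 penalty via proximal gradient descent (ISTA). The problem sizes and λ are made up; the point is that explicit regularization restricts the effective complexity even though d exceeds n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 50, 200, 3                  # far more features than examples
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:s] = [2.0, -3.0, 1.5]         # ground truth is s-sparse, s < n
y = X @ w_true + 0.01 * rng.normal(size=n)

# ISTA: proximal gradient descent on (1/2n)||Xw - y||^2 + lam * ||w||_1
lam = 0.05
step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
w = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

support = np.nonzero(np.abs(w) > 0.5)[0]
print(support)                        # the large coefficients sit on the true support
```

Despite d = 200 > n = 50, the L1 penalty drives almost all coordinates to exactly zero, so the fitted model is effectively low-complexity.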
[00:24:21] Typically just some L2. And even with L2 it can work; sometimes even without L2 it can work pretty well. And also the regularization strength is relatively small: small enough that you can still fit your training data with basically 100% accuracy. [00:24:45] And another way to state the weakness of the regularization is to consider the following fact: this regularized loss, if you just regularize with something like L2 with some λ, doesn't have a unique global minimizer. [00:25:15] Or at least it has very different approximate global minimizers. Maybe if you really care about numerical precision, if you care about very, very small precision, then maybe there's a unique global minimizer. But for practical purposes, there are many different global minimizers that are very similar in terms of training accuracy, and in terms of the regularized loss: they all have very small loss from the regularizer part, they all have very small loss from the training part, and they are different global minimizers. [00:26:01] And another thing is that it's also not true that all of these global minimizers perform the same on the test set. [00:26:27] I guess it's probably easier to just have a figure here; I think I did prepare a figure. So this is an experiment I did a few years back; there are many different plots like this you can find online in different papers.
[00:26:49] This is just one of them. I actually tweaked it a little bit to exaggerate the differences, but the gist is always the same. So this is CIFAR-10, and you have two algorithms, the red one and the blue one, and I'm plotting the training and the test error. These two algorithms only differ by the learning rate: they have the same training objective, they have the same regularization strength; it's just that the optimizers are different. [00:27:22] At the end, you see that both of these two algorithms found some global minimizer, or approximate global minimizer; you can see the training error is close to zero in both of the two cases. So both of these achieve the global minimum in some sense, or at least approximately, up to very good approximation. But you can see that their test errors are very different. [00:27:49] That means these are two different global minima for sure, in the parameter space, and they also perform very differently on the test set. [00:27:57] So that's kind of the mystery, because this refutes the possibility of having a theorem like in the classical case. Recall that in the classical case you typically have a theorem saying something like: if you find a global minimizer, any global minimizer, of the regularized loss, then you can generalize; you can bound the generalization gap. And this is no longer the case, because not all the global minimizers are the same: some of them are better, some of them are worse. You probably shouldn't have the same bound for all of them, and some of them probably just don't generalize at all.
[00:28:40] So this is saying that you cannot just say any global minimizer generalizes; you have to somehow distinguish the different global minimizers found by different algorithms. So what happens here is that the optimization starts to come into play. Basically, as I said, different optimizers found different global minima, and some of them are better and some of them are worse. So that is saying that optimization is not only about [00:29:22] finding any minimizer, any global minimum. If you just say you found a global minimum, that's not enough; you have to use optimization to find the right global minimum. So in some sense the optimization has two jobs. One is that it has to find something that has a small error, a small regularized loss, and the other job is that it also has to find the right global minimum. [00:30:02] So in some sense the picture in my mind is like this. I'm using a one-dimensional picture: this dimension is the parameter, and basically I'm envisioning this kind of toy case where you have the landscape of the training loss and the test loss. The training loss has two global minima, and one of them is a good global minimum and the other one is a bad one, bad in the sense that the corresponding test error is bad. And the optimization algorithm is not only responsible for finding an arbitrary global minimum; it actually has to find the good global minimum instead of the bad global minimum. So somehow the optimization algorithm is doing something beyond what it is supposed to do.
[00:30:51] Right, so in some sense, this is a one-dimensional case. If you think about the high-dimensional case, here is something I often use in my slides: it's kind of like going to a ski resort. The first time I came to America, I didn't realize that you can have multiple valleys, and multiple parking lots, in the same ski resort. So when I go back home, I just follow gradient descent, right? I just go down to an arbitrary valley, and I found that my car was not there. And then it's actually trouble, because the resort is closed and nothing can lift you back up, so it's actually pretty annoying. But then I realized that there are actually multiple global minima, and one of them is better than the others, so you have to find it. So it's not arbitrary within this set; or, maybe, gradient descent is doing something more than just arbitrary downhill skiing. [00:32:06] Right, so why the generalization... so the question is exactly: mathematically, where does the classical theory break down? I think [00:32:22] the bounds become vacuous. Basically, the bounds you can prove become vacuous; the bounds you can prove under this language, under the existing language, become vacuous. So basically, if you say that you want to prove a bound that works for all neural networks of size 10 million, or of size like 100 million, with only one million examples, if that's the language you are using, then it wouldn't work anymore. So you have to have a more precise way to think about it. Does that answer the question to some extent?
[00:33:18] Right, that's roughly speaking the approach we're going to take, but there's one problem with this. If you just do exactly what you said, there's a problem, which is that you're going to get the same bound for any algorithm, but empirically, different algorithms have different performance. And the way to fix it is that you first say that different algorithms find models with different complexity, and then you can have different bounds for them. So the algorithm has to come into play in some way. [00:33:57] So basically, that's the conclusion here: the algorithm has to come into play in your statistical analysis, because if you don't have the algorithm there, you are not going to distinguish these different algorithms. So in some sense you entangle the statistics with the optimization, to some extent. [00:34:20] And now, basically, the way to fix it, at least the current plan, the general agenda that I think most researchers seem to agree on, is that you analyze the optimization and analyze why the optimizer finds a good local minimum. So basically you need to have a theory that says something like: the optimizer finds a θ̂ such that, one, the θ̂ is a global minimum, or an approximate global minimum, of the empirical risk; and also, two, the θ̂ has some special property [00:35:14] that you didn't explicitly ask it to have. For example, the property could be low complexity. [00:35:23] So maybe, just to give you a concrete case: say you run the algorithm without any regularization.
[00:35:37] But then, magically, even though you didn't let it regularize, the θ̂ it found actually has a low L2 norm, or even the minimum L2 norm. You can actually prove these kinds of theorems in certain cases. And then, because of this special property, this implies that it can generalize. [00:35:57] And people have proved theorems of this kind of form in many different cases. [00:36:13] So, for example, you can talk about SGD. SGD probably has some special preferences in terms of what models it wants to find, and maybe SGD with different kinds of specifications, right, so large learning rates, small batches, and so forth; I'll talk about that in a moment. But generally, we want to say that the practical optimizers people are using can have some preferences for certain types of global minimizers.
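The simplest setting where a minimum-norm theorem of this kind holds is overparameterized linear regression: gradient descent on the squared loss, started from zero, never leaves the row span of the data, so among the infinitely many interpolating solutions it converges to the minimum-L2-norm one. A small numpy sketch (the dimensions are made up; this is the linear case, not the neural-network case from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                          # more parameters than examples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on ||Xw - y||^2, started from w = 0.
w = np.zeros(d)
lr = 0.4 / np.linalg.norm(X, 2) ** 2   # stable step size for this quadratic
for _ in range(5000):
    w -= lr * 2.0 * X.T @ (X @ w - y)

# The minimum-L2-norm interpolant is the pseudoinverse solution.
w_min_norm = np.linalg.pinv(X) @ y

print(np.max(np.abs(X @ w - y)))       # ~0: w interpolates the training data
print(np.max(np.abs(w - w_min_norm)))  # ~0: GD found the min-norm solution
```

No regularizer appears in the objective, yet the algorithm's own dynamics select the low-norm global minimizer: exactly the "special property you didn't ask for".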
[00:36:46] And then after you have this — as I said, after you have the special preference — you can use it: this part, from the special property, the low complexity, to generalization, could be more classical. This could be classical theory, or maybe an improvement of classical theory, depending on what complexity measure you are talking about, as you suggested. [00:37:11] So that's kind of the current way of thinking among people in theory. Of course there are other approaches, but I think this is pretty much the high-level approach people have almost reached consensus on. [00:37:32] And what are the best results? Let me give a brief summary of the best results
people know, roughly speaking, in each of these aspects. [00:37:50] Okay, sorry — let me just make this a little more formal. A little more formally, you basically have three tasks, in my language. [00:38:04] First, you prove that — I guess I'm repeating myself a little bit here — you prove that the optimizer converges to an approximate global minimum θ̂ of L̂(θ). [00:38:31] Then, as a second task, you also have to prove that, in addition to one, the θ̂ also has low complexity: for example, something like R(θ̂) ≤ C for some complexity measure R. And this R depends on the algorithm — even on the details of the algorithm, like the learning rate and the batch size, and so forth. [00:39:13] And then task three: you say that for every θ such that R(θ) ≤ C, and maybe L̂(θ) is close to zero —
so for every θ with low complexity and small training error — we have that the test error is also small. [00:39:50] That's kind of the general idea. And here is what people have done in this area. [00:40:00] Regarding task one, which is the optimization question: if you want to associate a keyword to this, people would call the first question optimization, and the second question people often call the implicit regularization effect. [00:40:24] Let me actually explain this: it's implicit because you never told the algorithm to minimize this complexity — it's implicitly in the optimization procedure — and it's a regularization effect because you get some low-complexity solution. [00:40:39] And the third one — this is probably
one this is probably like the more or less the the classical [00:40:44] like the more or less the the classical authorization block right [00:40:50] and for task one I think [00:40:53] and for task one I think um [00:40:54] um what happens is that if you don't have [00:40:56] what happens is that if you don't have regularization [00:40:57] regularization so I guess uh sorry so for task one I [00:41:02] so I guess uh sorry so for task one I think for the optimization question [00:41:03] think for the optimization question so one of our search you know consider [00:41:06] so one of our search you know consider the case where you don't have [00:41:08] the case where you don't have over parametrization [00:41:11] this is a [00:41:14] this is a over primary physics [00:41:21] you know without over parametrization in [00:41:23] you know without over parametrization in some special case you can still prove [00:41:25] some special case you can still prove this in some special case [00:41:28] this in some special case uh for example Matrix factorization [00:41:30] uh for example Matrix factorization problem [00:41:35] maybe linearizing Network [00:41:40] or maybe something like tens [00:41:42] or maybe something like tens optimization [00:41:44] optimization um you can show that green descent or [00:41:46] um you can show that green descent or SGD [00:41:48] SGD can converge [00:41:53] to look Global mean [00:41:57] so here linearizing Network means that [00:41:59] so here linearizing Network means that you don't have any activations basically [00:42:01] you don't have any activations basically activation is linear so you just stack a [00:42:03] activation is linear so you just stack a bunch of linear models which doesn't [00:42:05] bunch of linear models which doesn't really have any doesn't really do [00:42:07] really have any doesn't really do anything from a statistical point of [00:42:08] anything from a statistical point of view it's just purely for [00:42:11] view 
[00:42:11] It's purely an exercise for your technique, in some sense — but you can still publish papers on it, just because everything about optimization is very complicated; even analyzing linear networks is difficult. [00:42:25] So this is one thing that people have done, but you can see that it doesn't really address all the issues, because you don't allow over-parametrization: it only works for linear neural networks, or matrix factorization problems like matrix completion, and so forth. [00:42:46] And recently — in the last three or four years, I think — you can also do this optimization question for neural networks, for almost any network, deep or shallow and so forth, but with the caveat that you have to allow for special hyper-parameters.
[00:43:19] Special hyper-parameters means something like this: first of all, you need over-parametrization — that's actually probably good, because empirically people use over-parametrization anyway — but the limitation is that you also need a special learning rate, or a special initialization — especially the initialization — and learning rate, and so forth. [00:43:44] And that becomes a problem. By the way, this is typically called the NTK approach — neural tangent kernel — which I'm going to talk about more in future lectures, and explain why it is called the neural tangent kernel. So this is the so-called NTK approach. [00:44:02] And the problem with this approach is that the special initialization is a problem, and also the special learning rate or special algorithm. You also need something about the batch size: for example, in most of the papers you have
to take the batch size to be very big — you can only analyze gradient descent; you cannot have stochastic gradient descent. [00:44:26] So this is kind of the restriction on the hyper-parameters. At the beginning we thought, okay, that's not a big problem: we have these hyper-parameters, and next we will probably extend the results to other hyper-parameters. [00:44:39] But it turns out that there is some serious limitation in the hyper-parameters, because, as I motivated before, even if you just change the learning rate schedule — in the figure we saw, which is a real experiment — you change the performance of your model. [00:44:58] So if you analyze a special learning rate schedule and a special initialization, then maybe you are not actually analyzing
so [00:45:11] anything impressive right so um so for example like in this ntk case [00:45:13] um so for example like in this ntk case I think what we can analyze [00:45:16] I think what we can analyze um the algorithm you cannot analyze [00:45:18] um the algorithm you cannot analyze wouldn't give you the best performance [00:45:21] wouldn't give you the best performance that deep learning offers you probably [00:45:22] that deep learning offers you probably get something like 80 on CD far but the [00:45:24] get something like 80 on CD far but the best algorithm probably got like 95 uh [00:45:27] best algorithm probably got like 95 uh of course you know they are [00:45:29] of course you know they are improvements to uh along this line but [00:45:32] improvements to uh along this line but generally the the issue is that you [00:45:36] generally the the issue is that you you make this type of parameter so [00:45:38] you make this type of parameter so special so that you lose [00:45:40] special so that you lose the correct implicit regularization [00:45:43] the correct implicit regularization effect of the optimizers [00:45:46] um and and you are you're analyzing an [00:45:48] um and and you are you're analyzing an Optimizer that doesn't have the correct [00:45:50] Optimizer that doesn't have the correct implicit organization effect so that [00:45:52] implicit organization effect so that they don't generalize as well as the as [00:45:54] they don't generalize as well as the as the real deep learning algorithms [00:45:56] the real deep learning algorithms but still I'm going to talk about this [00:45:58] but still I'm going to talk about this because this is a very nice idea and in [00:46:00] because this is a very nice idea and in certain cases it's pretty useful so [00:46:05] certain cases it's pretty useful so um and then [00:46:07] um and then um for the [00:46:09] um for the impressive organizations question right [00:46:11] impressive organizations question right 
[00:46:14] Then, for the implicit regularization question — the question about why the optimizer prefers a certain kind of low-complexity model — people have had a lot of results on special cases: special models, or actually maybe I should say simplified models. [00:46:37] (I don't know why — somebody took my [inaudible] for some reason and I have to use the book.) Anyway: special, simplified models, and also special optimizers. [00:46:57] But here "special" means special in the right ways, so that you're analyzing the effect of the optimizer — you focus on one aspect in each paper, in some sense. So what are the models people have analyzed? For example, linear regression: here you can say that certain initializations prefer certain kinds of models. And you can also talk about logistic regression.
[00:47:29] And here you will see that you can prove something like: even though the model just tries to minimize the logistic loss, it actually tries to find the max-margin solution. [00:47:42] And also for matrix sensing and matrix factorization problems, and linear neural networks — you can talk about these. And there are also special aspects of the optimizers. Of course, in some sense, sometimes it has to be a combination of the problem and the optimizer, because certain optimizers wouldn't have implicit regularization for certain problems. [00:48:13] So you can talk about GD; you can talk about SGD — and for SGD I think there are actually also results about the noise covariance, like what covariance will give you the right implicit regularization, and also the noise scale, which also matters.
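The logistic-regression statement above — gradient descent on the plain, unregularized logistic loss drifts toward the max-margin separator (see e.g. Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data") — can be sketched on a tiny hand-made dataset of mine. It is chosen symmetric so that the max-margin direction is exactly (1, 0):

```python
import numpy as np

# Separable toy data: positives at x1 = +1, negatives at x1 = -1.
# The hard-margin (max-margin) direction is (1, 0).
X = np.array([[ 1.0,  0.5],
              [ 1.0, -0.5],
              [-1.0,  0.5],
              [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

theta = np.zeros(2)
for _ in range(2000):
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))       # logistic-loss weights per point
    grad = -(y[:, None] * X * s[:, None]).mean(axis=0)
    theta -= 1.0 * grad                              # plain GD, no regularizer

print(theta / np.linalg.norm(theta))   # direction = (1, 0), the max-margin separator
print(np.linalg.norm(theta))           # the norm keeps growing (~ log t); the direction converges
```

Here symmetry pins the iterates to the max-margin direction from the first step; the general theorem shows the same directional convergence, at a slow logarithmic rate, for any linearly separable data.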
[00:48:39] And you can also talk about, for example, dropout — this is something you do in your optimizer which will change the implicit bias — and you can also talk about the learning rate, which is actually important too, and batch size, and so forth. [00:48:59] And there are also unsolved open questions — for example momentum, or normalization; all of these have some implicit regularization effect. That's why this becomes complicated: everything you do in your optimizer, everything you change, could possibly have an implicit regularization effect — sometimes positive, sometimes negative. Of course, most of the tricks we have seen have a positive effect, because that's why they survived and got published. [00:49:30] So that's the status quo, I guess. And I'm also going to try to mention a more general result that we have,
[00:49:44] one that some collaborators and I have done. So you can also try to have more general results, which say something like: SGD on L̂(θ) is roughly equivalent to doing gradient descent on L̂(θ) + λ·R(θ), for some regularizer R. [00:50:13] This is a much-simplified, high-level version of a result we can show, but of course there are limitations: these more general results have weaknesses in other aspects — for example, you may need additional assumptions, or you can only deal with certain kinds of stochasticity, and so forth. [00:50:37] But I think from this result you can see the kind of thing we are trying to do: if you add stochasticity, then automatically, implicitly, you get a
regularizer for free — even though you are using stochastic gradient descent on the original training loss, somehow you get a regularizer for free somewhere. [00:50:58] Okay. So basically we are going to talk about many of these in the next few lectures, in the future lectures. [00:51:10] And for task three, the generalization bound: this is also an interesting question for deep learning, because you want precise generalization bounds that are compatible with the regularizer you got from the previous part. We have said that the optimizer has a preference — but does that preference lead to better generalization? That's another open question. [00:51:42] So for example, one of the early papers, in 2017, proved
that if you use this as the complexity measure — where W_i is the weights of the i-th layer — then you can guarantee a generalization bound. That's one of the early results along this line. [00:52:15] But the problem with this is that it is not precise enough — it is still too big, too demanding in some sense. [00:52:28] I will talk about the limitations when I really get to this, but this is still not precise enough, and you sometimes need a more fine-grained complexity measure that is more compatible — more fine-grained. [00:52:59] And also, ideally, you want something that is the result of the optimizer.
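For illustration, norm-based measures of this kind typically multiply per-layer norms. The sketch below (my own, not the exact measure from the 2017 result on the board) uses the product of spectral norms; the point is that it is scale-sensitive rather than parameter-count-based:

```python
import numpy as np

def product_of_spectral_norms(weights):
    """An illustrative norm-based complexity measure for a deep network:
    prod_i ||W_i||_2 over the layer weight matrices."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

rng = np.random.default_rng(0)
layers = [rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(4)]

R1 = product_of_spectral_norms(layers)
R2 = product_of_spectral_norms([2.0 * layers[0]] + layers[1:])
print(R1, R2)   # doubling one layer doubles the measure, regardless of parameter count
```

A bound driven by such a measure can stay meaningful in the over-parametrized regime, where dimension-counting bounds are vacuous — which is exactly why one hunts for the right complexity measure here.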
[00:53:08] That is, you want this regularizer here to be the same regularizer as the one you had in the implicit regularization effect. [00:53:15] So that's the third part. Yeah, I think that's basically a high-level overview of some of the lectures we're going to have in the next few weeks. And of course there are other open questions in deep learning as well — for example, what's the role of over-parametrization? In these tasks I didn't mention any of those, and so forth. But for those kinds of things I don't think there's a systematic study yet, so that's why we don't talk about them much for now. [00:53:54] And for the intermediate plan, I'm going to talk about task three here first, because we are in this mode of proving generalization bounds — we have talked about Rademacher complexity.
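As a quick concrete reminder of that quantity (my sketch, not part of the lecture): for the unit-ball linear class, the supremum inside the Rademacher complexity has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo over the random signs:

```python
import numpy as np

# Empirical Rademacher complexity of F = {x -> <w, x> : ||w||_2 <= 1}
# on a fixed sample x_1..x_n.  For this class,
#   sup_{||w|| <= 1} (1/n) sum_i sigma_i <w, x_i> = ||(1/n) sum_i sigma_i x_i||_2,
# so we just average that norm over random sign vectors sigma.

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))                  # the fixed sample

sigma = rng.choice([-1.0, 1.0], size=(2000, n))  # 2000 Monte Carlo sign draws
rad_hat = np.mean(np.linalg.norm(sigma @ X, axis=1)) / n
print(rad_hat)   # close to the classical bound sqrt(sum_i ||x_i||^2) / n, i.e. ~ sqrt(d/n)
```

The estimate sits just below the classical √(Σᵢ‖xᵢ‖²)/n bound (Jensen's inequality), which for this isotropic sample scales like √(d/n).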
[00:54:07] And all of these bounds depend on the Rademacher complexity, so I'm going to talk about that first and then move on to the other parts. [00:54:14] Any questions so far? [00:54:25] Sorry, I didn't hear the question. [00:54:37] Yeah, I got the question. So the question is whether any of these results or tasks depend on the data distribution. Yes, they all depend on the data distribution, I think. All of them assume some underlying data distribution; some of them require something stronger, and some of them just require some regularity conditions, but I don't think you can get away without any data-distribution assumption. [00:55:02] And some of them have very strong data-distribution assumptions, to be fair. And that's actually, in my opinion, one of the technical challenges here — it's kind of a subtle balance: if you assume too
much about the data, then you lose realisticness; but if you assume — sorry — if you assume too little about the data, then you run into hardness results. [00:55:40] Certainly, without any data assumption you probably shouldn't be able to prove almost any results here, just because things simply become hard — especially if you talk about the computational procedure, it's very easy to get into NP-hard instances. [00:55:57] So we need some data-distribution assumption. Another, even more complex, question is how you leverage your data-distribution assumption — we don't have a lot of tools. [00:56:12] For example, if you assume the data is Gaussian, then you know something about the moments and so forth, and you can do certain kinds of derivations,
[00:56:18] but I don't feel we use even the properties of Gaussians enough, in some sense, let alone other kinds of data distribution assumptions; we don't have a lot of good tools to use them. [00:56:34] Cool. So if there are no other questions, I'm going to move on to the generalization bound for deep networks. And you can see that this is still roughly in the mindset of the classical setting; the only difference is that we are looking for proper complexity measures, not only the dimension dependency but something more, sometimes more complicated. [00:57:13] And you'll see that this part is really a direct extension of what we have done in the last three weeks, because the tools are shared; it's really just that you need better tools in some places. [00:57:28] Okay, so now let's talk about the particular setups that
[00:57:34] we can handle. We're going to start with two-layer neural networks, and then in the next few lectures we're going to move on to multiple layers. For two layers, let's use the following notation. Say your parameter θ consists of two parts: one part is w and the other part is U. Here w is the second (linear) layer and U is the first layer. So basically your network f_θ(x) will be something like wᵀφ(Ux), where U is a matrix that maps dimension d to dimension m, where m is the number of neurons. So Ux will be m-dimensional, and you apply an element-wise ReLU function: φ is the element-wise ReLU, so φ of a vector (z₁, …, z_m) equals (max(z₁, 0), …, max(z_m, 0)).
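A minimal NumPy sketch of the two-layer model just defined; the variable names, shapes, and random values are my own choices, not from the lecture:

```python
import numpy as np

def relu(z):
    # element-wise ReLU: phi(z)_j = max(z_j, 0)
    return np.maximum(z, 0.0)

def two_layer_net(w, U, x):
    # f_theta(x) = w^T phi(U x); U is m x d, w is m-dimensional
    return w @ relu(U @ x)

rng = np.random.default_rng(0)
d, m = 5, 8                      # input dimension, number of neurons
U = rng.standard_normal((m, d))  # first layer (m x d)
w = rng.standard_normal(m)       # second (linear) layer
x = rng.standard_normal(d)
print(two_layer_net(w, U, x))    # a single scalar
```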
[00:58:56] Right, so after you apply φ you get an m-dimensional vector, and then you take the inner product with w and get a single scalar. So we have a model that outputs a single scalar using these two layers, U and w. And again we still call (x_i, y_i) the training dataset, as usual. [00:59:23] Okay, so our goal is: first, we'll show a Rademacher complexity bound, and then we'll also talk about how this Rademacher complexity bound is relevant to practice. [00:59:50] Okay, so for today we probably won't even be able to finish number one, because I'm actually going to give two bounds, one better than the other. [01:00:01] So here is the theorem for the Rademacher complexity bound. The theorem is: suppose you have a hypothesis class that consists of models of this form, parameterized by θ, where you require that the norm of w is at most B_w and the norm of each u_i is bounded by B_u. I guess I didn't
[01:00:35] define u_i, so let me say this: U is an m-by-d matrix, and let's say its rows are u₁ᵀ up to u_mᵀ. So each u_i is of dimension d, and that's why Ux is really the vector of inner products of the u_i with x. That's the notation I'm going to use. So basically the u_i are the rows of the weight matrix. So we restrict the norms of w and of the u_i: ‖w‖₂ ≤ B_w and ‖u_i‖₂ ≤ B_u. [01:01:23] And then we also assume something about the data: the data has expected squared 2-norm at most C², that is, E[‖x‖₂²] ≤ C² (I guess this should actually be C squared; a typo there). [01:01:43] So then, under all of these assumptions, you can prove the Rademacher complexity bound R_n(H) ≤ 2 B_w B_u C √m / √n. [01:02:08] So I guess just a remark: this is not an ideal bound, not a good bound, because m shows up
[01:02:24] in the bound, and actually it shows up in the wrong way, because it says that if you have more neurons, you have a worse bound. So this m shows up in the more classical sense, where more neurons means a more complex model, and that's not great. So basically you cannot use this theorem to explain the success of deep learning or of over-parameterized models, because it says an over-parameterized model will have bigger Rademacher complexity, whereas you actually want a bound that gets better as m goes to infinity, in some sense, to explain the plot that I showed earlier. You want that, as m goes to infinity, you get a better and better bound; but this one gives you a worse and worse bound. But nevertheless, let's prove it.
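Plugging numbers into the theorem's right-hand side makes the complaint concrete; this is just the stated formula 2·B_w·B_u·C·√m/√n evaluated with B_w, B_u, C held constant (the constants and sample size are arbitrary choices of mine):

```python
import math

def rademacher_bound(B_w, B_u, C, m, n):
    # the theorem's bound: 2 * B_w * B_u * C * sqrt(m) / sqrt(n)
    return 2.0 * B_w * B_u * C * math.sqrt(m) / math.sqrt(n)

n = 1000
for m in [64, 256, 1024, 4096]:
    print(m, rademacher_bound(1.0, 1.0, 1.0, m, n))
# each 4x increase in the number of neurons m doubles the bound:
# it gets worse, not better, as the model is over-parameterized
```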
[01:03:13] This is kind of a warm-up for what we will show next. [01:03:30] I see, so maybe let me rephrase the question first to make sure. If my understanding is correct, your question is: why do you expect this curve to keep going to zero, or keep decreasing, instead of really going up after a certain point? Isn't it just that we don't have enough data points, that we didn't run that super-large-scale experiment? I think the answer is that we do think this is already large enough for us to believe it will never go up, just because 4K neurons for this task is really, really a lot; 64 already allows you to memorize, and typically you wouldn't even run so many. Maybe it would be easier to convince you if I showed you 4 up to 128, and you'd see
[01:04:28] something like this; and then, when you asked me the question, I would show you 128 up to 4K, and you'd probably be more convinced. But yeah, 4K is already pretty large, I think. Of course you can never rule out the possibility that after, say, a million neurons it goes up; it just sounds unlikely. [01:05:01] [inaudible student question] [01:05:17] Right, right. So I guess the intention of the question was whether this bound really is growing as m goes to infinity, because both B_w and B_u could depend on m, and maybe they depend on it in different ways: maybe B_w increases as m goes to infinity and B_u decreases as m goes to infinity. That's definitely a possibility. So I think the thing here is that I'm choosing this scaling so that it is at least arguably fine to think of B_w and B_u as constants.
[01:05:59] So why? The reason is that u_i is the contribution of each individual component, while w controls the contribution of all the components together. So in some sense, at the top layer you are controlling the total contribution from all the components, and you want that to be constant; you don't want it to grow as m goes to infinity. So basically, maybe one way to think about this is the following. The scaling here does make some sense, because if u_i is on the order of a constant, then u_iᵀx is at least a constant that doesn't depend on m: u_i doesn't depend on m, and u_iᵀx doesn't depend on m. Here I'm writing this a little bit
[01:06:57] loosely: θ could probably have some dependency on d, but we only care about the dependency on m. So u_iᵀx is on the order of a constant, and then you have the sum over i of w_i φ(u_iᵀx). Each of these terms is on the order of a constant, and for w the total contribution is constant; so that's why you can believe the total sum is on the order of a constant. It's not that each individual w_i is on the order of a constant; it's that the sum of their squares is on the order of a constant. So in some sense you can believe that this whole thing is on the order of a constant, especially if you, say, replace each w_i by 1/√m. Then this is actually, let's see: roughly speaking, if you use Cauchy–Schwarz, you can
[01:07:54] upper-bound it by something like (Σ_i w_i²)^(1/2) times (Σ_i φ(u_iᵀx)²)^(1/2). And this is, let me see, why is this not of constant order? Maybe we're actually being even more generous than that: the ℓ₂ norm of w is a constant, but I think you can still make the sum bigger if all the terms are correlated. [01:08:56] Okay, so I'm pretty sure there's an answer I should be able to give, but I don't have a complete answer right now, so maybe we can discuss offline afterwards. But I think the scaling is chosen to be at least reasonably correct. Of course you can still argue with it; for example, depending on how w correlates with the φ(u_iᵀx), there's always some flexibility. But I think the scaling is relatively okay.
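The Cauchy–Schwarz step can be checked numerically. This toy check (a setup of my own) confirms |Σ_i w_i φ(u_iᵀx)| ≤ ‖w‖₂·‖φ(Ux)‖₂, and shows the right-hand side sitting near the √m scale even when ‖w‖₂ is held constant, which is exactly the puzzle being discussed:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 400, 10
x = rng.standard_normal(d)
U = rng.standard_normal((m, d)) / np.sqrt(d)   # rows u_i of roughly unit norm
w = rng.standard_normal(m) / np.sqrt(m)        # ||w||_2 = O(1)

phi = np.maximum(U @ x, 0.0)                   # phi(u_i^T x) for i = 1..m
lhs = abs(w @ phi)
rhs = np.linalg.norm(w) * np.linalg.norm(phi)  # Cauchy-Schwarz upper bound
print(lhs, rhs)                                # lhs <= rhs always
```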
[01:09:31] Anyway, it's a very good question, because you can get misleading results if you're not careful about the scaling. [01:09:43] Okay, so let's see. Maybe I have 15 minutes; I think I can prove the theorem in 15 minutes. [01:09:58] So what we do is use the definition of the Rademacher complexity and gradually peel off the sup, like we did before: we have a sup and we have to somehow get rid of it. [01:10:32] [inaudible student question] But compared to the quantity you normally have, this differs by the φ; so that's why, I don't know, I was thinking of using that argument, but I think it's not going to be right. Because for Σ_i w_i φ(u_iᵀx), I think the most pessimistic bound would be something like this: if you replace each w_i by
[01:11:00] 1/√m, and each of the φ(u_iᵀx) by a constant, you get a sum of m such terms, and this will be √m. So in some sense this is not helping me justify this scaling. [01:11:13] But if you believe that there's some cancellation, suppose you believe there's cancellation here, then this would be something on the order of 1. So basically, if you believe there is cancellation, then this original scaling is the right one. Or in other words, suppose you want to make the scaling even smaller: suppose you want to say that B_w is even smaller than the scaling I gave, or B_u is even smaller. Then you have to assume there's a strong correlation in your model; otherwise your model wouldn't even output something on the order of 1. So the question is whether you're willing to do that.
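The cancellation point can be illustrated numerically (again a toy setup of my own): with random signs w_i = ±1/√m the positive φ terms largely cancel and the output stays order 1 as m grows, while a top layer aligned with the activations at the same ℓ₂ norm produces an output on the √m scale:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
x = rng.standard_normal(d)

for m in [100, 10000]:
    U = rng.standard_normal((m, d)) / np.sqrt(d)
    phi = np.maximum(U @ x, 0.0)
    # random-sign top layer: w_i = +/- 1/sqrt(m), so ||w||_2 = 1,
    # and the (all nonnegative) phi_i terms largely cancel
    w_rand = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
    # fully aligned top layer with the same norm: no cancellation at all
    w_align = phi / np.linalg.norm(phi)
    print(m, abs(w_rand @ phi), w_align @ phi)
```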
[01:12:00] So, for example, suppose I tell you that this class is actually expressive enough: then I have to convince you that I can choose B_w to be on the order of maybe 1 and the u_i to be on the order of 1/√m, and then the bound would indeed not grow as m goes to infinity. But you'll find that Σ_i w_i φ(u_iᵀx) is then very difficult to make big; you'd have to line everything up to make it big enough to fit a label. So would you be willing to do that? I think you can arguably say that's not really realistic. [01:12:42] Okay, cool. So I guess let's prove this. The proof, as I said: we're going to try to remove the sup in the definition of the Rademacher complexity step by step. So first of all, let's define v to be the post-activation, the intermediate layer: v = φ(Ux), which is
[01:13:12] the m-dimensional post-activation vector. And you can correspondingly define v_i = φ(U x_i), the corresponding activation for the i-th example; this is m-dimensional. Then, using this notation, the empirical Rademacher complexity is: you take the expectation, where the randomness is from σ, of the sup of (1/n) Σ_i σ_i f_θ(x_i). But instead of f_θ(x_i) I'm going to rewrite it as wᵀv_i, just because that's the notation. So let me replace this here: wᵀv_i, and this is f_θ(x_i). And here you take the sup over both w and U, where the dependency on U is hidden in v_i. And let's clean this up and pull the w out front: you have sup over U and sup over w of wᵀ((1/n) Σ_i σ_i v_i). I guess this probably looks familiar
[01:14:38] to you, because we did something like this in the linear case as well. And then you can get rid of the w, but you still have the U. So you take the sup over U, and get rid of the w: w has an ℓ₂-norm bound, ‖w‖₂ ≤ B_w, so the sup over w of this inner product equals B_w times the ℓ₂ norm of (1/n) Σ_i σ_i v_i. [01:15:14] Right, okay. So now we've gotten rid of the w; we can put B_w in front. And now let's deal with the U. Wait, I think I shouldn't have this here anymore; my bad. [01:15:40] So now, this is a sum from 1 to n, and let's rewrite it by plugging back in the definition of v_i, which is φ(U x_i). And what I'm going to do, if you're getting familiar with this you can see it's a very loose way to do it, is replace the 2-norm by the infinity norm.
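The step that removed the sup over w is the dual-norm identity sup over ‖w‖₂ ≤ B_w of wᵀa = B_w‖a‖₂, attained at w = B_w·a/‖a‖₂; a quick numeric sanity check (the example vector and radius are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(6)  # plays the role of (1/n) * sum_i sigma_i v_i
B_w = 2.5

# the maximizer over the ball {||w||_2 <= B_w} is w* = B_w * a / ||a||_2
w_star = B_w * a / np.linalg.norm(a)
best = w_star @ a
print(best, B_w * np.linalg.norm(a))  # these two agree

# and no other feasible w does better
for _ in range(1000):
    w = rng.standard_normal(6)
    w = w * min(1.0, B_w / np.linalg.norm(w))  # project into the ball
    assert w @ a <= best + 1e-9
```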
say that [01:16:16] this is less than square root and [01:16:19] this is less than square root and times the infinity Norm of this [01:16:29] this is just because you know [01:16:31] this is just because you know and then back to V2 Norm is less than [01:16:34] and then back to V2 Norm is less than squared M times the infinite [01:16:40] and [01:16:41] and right so if V is in M dimensional [01:16:46] right so if V is in M dimensional okay so [01:16:48] okay so um and now the reason why I want to [01:16:50] um and now the reason why I want to replace it by Infinity Norm is can be [01:16:52] replace it by Infinity Norm is can be seen later because missing no because [01:16:57] seen later because missing no because somehow with infinite Norm I can [01:16:59] somehow with infinite Norm I can simplify the soup [01:17:02] simplify the soup so now I have a soup [01:17:04] so now I have a soup and [01:17:06] and note that what this Vector is this [01:17:08] note that what this Vector is this Vector maybe let's [01:17:10] Vector maybe let's do something here so this is a [01:17:12] do something here so this is a this Vector is a sum of [01:17:15] this Vector is a sum of a bunch of vectors right the infinite [01:17:18] a bunch of vectors right the infinite Norm is about the dimension the [01:17:19] Norm is about the dimension the coordinates of this vector [01:17:21] coordinates of this vector so basically uh [01:17:25] so basically uh each of this Dimensions is really [01:17:28] each of this Dimensions is really something like sum of Sigma I [01:17:32] something like sum of Sigma I v u j x i [01:17:35] v u j x i and the sum is over I [01:17:38] and the sum is over I this is the this is the JS dimension of [01:17:40] this is the this is the JS dimension of this vector [01:17:42] this vector so [01:17:43] so basically I can take sub over J [01:17:47] basically I can take sub over J and soup over [01:17:50] and soup over uh U and [01:17:53] uh U and sum of Sigma I [01:17:55] 
[01:17:59] And now I can actually also just write u_j here, because if I take the sup over j, the expression only depends on u_j. And that's actually kind of the main reason why we want to use the infinity norm: once you write it this way, you find that all the j's are equivalent. Anyway you're taking a sup, so it doesn't matter whether it's u_1 or u_2; the sup is the same. So this is equal to, oh sorry, this is an inequality: at most the sup over a single vector u, where you replace u_j by u and require ‖u‖₂ ≤ B_u, because each u_j had the norm bound B_u (let's skip the details for simplicity). And then you can write this as the sup over u of (1/n)|Σ_i σ_i φ(uᵀ x_i)|. In some sense you've removed the m dependency, because for the infinity norm, the number of coordinates doesn't
matter. [01:19:31] So now there is one step where I'm going to remove the absolute value. Let's first just remove it: removing it requires paying a factor of 2, and it requires an argument that is not exactly trivial, which I will not prove in the interest of time. [01:19:57] So you can remove this absolute value — the reason, and it's in the lecture notes, is actually fundamentally pretty simple. It's basically because the sup is almost always positive: you can choose u so that the quantity inside is positive, so with or without the absolute value it doesn't really matter, at least for this case. If you are taking a sup, it's going to be positive anyway, because you
can choose u to make the quantity positive. [01:20:30] So for this step I will ask you to refer to the lecture notes for the formal proof. [01:20:37] And then, after removing the absolute value, you can see that this is something like a Rademacher complexity of something simple, because you can view φ(u^⊤ x) as your function now, and this is the Rademacher complexity of that kind of function. But you still have φ applied on u^⊤ x, right — so that's why we are going to use the Lipschitz composition. [01:21:03] So this will be less than 2 times — you copy all the constants — [01:21:23] and this is by the Lipschitz composition, Talagrand's contraction lemma. [01:21:39] So in some sense, you can define, I guess, H′ to be the family of functions u^⊤ x, [01:21:57] and then you can also look at the composition φ ∘ H′.
[01:22:04] So the Rademacher complexity of φ ∘ H′ is going to be equal to this quantity, and this is at most the Lipschitz constant of φ — which is 1, since the ReLU is 1-Lipschitz — times the Rademacher complexity of H′, which is this quantity. [01:22:20] Okay, so that's how we do it, and now it becomes linear: u^⊤ x_i is a linear function class, and I think we have done this before. For the ℓ2-norm-constrained linear class you can get that this is less than 2 √m — sorry, 2 √m B_w — times B_u times √(Σ_i ‖x_i‖₂²), all over n. [01:23:03] This is just by what we had for the linear model. [01:23:22] [Student: where did the 2 come from?] So the 2 comes from here, this line — this is the thing I didn't fully explain — when you remove the absolute value.
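The linear-class step used here has a closed form: for fixed signs σ, the sup over ‖u‖₂ ≤ B_u of Σ_i σ_i u^⊤ x_i equals B_u‖Σ_i σ_i x_i‖₂ (Cauchy–Schwarz, tight when u is parallel to the sum), and Jensen's inequality then bounds its expectation by B_u√(Σ_i ‖x_i‖₂²). A quick Monte Carlo sanity check of that bound (the data, sizes, and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B_u = 40, 5, 2.0
X = rng.standard_normal((n, d))  # rows are the x_i (illustrative data)

# For the linear class {x -> u^T x : ||u||_2 <= B_u}, the sup inside the
# empirical Rademacher complexity has a closed form:
#   sup_u (1/n) sum_i s_i u^T x_i = B_u * ||sum_i s_i x_i||_2 / n
trials = 2000
vals = []
for _ in range(trials):
    s = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    vals.append(B_u * np.linalg.norm(s @ X) / n)
est = float(np.mean(vals))

# Jensen's inequality gives the bound from the lecture:
bound = float(B_u * np.sqrt((X ** 2).sum()) / n)
```

On random data the estimate typically sits a few percent below the bound, which is the Jensen gap.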
[01:23:39] [Student: can you get exactly the same bound without losing the 2?] Is that the question? I suspect it's possible, but I'm not 100% sure. The proof in the lecture notes does lose the 2. But it sounds like it should be possible to save that factor of 2, because the intuition I had doesn't really tell you why you should lose anything: my intuition is that this quantity is just always positive, so the absolute value doesn't matter, and that intuition doesn't tell you why you should lose the 2. [01:24:18] Well, when I actually proved it — at least the proof I figured out, or read from the book, maybe I figured it out myself — I lose the 2, so maybe it's because I didn't do exactly the right thing. [01:24:40] Okay. And then, for the very last step, you can take the expectation of the empirical Rademacher complexity.
[01:24:48] This is the expectation over S, and then you just get, like what we did before: the expectation of this is less than C√n, so you get the bound 2 √m B_w B_u C over √n. That's because you use Cauchy–Schwarz for this part; this is exactly the same as what we have done for the linear models. [01:25:14] Okay, so I guess this is a natural stopping point. Next time we're going to have a bound that somewhat improves on this, so that you don't have the expensive dependency on m. [01:25:27] Any questions? [01:25:33] Okay, I guess I'll see you on Wednesday.

================================================================================
LECTURE 008
================================================================================
Stanford CS229M - Lecture 8: Refined generalization bounds for neural nets, Kernel methods
Source: https://www.youtube.com/watch?v=gwKfeDRCvSg
---
Transcript

[00:00:05] Okay, so let's get started. So I think last time, where we were left — I think we covered the weaker generalization bound.
[00:00:23] And today we are going to prove a stronger generalization bound for the neural network. Let me just double-check whether I — sorry, I think somehow I was confused about where I was. [00:00:39] Okay, cool. Yeah, so last time what we did was that we had this generalization bound of the form where you have something like a √m factor on top of the √n in the denominator, and today we are going to remove that √m — not exactly by just improving the bound; we also have to somewhat change the hypothesis class. So that's the first part of the lecture. [00:01:15] And then we are going to talk about — so first we talk about the stronger version, and then we talk about some connections to kernel methods. [00:01:28] And then we will talk about even stronger bounds for multiple-layer neural networks, and that requires some preparation with some techniques.
[00:01:37] We'll talk about those techniques if we have time today; otherwise we'll talk about them next week. [00:01:45] Okay, so just to briefly review the setup. The setup was that we have some θ which consists of two layers: the second layer, which is the vector w, and the first layer, which is a matrix U mapping dimension d to dimension m. Our model is something like w^⊤ φ(Ux), and φ is the element-wise ReLU. [00:02:23] So last time, what we had was this generalization bound, of the form that the Rademacher complexity of H is bounded by something like 2 B_w B_u C √m over √n, where H is defined by restricting the ℓ2 norm of w to be at most B_w, and restricting the max over j of ‖u_j‖₂ to be at most B_u. And that's the hypothesis class.
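The two-layer model just reviewed, f_θ(x) = w^⊤ φ(Ux) with element-wise ReLU, can be sketched directly; the function names, shapes, and random values below are illustrative:

```python
import numpy as np

def relu(z):
    # phi: element-wise ReLU
    return np.maximum(z, 0.0)

def f_theta(w, U, x):
    # two-layer net from the lecture: f_theta(x) = w^T phi(U x)
    return float(w @ relu(U @ x))

rng = np.random.default_rng(2)
m, d = 8, 3
w = rng.standard_normal(m)        # second layer: a vector in R^m
U = rng.standard_normal((m, d))   # first layer: maps R^d to R^m
x = rng.standard_normal(d)
y = f_theta(w, U, x)
```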
[00:03:05] In some sense — I guess we discussed this a little bit in class, and I think somebody asked this question — you see, there is a scaling invariance, because (αw, U/α) would be the same model as (w, U): you can scale one layer by α and down-scale the other layer by 1/α, if α is bigger than zero, since the ReLU is positively homogeneous. [00:03:31] So that means you can also change this bound a little bit and rewrite it as something like: roughly speaking, the generalization error is bounded by — sorry, this is √m over √n — times the ℓ2 norm of w, times the max over j of the ℓ2 norm of u_j. That's kind of the intuitive way to think about this.
[00:04:05] So today we're going to have a stronger bound that doesn't have the √m in here, but will have some slightly different terms, in terms of how you measure the complexity of w and the complexity of U. [00:04:22] Okay, so here is the refined bound — let me state the theorem first. So the theorem is that we define this complexity measure, called C(θ), which is defined to be the sum over j of the absolute value of w_j times the ℓ2 norm of u_j: C(θ) = Σ_j |w_j| ‖u_j‖₂. [00:04:49] And correspondingly, given this complexity measure, you can define the corresponding hypothesis class, which is the family of functions with bounded complexity: H = { f_θ : C(θ) ≤ B }. And we also assume that ‖x_i‖₂ ≤ C for every i.
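The complexity measure C(θ) = Σ_j |w_j|‖u_j‖₂ just defined is straightforward to compute; a small sketch with hand-picked numbers so the value can be checked by eye (the helper name is illustrative):

```python
import numpy as np

def complexity(w, U):
    # C(theta) = sum_j |w_j| * ||u_j||_2, with u_j the j-th row of U
    return float(np.sum(np.abs(w) * np.linalg.norm(U, axis=1)))

w = np.array([1.0, -2.0, 0.5])
U = np.array([[3.0, 4.0],    # ||u_1||_2 = 5
              [0.0, 1.0],    # ||u_2||_2 = 1
              [6.0, 8.0]])   # ||u_3||_2 = 10
C_theta = complexity(w, U)   # 1*5 + 2*1 + 0.5*10 = 12
```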
[00:05:13] Note that here we actually have a stronger assumption on the data, because before we assumed the average of the squared norms is less than C²; now we assume each data point has norm less than C. This is just a technicality in some sense. [00:05:31] And with all of this, we can prove that the Rademacher complexity of H is bounded by 2 B C over √n. [00:05:51] Okay, so maybe let me first start with some interpretation of this theorem, and see why this is an interesting one to prove, and then do the proof. So, a few remarks. The first one is: why is this, you know, better than before? So I'm claiming that this is strictly better than before, at least in the following sense. [00:06:23] The way that I compare them is the following. Before, what we had was that the generalization bound is something like √m over √n times this complexity — something like the ℓ2 norm of w times the max over j of ‖u_j‖₂.
[00:06:40] As we said, that's kind of the intuitive way of thinking about it, if you assume C is a constant — C is just something about the data, which doesn't change as we change the hypothesis class, so it's really something like a constant. [00:06:55] And now you can basically think of this new bound as O(1/√n) times B, where the capital B is basically Σ_j |w_j| ‖u_j‖₂. So the way I'm comparing them is that I'm comparing these two quantities, and I claim that the second quantity is smaller than the first quantity. [00:07:17] And the reason is just that, if you do some simple calculation, you can see this: you first do a Cauchy–Schwarz, and you say that Σ_j |w_j| ‖u_j‖₂ is less than (Σ_j w_j²)^(1/2) times (Σ_j ‖u_j‖₂²)^(1/2).
[00:07:43] And then the first term becomes the ℓ2 norm of w, and in the second term you can bound each summand by the max, so you get (m times max_j ‖u_j‖₂²)^(1/2). [00:07:59] So what you get is √m times ‖w‖₂ times max_j ‖u_j‖₂. So in this sense, this is a strictly better bound — and, you know, they could be the same, if your w_j's and u_j's make all of these inequalities exactly tight, but in other cases it won't be. [00:08:22] And in some sense, one of the intuitions here — so another thing is that this new complexity measure C(θ) captures the scaling invariances better.
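The comparison just made — Cauchy–Schwarz, then bounding each row norm by the max — can be checked numerically: C(θ) never exceeds √m ‖w‖₂ max_j ‖u_j‖₂. A sketch with random parameters (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 10, 4
comparison_holds = True
for _ in range(500):
    w = rng.standard_normal(m)
    U = rng.standard_normal((m, d))
    row_norms = np.linalg.norm(U, axis=1)
    new_measure = np.sum(np.abs(w) * row_norms)                   # C(theta)
    old_measure = np.sqrt(m) * np.linalg.norm(w) * row_norms.max()
    if new_measure > old_measure + 1e-9:
        comparison_holds = False
```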
[00:08:47] So what do I mean by that? I mentioned the following scaling invariance: if you have (w, U), this is equivalent to (αw, U/α), because of the scaling invariance of the ReLU. But actually you have a lot more scaling invariances — you can actually scale each pair of neurons in this way. So what you really have is that this is actually equivalent to scaling each w_j to α_j w_j and scaling correspondingly u_j to (1/α_j) u_j, and you can do this with a different scalar α_j for every pair (w_j, u_j), and it is still the same model. [00:09:41] That's just because Σ_j w_j φ(u_j^⊤ x) is the same as Σ_j α_j w_j φ((1/α_j) u_j^⊤ x), for any scaling α_j that is positive.
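The per-neuron rescaling invariance can be verified directly: scaling w_j by α_j > 0 and u_j by 1/α_j leaves both the model output and C(θ) unchanged. A minimal sketch (names, sizes, and values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f_theta(w, U, x):
    # f_theta(x) = w^T phi(U x)
    return float(w @ relu(U @ x))

def complexity(w, U):
    # C(theta) = sum_j |w_j| * ||u_j||_2
    return float(np.sum(np.abs(w) * np.linalg.norm(U, axis=1)))

rng = np.random.default_rng(4)
m, d = 6, 3
w = rng.standard_normal(m)
U = rng.standard_normal((m, d))
x = rng.standard_normal(d)
alpha = rng.uniform(0.5, 2.0, size=m)   # one positive scalar per neuron

w2 = alpha * w            # scale each w_j by alpha_j
U2 = U / alpha[:, None]   # scale each row u_j by 1/alpha_j

same_output = bool(np.isclose(f_theta(w, U, x), f_theta(w2, U2, x)))
same_complexity = bool(np.isclose(complexity(w, U), complexity(w2, U2)))
```

Both checks rely on the positive homogeneity of ReLU: φ(cz) = cφ(z) for c > 0.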
[00:10:02] Right, so you can see that if you consider this kind of invariance, this complexity measure is still the same: the complexity measure is really invariant to the scaling here, because if you change w_j and u_j accordingly, you don't change the complexity — which, to some extent, seems to be a good thing to have. [00:10:27] But before, the complexity measure doesn't have this property: if you look at that complexity measure and you scale each w_j by a different scalar, and you scale each u_j by the corresponding inverse scalar, this number would change. [00:10:47] [Student question.] Right — so you are saying that — okay, yes. So for this one, yes, you do make a stronger assumption to prove this bound. [00:11:09] Sorry — what was the question? Maybe I didn't answer it. [00:11:32] So yeah, I'm guessing what you are saying is that before, the condition was something like (1/n) Σ_i ‖x_i‖₂²
being less than C² — I think that was in the previous theorem. [00:12:00] Right, so indeed the new condition is stronger than the old one, because this one implies the old one automatically. [00:12:12] So yeah, I'm assuming that — suppose you say this is not a problem; you just live with the stronger assumption — then our bound is strictly better. In some sense this assumption is a little bit less important, to some extent, because, for example, if your data satisfies the stronger assumption anyway, then the distinction is less important. [00:12:35] So yeah, you are right that the data assumption is a little bit different, but I don't think it matters that much. [00:12:48] Right, right — that's true, that's definitely true. Or you can choose the right C. But I guess the question was more about comparing the two theorems — you know, if you normalize here,
or maybe you should normalize there — so what's the fair comparison? [00:13:07] Cool. So this is one thing. One other thing about this complexity measure: in some sense, this complexity measure is a little bit more invariant, to at least the trivial invariances in the network, and also the bound is better. [00:13:21] And another thing that we can note about this theorem is that if you have m go to infinity, you at least get, you know, a stronger — or equivalent — theorem. So the theorem for larger m is stronger. [00:13:44] What do I mean by that? Let me explain. Suppose you look at the dependency on m: this whole theorem depends on m implicitly somewhere — I didn't specify that, but now let's make it more explicit. Let's say H_m is this hypothesis class, the same thing as before, where you have m neurons
and C(θ) ≤ B. [00:14:09] All right, so for every m our theorem applies — I'm just making the dependency on m a little bit more explicit. And you know that H_m is a subset of H_{m+1}, in the sense that if you have a function that is in H_m, you can always add a fake, zero, dummy neuron to make it an H_{m+1} member. [00:14:35] So for any f_θ in H_m, you can add a dummy neuron — meaning that you set w_{m+1} = 0 and u_{m+1} = 0 — and then you have extended this function so that it belongs to H_{m+1}. [00:14:55] So H_{m+1} is always a bigger family of functions than H_m. And that's why you have a stronger theorem: the bound doesn't depend on m — you have the same Rademacher complexity bound for every m — so in some sense your bound is stronger for bigger m.
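The H_m ⊆ H_{m+1} embedding just described is plain zero-padding: append w_{m+1} = 0 and u_{m+1} = 0 and the function is unchanged. A quick sketch (names and sizes are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f_theta(w, U, x):
    return float(w @ relu(U @ x))

rng = np.random.default_rng(5)
m, d = 5, 3
w = rng.standard_normal(m)
U = rng.standard_normal((m, d))
x = rng.standard_normal(d)

# Pad with one dummy neuron: w_{m+1} = 0 and u_{m+1} = 0.
w_pad = np.append(w, 0.0)
U_pad = np.vstack([U, np.zeros(d)])

# The extra term is 0 * relu(0) = 0, so the function is unchanged,
# and C(theta) is unchanged too, since |0| * ||0||_2 = 0.
same_function = bool(np.isclose(f_theta(w, U, x), f_theta(w_pad, U_pad, x)))
```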
[00:15:14] So the strongest statement is to just apply the theorem to H_∞. And that's, in some sense, the fundamental reason why later you'll see a generalization bound that is non-increasing as m goes to infinity. [00:15:36] That's another nice property of this complexity measure. Another small remark: there's something called the path norm — if you haven't heard of it, it probably doesn't matter. [00:15:50] It's a complexity measure that people proposed and evaluated empirically, and people found that it correlates with the real generalization gap, and it's very closely related to the definition of C(θ) here. In some sense, the path norm
is saying that you look at all the paths from the input to the output and take the total norm over all the paths. [00:16:22] It's not exactly the same — it depends on which version of the path norm — but the way to think about it is this: you look at the input x, these weights are the w_j's, and these are the u_j's. Every path matters, so you look at w_j times u_j first and then sum, instead of looking at each layer first and then multiplying the layers together. [00:16:52] If you haven't heard of the path norm, what I said probably won't make much sense; if you have, you can probably see the connection. This isn't super important — it's just something people have studied empirically.
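As an illustration of the "per-path first, then sum" idea, here is a sketch of one simple version of the path norm for a two-layer net (the exact definition varies across papers, so treat this formula as an assumption): each input→neuron→output path contributes the product of the absolute weights along it. Since ‖u_j‖₂ ≤ ‖u_j‖₁, the lecture's C(θ) is bounded by this ℓ₁-style path sum.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 4
W = rng.standard_normal(m)       # output weights w_j
U = rng.standard_normal((m, d))  # row j holds the input weights u_j

# One simple version of the path norm: each input->neuron->output path
# (k -> j -> out) contributes |U[j, k]| * |W[j]|; multiply along the
# path first, then sum over all d*m paths.
path_norm = sum(abs(U[j, k]) * abs(W[j]) for j in range(m) for k in range(d))

# The lecture's C(theta) groups the paths through neuron j via the
# l2 norm of u_j instead of the coordinate-wise l1 sum.
C = float(np.sum(np.abs(W) * np.linalg.norm(U, axis=1)))

# Since ||u_j||_2 <= ||u_j||_1, C(theta) is at most the path sum.
assert C <= path_norm + 1e-12
```
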
[00:17:08] All right, so we'll talk about more implications of the theorem later, but before that let me prove the improvement. Any questions so far? [00:17:22] So how do we prove this? You can see that one of the main points of the proof is that you're able to change the scaling in the right way, because you want to capture the scale-invariance rather than pay for it when you peel things off. Before, we tried to remove the W first and then remove the U: you have a sup over W and U, and you remove each of them sequentially. [00:17:46] Now you still do the same thing — you still remove them sequentially — but you first rescale and then remove, so that you eventually get the right scale-invariance. I'm not sure this is clear in the abstract; you'll see it more
clearly in the proof. [00:18:06] So first of all, for a vector u, let's define ū to be the normalized version of u, that is, ū = u/‖u‖₂. Then let's start the derivation. What we have is that the Rademacher complexity is — I'll put the 1/n out in front just to make it easier — this is the definition. [00:18:39] The first two steps are just plugging in the definition. And now we want to rescale W and U before we take the sup. So we write w_j φ(u_j⊤x_i) as w_j ‖u_j‖₂ φ(ū_j⊤x_i): in some sense you pull the norm of u_j outside of the φ, and since the norm of u_j is a nonnegative number, you're allowed to put it outside of the φ. [00:19:43] Sorry, I'm having a little trouble reading this, but I think I remember what it was.
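The rescaling step relies on φ being positively homogeneous — true for the ReLU, which I'll assume here: φ(cz) = c·φ(z) for c ≥ 0, so ‖u_j‖₂ can be pulled outside of φ. A quick numeric check of the identity w φ(u⊤x) = w ‖u‖₂ φ(ū⊤x):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)   # phi, positively homogeneous

rng = np.random.default_rng(2)
d = 5
u, x = rng.standard_normal(d), rng.standard_normal(d)
w = rng.standard_normal()

u_bar = u / np.linalg.norm(u)         # normalized version of u

# w * phi(u^T x) == w * ||u||_2 * phi(u_bar^T x), since ||u||_2 >= 0.
lhs = w * relu(u @ x)
rhs = w * np.linalg.norm(u) * relu(u_bar @ x)
assert np.isclose(lhs, rhs)
```
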
[00:19:58] Oh, okay — there was a page break, so I couldn't read what my notes were. Anyway, you rearrange this a little bit. In some sense we treat w_j ‖u_j‖₂ as our new w_j, and we want to remove that first. [00:20:14] And you can see that this is exactly what shows up in the complexity measure: the complexity measure is basically the statement that the sum of these is at most B — it's really just the sum of |w_j| ‖u_j‖₂. So you have the sup over θ, and you change the order of summation so that it's clearer: you get the sum over j of w_j ‖u_j‖₂ times the sum over i of σ_i φ(ū_j⊤x_i). [00:20:52] And here let's specify what the constraint on θ is: the constraint is C(θ) ≤ B, which means the sum over j of |w_j| ‖u_j‖₂ is at most
B. [00:21:14] And now you can see that the sum of these coefficients is at most B, but we care about a weighted sum: each coefficient gets weighted by something, and then you take the sup. [00:21:37] Sorry — this is the problem with deriving things on the fly; for this particular line I couldn't find my notes, so I'm improvising. Okay, thanks. [00:21:48] So we know that the sum of the |w_j| ‖u_j‖₂ is at most B, which means we can use the following inequality: the sum of a_j b_j is at most the sum of the a_j times the max of the b_j. [00:22:20] That's what we're applying. And I should use the indices consistently: i runs from 1 to n and j runs from 1 to m, where a_j
corresponds to |w_j| ‖u_j‖₂ and b_j corresponds to this quantity, the sum over i of σ_i φ(ū_j⊤x_i). [00:22:45] That's exactly what I'm doing, so if you plug this in, then you get basically the sum of the a_j — the sum over j from 1 to m of |w_j| ‖u_j‖₂ — times the max over j of the b_j. [00:23:11] In some sense this is Hölder's inequality: the inner product of a and b is at most the ℓ₁ norm of a times the ℓ∞ norm of b. And now this first factor is at most B, so the whole thing is at most (1/n) E_σ [ B · sup_θ max_j Σ_i σ_i φ(ū_j⊤x_i) ]. [00:23:48] If you carefully compare this with what we had before, it should look familiar, because in some sense we've achieved almost the same thing as before: we've removed the influence of W.
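The inequality being applied — Σ_j a_j b_j ≤ (Σ_j a_j) · max_j |b_j| for a_j ≥ 0, the ℓ₁/ℓ∞ case of Hölder — can be sanity-checked on random vectors (random data here is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
a = np.abs(rng.standard_normal(m))   # a_j = |w_j| * ||u_j||_2 >= 0
b = rng.standard_normal(m)           # b_j = sum_i sigma_i phi(u_bar_j^T x_i)

# Holder with p = 1, q = infinity: <a, b> <= ||a||_1 * ||b||_inf.
assert a @ b <= np.sum(a) * np.max(np.abs(b)) + 1e-12
```
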
And what you have left about U no longer carries a scale: only ū_j appears inside the φ. [00:24:12] So basically from here on it's the same as the previous proof. Let me repeat a few of the steps. One thing you can do is notice that the max over j isn't really doing much, so you can replace it by a max over ū with ‖ū‖₂ = 1 of the sum over i of σ_i φ(ū⊤x_i). [00:24:50] Sure — that's a good point, I should have absolute values: I should have one here, and here, and here. Thanks for catching those issues. [00:25:12] And then, as you probably remember, there's a step I skipped before where I removed the absolute value by paying a factor of two,
[00:25:21] so you can bound this by two times the sup without the absolute value. [00:25:35] All of these steps are exactly the same as before, and now you can peel off the φ — you remove the φ by its Lipschitz constant, using Talagrand's contraction lemma. [00:26:07] Then what remains is the Rademacher complexity of the linear model, and doing the same thing as before you get 2BC/√n, where the C comes from the bound on the norm of the x_i's. [00:26:28] So basically, from here on this part is the same as before. There is one small difference, which is that the ū is now normalized to norm one.
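For the final step, the empirical Rademacher complexity of the linear class {x ↦ ū⊤x : ‖ū‖₂ ≤ 1} has the closed form E_σ ‖(1/n) Σ_i σ_i x_i‖₂, which is at most C/√n when ‖x_i‖₂ ≤ C. A Monte Carlo sketch (the Gaussian data and sample sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, trials = 50, 3, 2000
X = rng.standard_normal((n, d))
C = np.max(np.linalg.norm(X, axis=1))     # C >= ||x_i||_2 for all i

# For the linear class, sup_{||u||_2 <= 1} (1/n) sum_i sigma_i u^T x_i
# equals the l2 norm of (1/n) sum_i sigma_i x_i.
samples = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) / n
           for _ in range(trials)]
rademacher_est = float(np.mean(samples))

assert rademacher_est <= C / np.sqrt(n)   # matches the C / sqrt(n) rate
```
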
[00:27:03] In the earlier proof, you had some other quantity — the bound B_u — controlling the norm of u; now you know that the norm of ū is at most one, which is why B_u doesn't show up in the final bound. [00:27:17] So in some sense this is almost the same proof; the only difference is that you remove the scaling of U first — you fold the scale of U into the W — so that you can organize things a little better. [00:27:37] Any questions? [00:27:41] Okay, so next let me talk about some of the implications of the theory here; some of them are quite interesting. [00:27:54] One thing is that if you believe the theory, then what you'd directly do — this is not what people do in practice, though I'd argue it's also close to what people do in practice — but if you
just believe the theory, what you'd probably do is define the following max-margin solution: you want the max-margin, or minimum-norm, solution. [00:28:17] So you can do Program 1, where you minimize the complexity C(θ) subject to the constraint that the margin is at least one. [00:28:36] Why do we care about the margin? Recall that all of this depends on the margin eventually, because your generalization error will be the complexity over the margin. [00:28:45] Or alternatively — and I think these are exactly equivalent — you can maximize the margin subject to the constraint that your complexity is at most one; let's call this Program 2.
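The equivalence of Programs 1 and 2 comes from homogeneity: scaling the top layer by a constant c > 0 scales f_θ, C(θ), and the margin all by c, so the complexity and margin constraints can be traded against each other. A sketch with assumed helper definitions (ReLU activation, min-over-examples margin — names chosen for illustration):

```python
import numpy as np

def f(x, W, U):                 # two-layer net with ReLU activation
    return float(np.maximum(U @ x, 0.0) @ W)

def complexity(W, U):           # C(theta) = sum_j |w_j| * ||u_j||_2
    return float(np.sum(np.abs(W) * np.linalg.norm(U, axis=1)))

def margin(X, y, W, U):         # min_i y_i * f(x_i)
    return min(y[i] * f(X[i], W, U) for i in range(len(y)))

rng = np.random.default_rng(5)
n, d, m = 8, 3, 4
X = rng.standard_normal((n, d))
W = rng.standard_normal(m)
U = rng.standard_normal((m, d))
y = np.sign([f(X[i], W, U) for i in range(n)])  # labels this theta fits

# Rescale the top layer so that C(theta) = 1 (feasible for Program 2);
# f is linear in W, so the margin scales by exactly the factor 1 / C(theta).
c = complexity(W, U)
W2 = W / c
assert np.isclose(complexity(W2, U), 1.0)
assert np.isclose(margin(X, y, W2, U), margin(X, y, W, U) / c)
```
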
[00:29:16] So you can run these two programs, and the reason you want them is that your generalization error bound will be something like: the loss L(θ̂) is at most C(θ̂)/(γ_min √n) plus lower-order terms. [00:29:54] This is using the general machinery we had: you have the one over √n; this part is the Rademacher complexity — it corresponds to the Rademacher complexity of H — and this is the margin. That's what we got from the margin theory. [00:30:37] [A student asks whether this is easy to see.] I think, depending on how you look at it, you're basically right — but I would say this is already something,
that we've already achieved something, because I think the right way to think about this is to compare the two bounds. [00:31:09] In the very idealized case where all the w_j's are the same and all the u_j's are the same, the two bounds are just the same — then you're right, you're just changing the form of your bound, folding the square root in somewhere, and nothing has really changed. [00:31:27] But the thing is, the old bound is not always tight, and you probably shouldn't expect it to be: it shouldn't be the case that all the w_j's and u_j's are the same. You should expect decaying w_j's — as you have more and more neurons, you'll have smaller and smaller w_j. [00:31:43] There's just no way the old bound is tight for every m: it can be tight for one m, but if you add more
neurons, it won't stay tight. [00:31:54] So the typical situation is that as you have more and more neurons, those neurons should have smaller and smaller norm, because they're capturing more and more complex subtleties of your ground-truth function. [00:32:09] So basically I'm saying this inequality won't be tight in the idealized case — for the ground-truth function, for example. [00:32:26] But yes, from a very technical point of view, we only did a very small trick to change the form.
[00:32:46] Yeah, I think you can say that, in some sense, yes — or at least the other bound would be... I guess it depends on how you think about it. [00:33:02] The way I think about it is really just this: the two bounds are exactly the same when all the w_j's and u_j's are the same — for example all constant, or all 1/√m — and then you gain nothing from this. [00:33:20] But it will be very different if you want to fit a function where the w_j and u_j go to zero gradually as you have more and more neurons. [00:33:32] Okay, so going back to the generalization bound: the bound in some sense motivates the use of this max-margin, or minimum-norm, solution, just because your Rademacher complexity eventually depends on the complexity of the model, and you also have the margin term coming from the loss part. [00:33:57] And one of the interesting things is that, if you think about
[00:34:01] this quantity, you can show it is non-increasing as m goes to infinity. [00:34:09] The reason is actually pretty simple, but let me write it down to be clear about what I really mean. Let θ̂_m be the minimizer of, say, one of the programs, where m indexes how many neurons you're using. [00:34:39] So for every m you have a minimizer, and you can define γ*_m to be the corresponding margin — the margin under the constraint on C(θ). [00:35:01] By the way, I think I want to use Program 2 as our main one here — it's a little tidier. [00:35:17] Okay, so suppose you solve
this Program 2 and get the max-margin solution. [00:35:22] Then your bound is C(θ̂_m)/(γ √n), and because we normalized the complexity to be one, this is really just 1/(γ*_m √n). [00:35:46] That's the generalization bound, so whether this bound is better or not depends on whether γ*_m is increasing or decreasing. And interestingly, γ*_m is non-decreasing in m. [00:36:08] This is, in some sense, almost by definition: if you think about what γ*_m means, it's the maximum margin you can achieve when you restrict your complexity to be at most one and use m neurons. [00:36:26] And the thing is, when you have more neurons, you can at least achieve the same
margin you shouldn't be worse just because the only like right so with more [00:36:36] because the only like right so with more neurons [00:36:38] neurons you never get worse [00:36:42] can at least achieve the same margin by [00:36:44] can at least achieve the same margin by adding just the dummy neuron as kind of [00:36:47] adding just the dummy neuron as kind of exactly the same argument as I had right [00:36:51] at least [00:36:55] achieve the same margin [00:36:59] because you've just either done in your [00:37:01] because you've just either done in your room and it doesn't change the [00:37:02] room and it doesn't change the functionality it doesn't change the [00:37:03] functionality it doesn't change the complexity it doesn't change the margin [00:37:05] complexity it doesn't change the margin it's just everything is the same but [00:37:08] it's just everything is the same but having more neurons give you additional [00:37:10] having more neurons give you additional flexibility you could possibly kind of [00:37:12] flexibility you could possibly kind of like change your neurons a little bit [00:37:13] like change your neurons a little bit more cleverly instead of just adding a [00:37:15] more cleverly instead of just adding a dummy neuron that's by your margin [00:37:17] dummy neuron that's by your margin adding one more neuron will potentially [00:37:19] adding one more neuron will potentially make a margin bigger [00:37:21] make a margin bigger so so at least you know you never get [00:37:24] so so at least you know you never get the margin smaller by adding neurons [00:37:27] the margin smaller by adding neurons so so that means that this Bond can [00:37:29] so so that means that this Bond can decrease as an uh goes to Infinity [00:37:34] um at least it's not increasing as n [00:37:36] um at least it's not increasing as n goes to Infinity [00:37:38] goes to Infinity so so in some sense this is kind of like [00:37:40] so so in some sense this is kind of 
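The dummy-neuron argument above can be checked numerically. This is a minimal sketch, assuming a ReLU activation and the complexity measure C(theta) = sum_j |w_j| * ||u_j||_2 from the lecture; all names and the random data are illustrative choices, not from the lecture.

```python
import numpy as np

# Sketch of the dummy-neuron argument for f_theta(x) = sum_j w_j * relu(u_j . x),
# with complexity C(theta) = sum_j |w_j| * ||u_j||_2. Padding theta with a neuron
# whose output weight is zero changes neither the function, the complexity,
# nor the margin, so the best margin with M+1 neurons is at least that with M.

def relu(z):
    return np.maximum(z, 0.0)

def f(w, U, x):
    return float(w @ relu(U @ x))               # two-layer net, M = len(w) neurons

def complexity(w, U):
    return float(np.sum(np.abs(w) * np.linalg.norm(U, axis=1)))

def min_margin(w, U, X, y):
    return min(yi * f(w, U, xi) for xi, yi in zip(X, y))

rng = np.random.default_rng(0)
M, d = 3, 2
w, U = rng.standard_normal(M), rng.standard_normal((M, d))
X = rng.standard_normal((5, d))
y = np.sign(rng.standard_normal(5))

# Pad with a dummy neuron: w_{M+1} = 0, arbitrary u_{M+1}.
w2 = np.append(w, 0.0)
U2 = np.vstack([U, rng.standard_normal(d)])

for xi in X:
    assert abs(f(w, U, xi) - f(w2, U2, xi)) < 1e-12       # same function
assert abs(complexity(w, U) - complexity(w2, U2)) < 1e-12  # same complexity
assert abs(min_margin(w, U, X, y) - min_margin(w2, U2, X, y)) < 1e-12  # same margin
```

Since any M-neuron feasible point embeds into M+1 neurons this way, gamma_{M+1}^* >= gamma_M^*, which is exactly the monotonicity used in the lecture.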
[00:37:42] Compared to other bounds that have an explicit dependence on M, this is the nice feature: if a bound depends explicitly on M, you wouldn't be able to argue, just by looking at it, that the bound improves with more neurons. Here you can say the bound gets better as M goes to infinity. [00:37:57] Of course, this doesn't address everything, because this is just an upper bound; we are not saying that the actual generalization error is decreasing as M goes to infinity. That would be the ideal theorem to prove, and it would match exactly the plots I showed last time, where with more neurons the accuracy improves and the error decreases. Here we are only talking about bounds, and if the bound is loose, [00:38:31] it's unclear whether this decreasing-in-M behavior is really a big deal. That's indeed true, but I think this is a starting point: a bound that is increasing in M is completely useless, while a bound that is decreasing in M is not necessarily super powerful, but at least it's a good sign to have. [00:38:52] And in some sense it is really hard to capture the exact test error. If you want to show that the exact test error, the generalization error, is decreasing in M, basically the only setting where you can do it is linear models, at least so far. The only technique I know is to literally compute exactly what the test error is for a linear model: you do the analytical derivation, using linear algebra to simplify, [00:39:20] and in certain cases you can show that the average error indeed decreases as M goes to infinity. This has actually been a pretty popular direction in the last few years; people have done it for various kinds of linear models, but it is basically restricted to linear models. [00:39:43] Here we want to work with neural networks, so we have to live with the weaker result: we only say that the bound is decreasing, not that the actual error is decreasing. [00:39:58] Okay, so the next thing I want to say is this. Programs 1 and 2 are still different from what you do in practice: you probably don't use exactly this complexity measure, nobody regularizes like that, and if somebody tried it, it probably wouldn't make much difference.
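To illustrate the kind of exact linear-algebra computation the lecture alludes to for linear models, here is a minimal sketch under assumptions of my own (an isotropic Gaussian test distribution and a ridge estimator, not anything specified in the lecture): conditional on the training set, the expected squared test error of a linear estimator can be written down exactly, with no test samples at all.

```python
import numpy as np

# For linear models the test risk can be computed exactly with linear algebra.
# Assume y = x . beta* + noise with test points x ~ N(0, I_d). Then, conditional
# on the training data, E[(x . beta_hat - y_new)^2] = ||beta_hat - beta*||^2 + sigma^2.
# (Illustrative assumption; the lecture only says such exact derivations exist.)

rng = np.random.default_rng(0)
n, d, sigma = 200, 10, 0.5
beta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ beta_star + sigma * rng.standard_normal(n)

lam = 0.1
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge estimator

# Exact expected test risk, from the identity above.
exact_risk = np.sum((beta_hat - beta_star) ** 2) + sigma**2

# Monte Carlo check with a large held-out test set.
Xt = rng.standard_normal((200_000, d))
yt = Xt @ beta_star + sigma * rng.standard_normal(200_000)
mc_risk = np.mean((Xt @ beta_hat - yt) ** 2)

assert abs(exact_risk - mc_risk) < 0.02
```

It is this kind of closed-form risk, analyzed as the model size grows, that underlies the linear-model results mentioned above; no analogue is known for general neural networks.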
[00:40:19] What I'm going to say is that, interestingly, this complexity measure is definitely different from the L2 complexity measure, but once you minimize it, you get the same effect as minimizing L2; minimizing L2 is the same as minimizing this. Let me clarify what that means. My main point here is that maximizing the margin can be done by minimizing a cross-entropy loss with L2 regularization. [00:41:12] There are two ingredients here, cross-entropy and L2 regularization, and I'll deal with them one at a time. First I'm going to use L2 regularization in place of the complexity measure I defined, and argue that it is actually doing the same thing. Here's the first lemma. [00:41:33] Suppose you consider the program we have been considering; call its value J1: minimize the complexity C(theta) subject to the constraint that the margin is at least one. By the way, I keep switching between minimizing the complexity with the margin at least one and maximizing the margin with the complexity at most one; I should probably stick to a single version, but in my mind they are always the same, so sometimes I forget which one is on the board — the two are equivalent. [00:42:13] Anyway, here I minimize the complexity subject to the margin being at least one, and I'm claiming that if you look at another program, whose value is J2, where you minimize the L2 quantity, one half the L2 norm squared, subject to the margin being at least one,
[00:42:32] then these two programs have the same optimal value. [00:42:37] Obviously the two objective functions are not the same; the two complexity measures are different. But if you minimize them, the extremal point actually turns out to be the same, which is kind of interesting. The proof goes roughly as follows. [00:42:56] One thing you know is that the L2 regularizer, the sum of the squares of all the parameters, is sum_j w_j^2 plus sum_j ||u_j||_2^2, and you can show this dominates the complexity measure we defined. You use AM-GM, the inequality of arithmetic and geometric means (for me everything is Cauchy-Schwarz, but it's just that inequality): w_j^2 + ||u_j||_2^2 is at least 2 * |w_j| * ||u_j||_2, [00:43:41] and the factor of one half in the L2 objective cancels the two, so the L2 objective is at least C(theta). [00:43:47] So in program J2 you are minimizing a larger complexity measure. But the intuition is that, even though you are minimizing a larger complexity measure, at the extremal point the two quantities actually become equal. The intuition is that the extremal point should satisfy |w_j| = ||u_j||_2, even when you are minimizing the L2 regularizer. [00:44:29] And if that's the case, then you can believe these two programs are the same: when the minimizer of the L2 program satisfies this balance, C(theta) equals the L2 quantity, so you are not really doing anything different. That's the intuition.
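The AM-GM step can be written out as one chain, in the lecture's notation with C(theta) = sum_j |w_j| * ||u_j||_2:

```latex
\frac{1}{2}\|\theta\|_2^2
  \;=\; \frac{1}{2}\sum_{j=1}^{M}\left(w_j^2 + \|u_j\|_2^2\right)
  \;\ge\; \frac{1}{2}\sum_{j=1}^{M} 2\,|w_j|\,\|u_j\|_2
  \;=\; \sum_{j=1}^{M} |w_j|\,\|u_j\|_2
  \;=\; C(\theta),
```

with equality if and only if |w_j| = ||u_j||_2 for every j, which is exactly the balance condition the extremal point is claimed to satisfy.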
[00:44:49] If you really want to prove this formally, I guess the simplest way is the following. The inequality implies that J2 is at least J1, since you are minimizing a larger objective over the same constraint set, and you want to use the intuition to show that J1 is at least J2 as well. [00:45:10] So let theta be the minimizer of the first program. Maybe let's call the programs three and four — that's probably not a good numbering; let's call them P1 and P2. So theta is the minimizer of P1, and what you do is construct a theta-prime which is very good in terms of the second program. [00:46:06] The construction: take w_j-prime to be a rescaled version of w_j, and u_j-prime to be the correspondingly rescaled version of u_j, so that |w_j-prime| equals ||u_j-prime||_2 while the product is preserved. [00:46:24] Then you can verify that, because this is just a change of scaling, w_j-prime times phi(u_j-prime transpose x) is the same as w_j times phi(u_j transpose x) — the function is the same as before — and in terms of the complexity measure they are also the same after this transformation. [00:46:58] So C(theta) equals C(theta-prime), and f_theta equals f_theta-prime: the functionality and the complexity measure didn't change. And what's interesting is that for theta-prime, one half the L2 norm squared is also equal to C(theta-prime). That is by construction: the whole reason I'm doing this construction is that I wanted |w_j-prime| to be equal to the norm of u_j-prime; that's why I chose this scaling (and by the way, I briefly had the scaling factor upside down on the board; it should be as written now). [00:47:51] So we can verify that |w_j-prime| is the same as ||u_j-prime||_2 — this is my design, in some sense; you can verify it, and if it weren't true I would change the design to make it true, but that's the point. [00:48:15] So what does this mean? It means theta-prime satisfies the constraint of P2, and therefore one half ||theta-prime||^2 is at least J2. [00:48:45] And one half ||theta-prime||^2 equals C(theta-prime), by the balanced construction, which equals C(theta), which equals J1; I didn't change the complexity measure, because I'm just rescaling. That's why J1 is at least J2, and before we got that J2 is at least J1, so J1 equals J2. That's it. [00:49:53] Actually, I was a little hesitant about whether I should show this proof or a more intuitive version; in the lecture notes there's a slightly different way to prove the same thing. In the end everything is relatively simple, nothing really hard. This proof is very easy to verify, and the other proof in some sense captures the intuition better. The intuition is really just what I said: at the extremal point |w_j| and ||u_j||_2 have to be the same, so these two complexity measures are not different. That's the main intuition.
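The rebalancing step in the proof can be checked numerically. A minimal sketch, assuming a ReLU activation (any positively homogeneous phi behaves the same way); the explicit scaling factor sqrt(||u_j|| / |w_j|) is my own way of writing "the rescaled version" from the lecture.

```python
import numpy as np

# Rebalancing from the lemma's proof: rescale each neuron so |w_j'| = ||u_j'||_2
# while w_j' * phi(u_j' . x) is unchanged. Then (1/2)||theta'||^2 = C(theta') = C(theta).

def relu(z):
    return np.maximum(z, 0.0)

def f(w, U, x):
    return float(w @ relu(U @ x))

def C(w, U):                                    # C(theta) = sum_j |w_j| * ||u_j||_2
    return float(np.sum(np.abs(w) * np.linalg.norm(U, axis=1)))

def rebalance(w, U):
    norms = np.linalg.norm(U, axis=1)
    s = np.sqrt(norms / np.abs(w))              # per-neuron rescaling factor
    # |w_j'| = ||u_j'||_2 = sqrt(|w_j| * ||u_j||_2); positive homogeneity of
    # relu means each neuron's output w_j' * relu(u_j' . x) is unchanged.
    return w * s, U / s[:, None]

rng = np.random.default_rng(1)
M, d = 4, 3
w, U = rng.standard_normal(M), rng.standard_normal((M, d))
w2, U2 = rebalance(w, U)

x = rng.standard_normal(d)
assert abs(f(w, U, x) - f(w2, U2, x)) < 1e-8            # same function
assert abs(C(w, U) - C(w2, U2)) < 1e-8                  # same complexity
half_l2 = 0.5 * (np.sum(w2**2) + np.sum(U2**2))
assert abs(half_l2 - C(w2, U2)) < 1e-8                  # (1/2)||theta'||^2 = C(theta')
```

Because the function (and hence the margin) is unchanged, theta-prime is feasible for P2, which is exactly the step that gives J1 >= J2 above.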
[00:50:37] [Student question] So why does theta-prime satisfy the constraint of P2? The constraint is only about the margin, and the margin is only about the functionality of the model: if you predict the same thing, your margin is the same. Theta and theta-prime have the same functionality, because you only rebalance the scales — you multiply w_j by something and divide u_j by that same thing — so the functionality is maintained, and that's why the margin is unchanged. [00:51:22] [Another question, about an equality step] Oh, sorry — that step is an equality; it holds because of the balanced construction. [00:51:44] Okay, great. So that's the first lemma: what we have done is basically say that minimizing the L2 norm is the same as minimizing this complexity measure. We also wanted to handle the cross-entropy part, and that is something I'm not going to prove; [00:52:01] I'm just going to state the lemma, and if you're interested you can read the paper about it — the proof is actually relatively simple, but I don't think we'll have time today. [00:52:16] So Lemma 2: consider a regularized cross-entropy loss, L-hat_lambda(theta), which equals (1/n) times the sum over i of log(1 + exp(-y_i * f_theta(x_i))), plus lambda times the L2 regularization, ||theta||_2^2. [00:52:39] I guess this is the first time in this lecture I've talked about the cross-entropy loss, but I assume you know what it is: it's the loss for logistic regression. The input to the loss is y_i times f_theta(x_i), and the loss is log(1 + e^{-t}), the logistic loss; and then you add lambda times the L2 regularization.
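The objective in Lemma 2 is short enough to write down directly. A sketch, assuming the same two-layer ReLU model as earlier in the lecture; all names are illustrative.

```python
import numpy as np

# Regularized cross-entropy objective from Lemma 2:
#   L_lambda(theta) = (1/n) sum_i log(1 + exp(-y_i f_theta(x_i))) + lambda ||theta||_2^2
# assuming f_theta(x) = sum_j w_j * relu(u_j . x).

def relu(z):
    return np.maximum(z, 0.0)

def f(w, U, X):
    return relu(X @ U.T) @ w                   # f_theta(x_i) for every row of X

def regularized_loss(w, U, X, y, lam):
    margins = y * f(w, U, X)                   # y_i * f_theta(x_i)
    # np.logaddexp(0, -t) = log(1 + e^{-t}): numerically stable logistic loss
    data_loss = np.mean(np.logaddexp(0.0, -margins))
    l2 = np.sum(w**2) + np.sum(U**2)           # ||theta||_2^2
    return data_loss + lam * l2

rng = np.random.default_rng(2)
M, d, n = 3, 2, 8
w, U = rng.standard_normal(M), rng.standard_normal((M, d))
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                           # toy labels

# Sanity check: with lambda = 0 and theta = 0, every margin is 0, so the loss is log 2.
assert abs(regularized_loss(np.zeros(M), np.zeros((M, d)), X, y, 0.0) - np.log(2)) < 1e-12
```

Lemma 2 is a statement about the minimizer of this objective as lambda shrinks to zero.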
[00:53:16] Suppose you do this, and let theta-hat_lambda be the minimizer. I'm going to claim that for small enough lambda, theta-hat_lambda is basically doing the same thing as the max-margin solution. But there is one small thing to deal with, which is: what is the norm here? The max-margin problem needs a norm; you really need to care about the ratio between the margin and the norm. So my statement is the following: [00:54:11] as lambda goes to 0, the norm-versus-margin ratio of theta-hat_lambda — the appropriately normalized one half norm squared over the margin — converges to J1, which was defined to be the value of the minimum-norm program; I'm just recalling the definition. [00:54:48] So basically you are converging to the max-margin solution, the minimum-norm solution, up to a scaling, because here I'm looking at the ratio. [00:55:02] When you have a very small lambda, the norm of theta-hat_lambda will actually be pretty big, because the regularization is too weak, so you get a very large-norm solution; but if you normalize the norm by the margin, you find that this is actually the max-margin solution. [00:55:17] I'm not going to prove this; if you're interested it is Theorem 4.2 of a paper I wrote with collaborators. And actually the theorem is very simple, and it works not only for L2 regularization: it works for almost all homogeneous regularizers, almost all the regularizations you can think of.
if you care about the max margin uh [00:55:50] you care about the max margin uh solution with respect to certain complex [00:55:51] solution with respect to certain complex measure right so if the complex measure [00:55:53] measure right so if the complex measure could be L2 in this case it could be [00:55:55] could be L2 in this case it could be something else [00:55:57] something else like here it could be anything right so [00:55:59] like here it could be anything right so one way to achieve it is that you just [00:56:01] one way to achieve it is that you just add a very weak regularization in your [00:56:04] add a very weak regularization in your course entropy loss and that will give [00:56:06] course entropy loss and that will give you the maximum solution [00:56:14] okay [00:56:17] any questions [00:56:23] [Music] [00:56:25] [Music] yeah so the general kind of like the [00:56:29] yeah so the general kind of like the gist is that suppose you care about the [00:56:31] gist is that suppose you care about the max modern solution right but maximum [00:56:33] max modern solution right but maximum resolution requires a complex measure [00:56:35] resolution requires a complex measure right you need to say I'm minimizing the [00:56:37] right you need to say I'm minimizing the norm such a norm with the margin slogan [00:56:39] norm such a norm with the margin slogan well I'm maximize the margin with some [00:56:41] well I'm maximize the margin with some constraints right there's a norm right [00:56:43] constraints right there's a norm right so or there's a complex measure so if [00:56:45] so or there's a complex measure so if you want to get the maximum resolution [00:56:46] you want to get the maximum resolution you just put a complex mesh in the [00:56:49] you just put a complex mesh in the uh here [00:56:51] uh here right so and with a small enough Lambda [00:56:54] right so and with a small enough Lambda then and you have cross entropy laws and [00:56:57] then and 
[00:56:58] the solution will be the max-margin solution. [00:57:02] Of course, you could look for the max-margin solution directly, by solving the program, but you can also do it this way, and this seems to be more typical — at least, this is what people do in practice all the time. In some sense this just links what people do in practice with the max-margin solution, which is not what people typically do in deep learning. [00:57:30] But there is a caveat: if you care about this broader interpretation, the caveat is that you need lambda to be very small. Basically, this says that if you use a very small lambda you get the max-margin solution; but empirically you don't use that small a lambda — you actually use something bigger than
[00:57:48] this infinitesimally small lambda. So empirically you probably wouldn't get exactly the max-margin solution — you'd get something similar to it, but not exactly the same. [00:58:00] And it's actually kind of interesting. In CS229 you learned the max-margin solution, and before deep learning that was considered the right thing to do. But even for linear models — at least, I haven't seen it; I'm not a practitioner, I do a lot of theory, but when I do experiments — I've never seen the max-margin solution be the best for a linear model. Somehow, when you use a very small lambda you do get the max-margin solution, but if you use a bigger lambda, sometimes it's a little better. So I think the max-margin solution is, in some sense, just a theoretical approximation of what people really do in practice.
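The claim — that a very weak ℓ2 regularizer on the cross-entropy (logistic) loss recovers the max-margin solution — can be checked numerically. Below is a minimal sketch, not from the lecture: the dataset, step size, and λ are my own hypothetical choices. The data is symmetric about the horizontal axis, so the L2 max-margin direction is exactly (1, 0) with geometric margin 1, and gradient descent on the weakly regularized logistic loss recovers it.

```python
import math

# Toy linearly separable data (hypothetical example, not from the lecture).
# Symmetric about the x-axis, so the L2 max-margin direction is exactly
# (1, 0), with geometric margin 1.
X = [(1.0, 1.0), (1.0, -1.0), (2.0, 0.0), (-1.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
Y = [1, 1, 1, -1, -1, -1]

def grad(w, lam):
    """Gradient of mean logistic loss + lam * ||w||^2."""
    g = [2.0 * lam * w[0], 2.0 * lam * w[1]]
    for (x1, x2), y in zip(X, Y):
        s = 1.0 / (1.0 + math.exp(y * (w[0] * x1 + w[1] * x2)))  # sigmoid(-margin)
        g[0] -= y * s * x1 / len(X)
        g[1] -= y * s * x2 / len(X)
    return g

def fit(lam, steps=5000, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w, lam)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    return w

w = fit(lam=1e-3)  # very weak regularization
wn = math.hypot(w[0], w[1])
geo_margin = min(y * (w[0] * x1 + w[1] * x2) for (x1, x2), y in zip(X, Y)) / wn
# geo_margin of the normalized solution should be close to the maximum margin, 1.
```

On generic separable data only the λ → 0 limit recovers the max-margin direction; with a larger λ the direction is unchanged here only because this toy example is symmetric.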
[00:58:55] All right. So in the next part I'm trying to connect this deep learning thing — this not-very-deep, two-layer network — with the so-called L1 SVM. [00:59:10] The exact statement is in my paper as well, but it's only about three paragraphs in the appendix, and we are not really inventing it — in some sense we just said something that people already knew implicitly; we thought it was useful to write it down. [00:59:37] So the general claim is the following: we are going to claim that what the two-layer network with the max-margin solution is really doing is something like an L1 SVM
[00:59:51] in some kind of feature space. [00:59:54] But let me explain — I guess I haven't defined what the L1 SVM is. You're probably familiar with the SVM; that's the so-called L2 version. Here you'll have a slightly different version of the SVM. [01:00:11] The idea is: first of all, let's look at an infinite number of neurons, because we have claimed that more neurons is always better — so why not think about infinitely many neurons and see what they do for us? So you look at the max margin when you have infinitely many neurons; this is the largest possible margin you can achieve with any number of neurons. [01:00:35] And suppose this margin is achieved by u_1, u_2, and so forth. You might think you need infinitely many neurons — actually, you can achieve this
[01:00:56] without infinitely many neurons: you can achieve it with, I think, n + 1 neurons for n data points. [01:01:04] But let's say you have an infinite number of neurons — infinite is basically not very different from n + 1 neurons, because once you have more than n + 1 neurons you don't really gain anything more. [01:01:18] And again, ū denotes the normalization of u. We have played with this rescaling many times: the network is equivalent to one with weights w_1·‖u_1‖₂ on the normalized directions ū_1, and so forth. [01:01:47] Let's call this the setup: I'll call these the ū_j's, and I'll call the combined coefficients the w̃_j's. [01:01:57] So we've done this rescaling many times, and we know that if you rescale like this, you don't change the complexity measure.
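In symbols, the rescaling just described uses the positive homogeneity of the ReLU, σ(ct) = c·σ(t) for c ≥ 0. This is my transcription of the board, with w̃_j denoting the combined coefficient:

```latex
f_\theta(x) \;=\; \sum_{j} w_j\,\sigma(u_j^\top x)
          \;=\; \sum_{j} \underbrace{w_j\,\|u_j\|_2}_{\textstyle \tilde w_j}\;
                \sigma(\bar u_j^\top x),
\qquad \bar u_j \;=\; \frac{u_j}{\|u_j\|_2}\,.
```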
[01:02:12] And here the complexity measure is Σ_j |w_j|·‖u_j‖₂ — which is just Σ_j |w̃_j|, the 1-norm of w̃. So that's where the 1-norm comes into play. [01:02:23] So basically, the idea is that after you change this viewpoint, you just view w̃ as the variable, and then you are doing some kind of sparse linear regression, or sparse SVM. [01:02:40] So formally, what you can do is pretend that every u on the sphere — S^{d−1}, the (d−1)-dimensional unit sphere — shows up in the collection of ū's. [01:03:14] Why is this possible? Just because adding more neurons is never a bad thing: you add a lot of neurons to the u_j's and set the corresponding coefficients to zero —
[01:03:35] you just add these (u, 0) pairs, and that never changes anything. If some neuron doesn't appear in the collection, you just add that neuron to the collection with coefficient zero: it doesn't change the function, and it doesn't change the complexity measure. [01:03:50] So that's why you can pretend that the collection ū_1, ū_2, … — and I guess there are infinitely many of them — is really just the collection of all possible unit-norm vectors on the sphere; that doesn't really change anything. [01:04:09] And once you have that — once you pretend that the set of ū_j's is equal to S^{d−1} — you can take a continuous perspective: this f_θ(x), if you write the discrete version, is this sum, but you can think of it as a continuous version
[01:04:39] where for every ū you have a weight w(ū), and you integrate over all the ū's. [01:04:47] I'm not sure whether this makes sense — this is the simplest way I came up with to present this without too much rigor. Of course, I don't know whether this works for everyone. Again, in the lecture notes there is a slightly different way to introduce this, which requires a bit more rigor. [01:05:18] Any questions? [01:05:23] Oh — sorry, my bad, I forgot to write the σ here. Sorry. [01:05:52] Right, okay. So the question is whether we can have an uncountable number of neurons. Here, this is not a sup; this is just a concept.
[01:06:12] For example, you could ask the same question about integrals: when you define an integral, you are actually using a countable number of partitions and taking a limit, and you can still get something uncountable. So this is kind of the same thing. [01:06:27] Also, in some sense, eventually this is just a language — you don't actually implement the integral in practice. Does that make sense? [01:06:41] Okay. So basically, the way you can view this is as the inner product of w̃ with a φ, where φ(x) is a universal feature map. [01:07:01] You think of each of these as a feature, and this as the coefficient in front of the feature.
[01:07:13] Now, the difference here is that this feature is a predefined feature; it's no longer something learned, because you have all the possible ū's in the world in your feature set. So basically, you can view this φ(x) as one gigantic feature vector, where you have all the possible ū's in your feature set, [01:07:41] and w̃ is the coefficient vector in front of the features. [01:07:48] So if you've taken CS229: this is the feature map from the kernel part, and this is the θ — the weight vector, the parameters in front of the features. So now it's a linear function in the features. [01:08:05] But the thing is that the complexity measure, as we argued, corresponds to the 1-norm of w̃ — not the 2-norm.
[01:08:17] So this is why the max margin — with, you know, this norm constraint of at most one — corresponds to the max margin with this L1 norm. [01:08:38] Basically, the corresponding question is: you are maximizing the margin — the margin is the same, min over i of y_i⟨w̃, φ(x_i)⟩ — and you maximize over w̃, with the constraint that the 1-norm of w̃ is at most one. [01:08:55] And this is called the L1 SVM with feature map φ. [01:09:04] So the one difference from the SVM you learned in, for example, CS229 is that this is a 1-norm, not a 2-norm. So it's not doing a simple kernel SVM; it's doing something different from that. [01:09:21] And the interesting thing is that the L1 SVM is actually not implementable with infinite features. It's not implementable.
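Putting the pieces together, the program being described can be written as follows (my transcription; φ(x) collects the feature σ(ū^⊤x) for every unit vector ū):

```latex
f(x) \;=\; \langle \tilde w,\, \varphi(x)\rangle
     \;=\; \int_{S^{d-1}} \tilde w(\bar u)\,\sigma(\bar u^\top x)\, d\bar u,
\qquad
\max_{\|\tilde w\|_1 \le 1}\;\; \min_{i \in [n]}\; y_i\,\langle \tilde w,\, \varphi(x_i)\rangle .
```

The ordinary (kernel) SVM is the same program with an L2 ball, ‖w̃‖₂ ≤ 1, in place of the L1 ball.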
[01:09:41] When you take CS229, one of the messages is that when you use the kernel trick, you can actually work with infinite-dimensional features, because you can make everything depend on the kernel — the inner products of the features. You don't really care about the dimensionality of the features; you can work with infinite-dimensional features. [01:09:55] But here you don't have that kernel trick anymore. If you have the L1 constraint, the kernel trick doesn't apply: the final solution is not just a function of the inner products of the features, so you cannot apply the kernel trick, and that's why you cannot implement it with the kernel machinery. [01:10:21] So this part is purely for understanding. It's saying that the neural network is doing something more than what you can do with a kernel, because now you are effectively doing an L1 version of the kernel problem, which is something
[01:10:37] you are not able to do with the standard kernel trick — you have to use the neural network to achieve the same thing. [01:10:49] Now, we didn't prove that the L1 SVM is not implementable. How would you even prove that something is not implementable? You would have to say what you mean by "implementation." So really, maybe the easiest way to say it is that we don't know how to implement it, but it seems very unlikely to be doable. [01:11:19] Also, on the flip side, for the two-layer network we are saying that you can implement it: basically, you can effectively use a neural network to implement this L1 SVM. But the caveat is that you still don't know whether you
[01:11:37] can optimize the network. So it's not an end-to-end result: it says that if you assume you can optimize your network efficiently, up to a global minimum, then you can solve the L1 SVM. But there is a caveat about whether you can really, computationally, solve the network optimization — that's something we don't know how to do; we don't know how to prove it theoretically, though empirically it seems you can do it. [01:12:17] Okay. So I think this is all I wanted to say about the two-layer network. [01:12:26] Next, our goal will be to prove something about multi-layer networks, and for that we need more tools. So my plan is to spend the next 10 minutes talking about some of the tools, and we'll continue with the tools
[01:12:49] in the next lecture, and then we can talk about how to get better bounds for multi-layer networks. [01:12:56] But if there are any questions, I can answer those first. [01:13:00] It's a little bit awkward — I thought I had 20 minutes, but it was only 10. But I think it's okay; we can start with the simple thing. This will be, at least for the moment, a different mindset: we are just thinking about the tools again. [01:13:24] Okay. So now we are getting back to how to bound Rademacher complexity, and we are talking about a different type of tool. [01:13:43] But before doing that, let's think about a function-space view of Rademacher complexity. [01:13:55] So let me write down the Rademacher complexity first. This is something
[01:13:59] like the following: you have a function class F, and this is the empirical Rademacher complexity, defined on a sample S = (z_1, …, z_n). [01:14:27] Let's define the following set Q: it is a set of vectors, and the vectors are the outputs of f on these n points. [01:14:43] So for every function f ∈ F you get an n-dimensional vector, and Q is basically the set of outputs of F on the data points z_1 up to z_n — all the possible output vectors you can get by applying functions in F to this set of points. [01:15:10] Then you can rewrite the Rademacher complexity as follows: you look at all the possible vectors v in Q, and you take the inner product of σ with v, scaled by 1/n.
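In symbols (my transcription, with S = (z_1, …, z_n) the sample):

```latex
Q \;=\; \bigl\{\, (f(z_1),\dots,f(z_n)) \;:\; f \in \mathcal{F} \,\bigr\} \;\subseteq\; \mathbb{R}^n,
\qquad
R_S(\mathcal{F}) \;=\; \mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Bigl[\, \sup_{v \in Q}\; \tfrac{1}{n}\,\langle \sigma,\, v\rangle \,\Bigr].
```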
[01:15:36] And ⟨σ, v⟩ is really just the sum of σ_i v_i, which is the sum of σ_i f(z_i) — this is just rewriting. [01:15:45] So the point here is that this R_S(F) depends only on Q — on the outputs — and not on, for example, the parametrization of F. [01:16:10] Let me explain what I mean. Suppose you have a function class F with f(x) = Σ_i θ_i x_i, where θ is d-dimensional, and suppose you have another function class F′ of the following form — say, something like f(x) = Σ_i θ_i w_i x_i, where θ is d-dimensional and w is d-dimensional as well (never mind, this is just a weird example to demonstrate a point). [01:16:43] So suppose you have these two function classes
[01:16:51] right so one has v dimensional parts of a space and the other has 2D dimensional [01:16:54] a space and the other has 2D dimensional parametrical space but these two [01:16:56] parametrical space but these two functions have the same Q [01:17:01] corresponding Q because they're [01:17:03] corresponding Q because they're the the formula of outputs uh [01:17:07] the the formula of outputs uh um are the same because [01:17:10] um are the same because in some sense you can have a one to one [01:17:12] in some sense you can have a one to one knife between one a function in capital [01:17:15] knife between one a function in capital F and the function in capital F Prime [01:17:17] F and the function in capital F Prime because they are just a [01:17:20] because they are just a um [01:17:22] um um like like for every possible output [01:17:25] um like like for every possible output that can be output by the function f you [01:17:27] that can be output by the function f you can also find the one that can be output [01:17:29] can also find the one that can be output by some function in F Prime [01:17:32] by some function in F Prime so so so they have the different [01:17:34] so so so they have the different parametization but they have the same [01:17:36] parametization but they have the same functionality in some sense or the same [01:17:38] functionality in some sense or the same it's the same family of functions [01:17:41] it's the same family of functions um and they have the same q and then [01:17:43] um and they have the same q and then that means they have the same random [01:17:44] that means they have the same random marker complexity [01:17:46] marker complexity so so I guess you know I'm just trying [01:17:47] so so I guess you know I'm just trying to reinforce this idea that the only [01:17:50] to reinforce this idea that the only thing that matters is the outputs of the [01:17:52] thing that matters is the outputs of the functions but not how the 
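The example can be checked in a few lines. The second class's formula f(x) = sum_i (theta_i + w_i) x_i is my reconstruction of the board example (the audio is unclear), but any overparametrized rewriting with the same image behaves the same way:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
Z = rng.normal(size=(n, d))          # hypothetical sample points

# F : f(x) = <theta, x>,       theta in R^d        (d parameters)
# F': f(x) = <theta + w, x>,   theta, w in R^d     (2d parameters)
# Both classes realize exactly the linear functionals {x -> <a, x> : a in R^d},
# so they induce the same output set Q on z_1..z_n.
a = rng.normal(size=d)                        # pick any member of F
theta, w = a - 1.0, np.full(d, 1.0)           # the same function as an F' member
v_F, v_Fp = Z @ a, Z @ (theta + w)
assert np.allclose(v_F, v_Fp)                 # identical output vectors
print("same Q => same empirical Rademacher complexity")
```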
[01:17:58] And this is going to be useful as a general thing; it's kind of a change of mindset. Before, you were talking about the parameters: what are the parameters of f, how do you describe your parameters. From now on we are not going to think about the parameters that much; we are going to think about the outputs of the functions.

[01:18:22] And there's the so-called Massart lemma, which is actually one of the things you are asked to prove in the homework. This lemma says that if this Q satisfies, let's say, that for every vector v in Q,

||v||_2 / sqrt(n) <= M,

so the set Q contains only bounded vectors in this sense. By the way, from now on we're going to do this kind of thing very often: you measure a vector by its normalized norm. The norm itself doesn't matter that much; you want to normalize the norm by the dimension of the vector.

[01:19:38] [Student question.] Right, that's right, but I think this is actually a very good question, which I probably should have talked about earlier; I think I mentioned it a little bit at some point. One of the nice things about the empirical Rademacher complexity is that now you are in the mindset that your z_i's are fixed. You don't have any randomness in the z_i's; they are just the n points, fixed. Of course the function can change, since you have a family of functions, but you are not changing the z_i's.
[01:20:21] That simplifies things a lot in some sense. And so you can think of the family of functions as functions that map the z_i's to real numbers, not functions that map R^d to real numbers. You can forget about any other points: you just have n inputs, and every function can be represented as n numbers, which are its outputs on those inputs. There is nothing else you have to care about. That's kind of the beauty of the Rademacher complexity, and kind of why it's powerful: before, the z's were the source of the randomness, but now the randomness comes from the sigmas. So that's why you can fix the z_i's.

[01:21:31] [Student question.] I think the exact phrasing isn't quite right, but you are in the right direction. So basically the Rademacher complexity depends on how complex this set Q is; that's what I'm going to say. And actually, as you'll see (I think we have mentioned this before), if Q is not very complex, for example if Q is a finite set, then you have a good bound on the Rademacher complexity. Of course, how do you measure the complexity of Q? That's a question that we have to study. But, for example, if Q is finite, then you have a bound on the Rademacher complexity; that's what I'm going to write. So suppose, then, you need two things: one is that Q is finite, and the other is that Q is roughly bounded.
[01:22:41] With these two things, this expectation over sigma, which is equivalent to the Rademacher complexity, is bounded:

R_S(F) <= sqrt( 2 M^2 log|Q| / n ).

So the size of Q comes into play. And I guess, as a corollary (I think this corollary is something I have presented before, without a proof): if F satisfies that the functions are bounded on the z_i's in the following sense, namely that the average squared output is bounded by M^2,

(1/n) sum_i f(z_i)^2 <= M^2 for every f in F,

then the Rademacher complexity of F is bounded by

R_S(F) <= sqrt( 2 M^2 log|F| / n ).

Okay. So that's the relatively easy case, where you have a finite hypothesis class. All right, and this is a homework question; I assigned it as a homework question, and I think there's a hint, which is actually pretty important: you should consider using the moment generating function, which will make the math easier.
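A quick numerical sanity check of the bound (a sketch; the random Q and all sizes are my own choices): draw a finite Q, estimate the left-hand side by Monte Carlo, and compare against sqrt(2 M^2 log|Q| / n):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 8                        # n sample points, |Q| = k vectors (my choices)
Q = rng.normal(size=(n, k))          # a finite Q, columns are the vectors

M = np.max(np.linalg.norm(Q, axis=0)) / np.sqrt(n)   # ||v||_2 / sqrt(n) <= M
massart = np.sqrt(2 * M**2 * np.log(k) / n)          # the lemma's right-hand side

# Monte Carlo estimate of E_sigma[ sup_{v in Q} (1/n) <sigma, v> ]
est = np.mean([np.max(rng.choice([-1.0, 1.0], size=n) @ Q) / n
               for _ in range(4000)])
print(f"MC estimate {est:.4f} <= Massart bound {massart:.4f}")
assert est <= massart
```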
[01:24:08] Actually, there are two ways to prove it. The other way is that you do discretization plus union bound, but you will have a relatively hard time that way, just because the constants are hard to get right; you can work out a similar bound, but it's a little less nice. The moment generating function is really cool here, and the proof is actually pretty short if you use it in the right way.

[01:24:34] Okay, so let me just briefly give a quick overview of what we're going to do next, so that you can appreciate why I'm setting things up this way. The next question is: what if Q is not finite? What do we do? And our answer would be that you do some discretization plus union bound: basically, you have some cover of the set, and then you have a union bound. Or maybe I should say: we use this discretization to reduce to the finite case. That's basically the idea, and you've probably seen this idea before, maybe in the third lecture, when we talked about infinite hypothesis classes. But here the difference is that you are discretizing the output space, the set Q, which is a set of n-dimensional vectors. Before, you were discretizing the parameter space: you had a d-dimensional parameter space and you discretized that. Here you are doing a more fundamental discretization, because, as I argued, in the end the parametrization is probably not the most important thing; what's really important is the functionality of this family of functions. So now you are discretizing in the right space.
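Here is a small sketch of that reduction (the one-parameter sine class and the greedy net are illustrative choices of mine, not the lecture's formal construction): build an epsilon-cover C of Q under the normalized L2 metric; then for every sign vector sigma, the sup over Q exceeds the sup over C by at most epsilon, so Massart applied to the finite cover controls the whole class up to epsilon:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
Z = rng.uniform(-1, 1, size=n)

# Illustrative "infinite" class (my choice): f_theta(z) = sin(theta * z),
# with a dense grid of thetas standing in for the continuum.
thetas = np.linspace(0.0, 10.0, 2000)
Q = np.sin(np.outer(Z, thetas))                # columns are output vectors

def rho(u, v):
    return np.linalg.norm(u - v) / np.sqrt(n)  # normalized L2 metric

def greedy_cover(Q, eps):
    """Greedy epsilon-net over the columns of Q: take an uncovered vector,
    add it to the cover, drop everything within eps of it; repeat."""
    cover, remaining = [], list(range(Q.shape[1]))
    while remaining:
        c = remaining[0]
        cover.append(c)
        remaining = [j for j in remaining if rho(Q[:, c], Q[:, j]) > eps]
    return Q[:, cover]

eps = 0.1
C = greedy_cover(Q, eps)
print(f"|cover| = {C.shape[1]} out of {Q.shape[1]} vectors")

# For any sign vector: sup over Q <= sup over the finite cover + eps
# (Cauchy-Schwarz, since ||sigma||_2 = sqrt(n)).
sigma = rng.choice([-1.0, 1.0], size=n)
assert np.max(sigma @ Q) / n <= np.max(sigma @ C) / n + eps + 1e-9
```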
[01:26:07] This is a more fundamental space, because it is the space of the outputs. So what we're going to do is discuss a few techniques to discretize this Q, what kind of discretization you really need, and so on and so forth. And there's actually a pretty deep theorem, the so-called chaining theorem, the Dudley chaining theorem, which actually requires you to discretize in a hierarchical way, so that you can have the best possible discretization. This is something beyond what we have done before: even if you don't care about the difference between the output space and the parameter space, it lets you discretize in a much more efficient fashion. So that's what we're going to do next, and then we're going to use this for multilayer networks. Okay, sounds good. I think that's all for today.

================================================================================ LECTURE 009 ================================================================================ Stanford CS229M - Lecture 9: Covering number approach, Dudley Theorem Source: https://www.youtube.com/watch?v=wDfardbL50I --- Transcript

[00:00:05] Okay, I guess let's get started. Is this working? Right, yeah. So last time, where we ended up was: we talked about this view that you view the function class F, in some sense, as equivalent to a set Q. If you have a function class F, you can define this Q to be the set of vectors of this form, basically the output vectors (f(z_1), ..., f(z_n)), which are vectors in R^n, where f ranges over the class F. So from the Rademacher complexity perspective these two objects are not very different: the Rademacher complexity of F only depends on Q.
[00:01:05] And we also talked about the case where you have a finite Q, or a finite F in some sense. Sometimes, actually, even when you have an infinite F you can have a finite Q, but that's a technicality. In this case, what you can show is that you have a Rademacher complexity bound; this is the so-called Massart lemma. We're saying that if your Q satisfies (this is from the end of the last lecture) that for every vector v in Q,

||v||_2 / sqrt(n) <= M,

then we know that this quantity, which is essentially the Rademacher complexity of F, is bounded by

R_S(F) <= sqrt( 2 M^2 log|Q| / n ).

And if you translate this back to the function class, then you know that if F satisfies that for every f in F,

( (1/n) sum_i f(z_i)^2 )^{1/2} <= M,

so f is bounded in average by M (you can view this as an average size of f, a quadratic mean rather than the mean), then you have that the Rademacher complexity of this function class F is bounded by

R_S(F) <= sqrt( 2 M^2 log|F| / n ).

[00:03:06] So this time we're going to deal with the case where you don't have a finite hypothesis class. If you have an infinite Q or F, then what do you do? What we're going to do is a discretization, but now we're discretizing in the Q space, the output space of F. Before, in one of the previous lectures, we discretized in the parameter space, and now we are going to discretize in this more fundamental space, the output space, because, as we argued, the output space is what's really fundamentally important; the parametrization is just something that influences the output space.
[00:03:55] If you have the same output space but different parametrizations, then actually the function classes are not different; so the parametrization is not the most fundamental thing here. What we're going to do is discretize the output space.

[00:04:14] And we still have this concept of epsilon-cover. So now we are going to cover the output space Q, or the output space of F, by a so-called epsilon-cover. Let's recall the definition of epsilon-cover: C is an epsilon-cover of Q (now I'm talking about an epsilon-cover of Q; before we called it an epsilon-cover of some other set, I've just changed the variable) with respect to some metric rho if for any vector in Q there exists a vector in C that covers it, and by "covers it" we mean that the distance between the two vectors is at most epsilon.

[00:05:05] And let me also define the so-called covering number, which is a quantity we're going to use very frequently. The covering number N(epsilon, Q, rho) has several arguments: the target radius epsilon, the set Q, and the metric rho. It is defined to be the minimum size of an epsilon-cover of Q with respect to rho; so this is the minimum possible size of a covering.

[00:06:13] And in some sense we can use this covering number in two ways: one way is to talk about a covering of Q, and the other way is to talk about a covering of F.
[00:06:36] Even though I think the fundamental object is Q, in the literature, if you read papers, in most cases people talk about a covering of the function class, at least in many papers. So we're going to use that language, but the two are essentially the same. So basically, let's first state this for a covering of F; it's the same thing. C is an epsilon-cover of the function class F if it satisfies that for every f in F there exists f' in C such that rho(f, f') is at most epsilon. So it's literally the same thing.

[00:07:12] And also we're going to choose the metric rho to be the same for the two views. Basically, in the Q perspective, between two vectors v and v' you choose

rho(v, v') = (1/sqrt(n)) * ||v - v'||_2,

recalling that both v and v' are n-dimensional vectors in R^n. Sorry, there's no square there: this is just a normalized version of the L2 distance. The reason we normalize by one over square root n is just that it's more consistent; the normalization fundamentally doesn't matter, right, whatever normalization you choose doesn't change the essence. The reason we choose this normalization here is simply for consistency with the function-space view, where, if you have two functions f and f', you define rho as follows. Recall that we only restrict our functions to the finite set of points z_1 to z_n, so the typical definition of the distance would just be the L2 distance on this set of points.
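The two views of this metric agree exactly: the normalized L2 distance between output vectors equals the quadratic mean of the pointwise differences on the sample. A quick check, with hypothetical functions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
Z = rng.normal(size=n)               # hypothetical sample z_1..z_n

# Two illustrative functions (my choices, not the lecture's):
f  = lambda z: np.tanh(2.0 * z)
fp = lambda z: np.tanh(1.5 * z)

# Function-space view: quadratic mean of the differences on z_1..z_n.
rho_fn = np.sqrt(np.mean((f(Z) - fp(Z)) ** 2))

# Vector-space view: normalized L2 distance between the output vectors in Q.
v, vp = f(Z), fp(Z)
rho_vec = np.linalg.norm(v - vp) / np.sqrt(n)

assert np.isclose(rho_fn, rho_vec)   # one metric, two notations
print(f"rho_L2(Pn)(f, f') = {rho_fn:.4f}")
```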
between these two functions on zis [00:08:36] and then you you take the quadratic [00:08:39] and then you you take the quadratic average [00:08:40] average and then you take the [00:08:42] and then you take the um basically called jacket average of [00:08:44] um basically called jacket average of the difference between F and F Prime [00:08:46] the difference between F and F Prime onset of CIS and you can see that these [00:08:48] onset of CIS and you can see that these are exactly the same row [00:08:51] are exactly the same row um just a view you can view them in [00:08:53] um just a view you can view them in either the function space or you can [00:08:55] either the function space or you can view it in the uh the vector space [00:08:58] view it in the uh the vector space and typically people write this row as [00:09:01] and typically people write this row as row to [00:09:03] row to PN so so the reason I guess you know for [00:09:07] PN so so the reason I guess you know for those who are not familiar with just a [00:09:09] those who are not familiar with just a Sim compass [00:09:11] Sim compass arbitrary conflict symbols to indicate [00:09:13] arbitrary conflict symbols to indicate this but for those of you who are a [00:09:15] this but for those of you who are a little bit more familiar with some of [00:09:17] little bit more familiar with some of this function analysis so I think the [00:09:19] this function analysis so I think the idea is that p n this is the empirical [00:09:22] idea is that p n this is the empirical distribution [00:09:27] basically uniform [00:09:31] basically uniform over Z1 up to zero [00:09:34] over Z1 up to zero and L2 [00:09:36] and L2 of PM means that you have a L2 metric [00:09:39] of PM means that you have a L2 metric defined on this empirical distribution [00:09:42] defined on this empirical distribution this uniform Distribution on the on the [00:09:44] this uniform Distribution on the on the script but if you don't know 
[00:09:46] But if you don't know where these come from, no worries, right; let's just treat it as an abstract symbol, just because, you know, I'm going to use this symbol several times, just for formality, but it really just means this. [00:10:04] Okay, so with this view, basically, you know, as we have said, right, F corresponds to Q, and a function f corresponds to the vector (f(z_1), ..., f(z_n)) in Q, and it's a one-to-one correspondence. Also the metrics rho correspond to each other, so in some sense you can write down the correspondence: if you look at the function space view with the metric rho, then the covering number is the same as when you view it in the output space, the vector space, and you use the metric normalized L2 norm. And one of the reasons why we normalize by
something that depends on n is just because n is the dimension, and n is something that's changing, so in some sense it makes sense to normalize by that, because if you have a vector with changing dimension, sometimes it's hard to compare different cases. So that's why you want to have a norm that doesn't depend on dimensionality. [00:11:11] And from now on we're going to write in the function space view notation, the f notation, but in my mind I'm always thinking about the output space, because that's just a vector space, which is much easier to think about. [00:11:27] Okay, and also, you know, the formal theorems will be stated in the function space, but when I prove them I'm going to change to Q, just to make it more explicit. And here's a theorem that kind of deals with
[00:11:44] this; in some sense this is a kind of trivial discretization. What we're going to do is first discuss this, and then a more advanced discretization, which is called chaining. So the trivial version is the following, which is in some sense basically the same, in spirit, as what we have done in Lecture 3, but here we are doing the function space version. So let F be a family of functions from some space Z to [-1, 1]; so I assume these functions are bounded between -1 and 1. And then for every epsilon larger than zero, you can show the following: the Rademacher complexity is at most epsilon plus, let me write it down and then interpret it, the square root of 2 times the log of the covering number with radius epsilon, over n:

R_S(F) <= epsilon + sqrt( 2 log N(epsilon, F, L2(P_n)) / n ).
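As a numeric sketch of how this bound is used (an illustration, not from the lecture): since the bound holds for every epsilon > 0, one can evaluate epsilon + sqrt(2 log N(epsilon) / n) over a grid and keep the smallest value. The covering-number form log N(epsilon) = d log(1 + 2/epsilon) below is a hypothetical choice, typical for a bounded d-dimensional class:

```python
import numpy as np

def discretization_bound(log_cover, n, eps_grid):
    """The bound holds for every eps > 0, so optimize over a grid:
    min_eps [ eps + sqrt(2 * log N(eps) / n) ]."""
    vals = [eps + np.sqrt(2.0 * log_cover(eps) / n) for eps in eps_grid]
    i = int(np.argmin(vals))
    return float(eps_grid[i]), float(vals[i])

# Hypothetical covering number of a bounded d-dimensional class (assumption).
d = 5
log_cover = lambda eps: d * np.log(1.0 + 2.0 / eps)

best_eps, best_val = discretization_bound(
    log_cover, n=10_000, eps_grid=np.geomspace(1e-4, 1.0, 400))
```

The optimal epsilon trades the discretization error against the cover's Rademacher complexity, exactly the two terms the proof below separates.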
[00:12:56] And we're going to show, you know, how to prove this, and when we prove it, you'll see that the epsilon is in some sense the discretization error, and the other term is in some sense from the Rademacher complexity of the finite epsilon-cover. [00:13:18] We'll see this more clearly in the proof. So in some sense the general idea of the proof is that you approximate F by an epsilon-cover, and maybe let's call it C, or maybe let's not give it a name; but you have some cover, and then, when you have the epsilon-cover, for the epsilon-cover you have a Rademacher complexity bound, and then you pay something because of the discretization, or the approximation. [00:13:56] Okay, so when we prove it, as I said, you know, I tend to change to the vector space view, just because then you don't need all of those kind of arguments about function space. So let's
um, let's say: let C be an epsilon-cover of Q, where Q is the output set, the same thing as before, right. And let's say its size is N, which is equal to the minimum covering number, right, which is just the same, as we claimed, as the covering number of the function class, N(epsilon, F, L2(P_n)). [00:14:57] Okay, so now, if you look at the Rademacher complexity of the function class, as we claimed, this is in some sense the same as the complexity of the output set. [00:15:08] And now, what you do is you say: I'm going to approximate v by the nearby point in the cover, right. So suppose you have the set Q, and I have a vector v, and I know that v is covered by something, right; you have some cover like this, and you know that this point v is covered by, for example, this point v prime in the set C. Right, every point in C, recall, because
it's there to cover a certain family of points, right, it can cover its neighbors within some radius, and you know that every point can be covered by some vector in C. So the vector v can be covered by v prime, let's say; so then you know that the distance between v and v prime is less than epsilon. And then you can approximate: for every v, find a v prime in C such that the distance is less than epsilon, and then you can write the inner product with sigma, in some sense distributed, as

<v, sigma> = <v', sigma> + <v - v', sigma>.

Right, maybe let's call z = v - v', so it's <v', sigma> plus <z, sigma>. [00:16:26] Right, and what you know is that z is small, because of the distance, right: you know that z, in this distance, recall that we are using the scaled L2 norm, satisfies ||z||_2 / sqrt(n) <= epsilon. This is what we know. So then, for <z, sigma>, you can use, I think, Cauchy-Schwarz, right: the inner product of two vectors is
less than the product of the two-norms of the two vectors. So this is at most sqrt(n) times epsilon, times the norm of sigma, which is sqrt(n):

(1/n) <z, sigma> <= (1/n) ||z||_2 ||sigma||_2 <= (1/n) (epsilon sqrt(n)) (sqrt(n)) = epsilon.

[00:17:08] Right, so basically we know that this error term is at most epsilon, by doing this. And then, so now we can go back to the Rademacher complexity. [00:17:24] You first use these few things, right: it's less than epsilon, right, because the normalized inner product of z with sigma is less than epsilon, and this epsilon can go outside of all of those things because epsilon is a constant. So then you get

E[ sup_{v' in C} (1/n) <v', sigma> ] + epsilon.

And here, what's the range of v prime? So v prime always has to be in C, right; there's no other way, this is our definition of v prime: v prime is the cover point in C. So then, if you look at this, I guess this is an equality, sorry.
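The Cauchy-Schwarz step can be sanity-checked numerically; this sketch (an illustration, not from the lecture) draws random z with ||z||_2 / sqrt(n) = eps and random sign vectors sigma, and confirms (1/n)<z, sigma> <= eps every time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100, 0.3

for _ in range(1000):
    # Scale z so the normalized L2 norm is exactly eps: ||z||_2 = eps * sqrt(n).
    z = rng.normal(size=n)
    z *= eps * np.sqrt(n) / np.linalg.norm(z)
    sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    # Cauchy-Schwarz: <z, sigma> <= ||z||_2 * ||sigma||_2 = eps * sqrt(n) * sqrt(n),
    # so dividing by n gives at most eps.
    assert z @ sigma / n <= eps + 1e-12
```

In practice the inner product is far below the worst-case eps, which is exactly the slack that chaining will exploit later in the lecture.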
[00:18:18] And then, for this remaining term, you can use Massart's lemma; this is the Rademacher complexity of the set C, the cover set C, and using Massart's lemma you get sqrt(2 log |C| / n), plus epsilon. [00:18:34] And we are done, right: C has the size we said, so this is just

sqrt( 2 log N(epsilon, F, L2(P_n)) / n ) + epsilon.

[00:18:53] Okay, so pretty simple. Any questions so far? Okay. [00:19:02] So now let's talk about a stronger theorem, and this is, in my opinion, a pretty deep theorem, because at least, you know, I probably don't have much intuition about it; but, you know, hopefully after I show the proof it's intuitive, but it is something non-trivial. And generally this type of technique is called chaining, and there could be many more ways to do this kind of chaining in different situations. So here the particular theorem is called Dudley's theorem.
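Massart's lemma as used here says that for a finite set C in R^n with ||v||_2 <= sqrt(n) for every v in C, E[ sup_{v in C} (1/n) <sigma, v> ] <= sqrt(2 log |C| / n). A Monte Carlo sketch against a random finite set (the set itself is an assumption, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 20  # dimension n, cover size |C| = m

# A finite set C with entries in [-1, 1], so ||v||_2 <= sqrt(n) for each row.
C = rng.uniform(-1.0, 1.0, size=(m, n))

# Monte Carlo estimate of E[ sup_{v in C} (1/n) <sigma, v> ]
sigmas = rng.choice([-1.0, 1.0], size=(20_000, n))
rademacher_C = float((sigmas @ C.T / n).max(axis=1).mean())

# Massart's bound for a set of this size with this norm constraint.
massart = float(np.sqrt(2.0 * np.log(m) / n))
```

The estimate typically sits well below the bound, since Massart's lemma is worst-case over the geometry of C.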
[00:19:39] So the theorem is saying: let F be a family of functions from Z to R. So here actually I relax this even further, because this theorem is more general; it can work even for, you know, functions that are not bounded. And the Rademacher complexity is bounded by the following. Let me write it down; it doesn't look very intuitive in the beginning, but I will explain. So it's an integral; the variable is epsilon, so you are integrating a function of epsilon from 0 to infinity, and you look at the covering number for different epsilon, and you divide by square root of n. So the integrand is the square root of the log of the covering number, over square root of n:

R_S(F) <= integral_0^infinity sqrt( log N(epsilon, F, L2(P_n)) / n ) d epsilon.

[00:20:47] And at first, you know, it's not even clear whether this is a stronger theorem than before, because, you know, it's not trivial to compare with the previous one; but actually you can compare, if you do some work.
[00:21:03] Um, so probably, you know, what I'm going to do is show the proof, and then I'm going to interpret this, because, you know, I think from the proof it's pretty obvious that you're going to get a stronger statement; but if you just compare the forms, you know, it's not that trivial to compare. But from the proof you can see that the proof technique is an extension of the previous proof technique, and it's pretty obvious that you should expect a stronger theorem. And then, you know, later I'm going to compare them and also interpret this, because, you know, this form by itself is still somewhat hard to use, right: how do we know whether I can integrate something good out of this, right? So I'm going to, you know, give you several cases where you can integrate a good number out of this integration.
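To see that this integral can produce a finite number, here is a sketch (an illustration, not from the lecture) that evaluates it numerically for a hypothetical covering-number guess, log N(epsilon) = d log(1 + 2B/epsilon) for epsilon below B, and N(epsilon) = 1 beyond B (once epsilon reaches the radius B of the set, one ball covers everything and the integrand vanishes):

```python
import numpy as np

def dudley_integral(log_cover, n, upper, num=100_000):
    """Riemann-sum approximation of int_0^upper sqrt(log N(eps)) / sqrt(n) d eps.
    Beyond `upper` we take N(eps) = 1, so the integrand is zero there."""
    step = upper / num
    eps = np.linspace(step, upper, num)
    integrand = np.sqrt(np.maximum(log_cover(eps), 0.0)) / np.sqrt(n)
    return float(integrand.sum() * step)

# Hypothetical covering number of a bounded d-dimensional class (assumption).
d, B = 5, 1.0
log_cover = lambda eps: d * np.log1p(2.0 * B / eps)

bound = dudley_integral(log_cover, n=10_000, upper=B)
```

The integrand blows up like sqrt(log(1/epsilon)) near zero, which is integrable, so the entropy integral is finite here and shrinks like 1/sqrt(n).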
[00:21:51] So that's the plan. [00:21:54] All right, so now let's dive into the proof itself: how do we prove this, and what's the intuition? So let's start with the intuition. This is actually probably one of the more technical proofs in this course. So the intuition is that, um, as I'm thinking about whether I should draw a single figure: I've drawn a lot of figures in my lecture notes, but I think it's going to be challenging for the scribe note-takers to reproduce all of them in the notes, so I'm thinking if I should draw one. Yeah, maybe I'll draw multiple and let the scribe note-takers figure out how to merge them, if they want. [00:22:45] Um, so the intuition is, let me draw this again. So you have this set Q, and what we have done was that you create a cover, an epsilon-cover, right; it covers this, and every center
is one point in C, and you want all of these, you know, balls to cover your set, right. [00:23:05] So, and what we have done was that you have a vector v here, and you say: I'm going to approximate v by v prime, plus the distance; so basically you approximate v by v prime plus the difference z. So this is all fine. The problem is that, you know, you have this formula; let me just write it again. [00:23:37] So the tricky thing is: how do you deal with this error term <z, sigma>, right? So what we did before was that we have a very brute-force inequality, saying that this is less than the two-norm of z times the two-norm of sigma, over n. And when can this be tight? This can be tight only if z is perfectly correlated with sigma, which just cannot happen always, right, because z is a vector which is the difference between v and v prime; it
could be correlated with sigma, you know, sometimes. So, by the way, this ball, you know, I draw it like a ball, but the set could be of a different shape, right. Because if everything were really a ball, right, suppose Q were really just a nice round body, everything would become too trivial for us, right. So Q is the set, and there is some metric defined on it, and this metric is potentially, sorry, the metric is trivial, but the set itself could be complicated, because you don't really know what the set looks like, right: it's the image of a function class on some set of points. Right, so these covering balls are all balls, but the set itself could be somewhat weirdly shaped. So that's
why this z may not always be correlated with sigma; in the worst case it can be, but, you know, not always. So basically the question is, you know: can we strengthen this inequality here? Like, why does this have to be worst case? [00:25:16] So if you think about this, right, um, what is the expectation of the sup? So basically what you really care about is the following. [00:25:32] So let me just write it down; let me do this a little bit slowly. [00:25:43] So you do this inequality: you first say that this is less than the expectation of the sup of the first term, plus the expectation of the sup of the second term. This is because, you know, I guess we have claimed that, you know, E[sup (A + B)] is always less than E[sup A]
plus E[sup B]. [00:26:17] Right, so the first thing you can do is this, and then you care about this second term. And before, as I said, you know, we have a very worst-case inequality for the inner product; but actually this quantity itself, you know, may not be worst case, right, because here z is, in some sense, in this ball around v prime, right. So you have this ball around v prime, which is the ball, and the z is in this ball. So if this ball is not, like, you know... sometimes this z is in the... and you can make this cover, you know, of a certain shape, so that z is in this ball, in some sense this ball intersected with Q. If it's really a full ball, I think the worst-case inequality is tight; but actually you are intersecting this ball with Q, and Q could be weirdly shaped. So if you look at
this, then this term could still possibly be small, because, you know, if this ball intersected with Q is of small complexity, right. [00:27:26] So basically the idea is: what you do is, for the first inner term, you just pay the log of the covering number; but for the second term, you do another round of, like, discretization. Because you don't want to say that z can be worst case; I want to say that z probably cannot be worst case, z has to have some structure, so I'm going to discretize it again. [00:27:48] Sorry, I mean, how do I turn this off? Okay. Um, wait, why am I having this? Sorry. So everyone in the lecture, everyone in the Zoom meeting can hear me, right? Sorry, I forgot to take off the... [00:28:24] Can you still hear me? Okay, I hope you can hear me, okay; thank you, thanks. Okay, cool, sorry, I'm back. I forgot
to take it off. Okay, so basically the kind of idea is that this term is still a Rademacher complexity, of v minus v prime in the ball intersected with Q, and you can do another round of this discretization for this set, so that you get, you know, an even tighter inequality. So that's kind of the rough idea: basically you have nested layers of, like, this discretization, to make it stronger and stronger. [00:29:09] So that's the basic idea, and now let's make it a little more formal, so that I can define something and explain the increments more. So let's say we have, so I guess, you know, maybe just to briefly draw this a little bit: so what you do is you do another discretization of this yellow ball, and then you say that this z cannot be worst case; it has to be something like: z can be approximated by this plus this.
I'm not sure whether this drawing is too small — I will draw a bigger figure, but basically this point z is now approximated by its nearest neighbor again, and then you look at the difference, and then you approximate the difference by something else; I'll draw this more formally in a moment. [00:30:10] So, to do that, let's define ε₀ to be the sup over f ∈ F of the max over i of |f(z_i)|. So this is just the maximum possible value that you can output, and you can see that this gives some preparation which is almost trivial: ε₀ is always bigger than (1/√n)‖v‖₂ for every v ∈ Q, because ε₀ is bigger than each of the |f(z_i)|. So basically ε₀ is an upper bound on the radius of the entire set: you never have to talk about any ε bigger than this, because everything is in this ball of radius ε₀.
[00:31:13] And now I'm going to create this nested — well, technically it's not nested, but I've always thought about it as a nested family of discretizations; technically you don't really need the nested part. So let me define things first. I'm going to consider ε₁ to be half times ε₀, ε₂ is a quarter of ε₀, and in general ε_j is 2^(−j) ε₀.
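The almost-trivial preparation above can be written out explicitly: for every v = (f(z₁), …, f(z_n)) ∈ Q,

```latex
\frac{1}{\sqrt{n}}\|v\|_2
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n} v_i^2}
  \;\le\; \max_{1 \le i \le n} |v_i|
  \;\le\; \sup_{f \in \mathcal{F}} \max_{1 \le i \le n} |f(z_i)|
  \;=\; \epsilon_0 ,
```

so Q sits inside the ball of radius ε₀ under the normalized ℓ₂ metric, which is why no radius larger than ε₀ ever needs to be considered.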
[00:31:54] So these are the radii for my ε-covers, and let C_j be an ε_j-cover of the set Q. So I have this family of ε-covers, and intuitively you can kind of think of these covers as nested across levels — C_{j+1} nested with C_j in some sense — but this is not necessary for the proof; I just like to think of it like that to give me some intuition. So what's really happening — let me draw it. I have this set Q; maybe I should draw a ball so that it's a bit more interesting. So this is the set Q, and there is the biggest radius, which is ε₀, which covers everything; let me now draw that.
[00:33:09] If you use the ε₀-cover, then it's trivial, because at radius ε₀ you can just use a trivial cover: you just need one point to cover everything, so you just need the origin. That's not all, though — let's draw, say, ε₁. What happens is that you have a very coarse cover at the beginning, something like this. [00:33:38] So this is your ε₁-cover. And suppose I have a point, let's say here; this is my point v that I want to approximate by the cover. And suppose this is the origin — maybe let me draw this v somewhere else; maybe it will be here. So let's call this u₁; this is the closest point to v in the first level of the ε-cover. So before, I just used u₁ to approximate v.
[00:34:31] And now what I'm going to do is first use u₁, and then consider the second level of the ε-cover, which is of a smaller radius — actually half the size. By the way, this number two is nothing magical; you can make it three or four. It's just for convenience; you just need a constant factor smaller at every level. So you have this, for example — this is the second level — and what you do is you say: I'm going to take the point u₂ here, where u₂ is the nearest neighbor of v in the second level, and then I'm going to approximate v by u₁ plus this vector between u₂ and u₁. So then I have a smaller distance between v and u₂, right? And then I'm going to have the third level — I'll only draw three levels.
[00:35:35] Suppose in the third level you have another point here, and this is u₃, and then you also consider this vector between u₂ and u₃. So basically you approximate v by this vector, plus the green vector, plus the yellow vector, and then you continue to do this until you get to v. [00:35:58] Any questions so far? So basically I'm going to approximate v by u₁ + (u₂ − u₁) + (u₃ − u₂) + ⋯, up to infinity, because I'm going to have an infinite number of these covers. It doesn't have to be exactly an infinite number of them — if you have a fine-grained enough approximation you can stop — but for simplicity let's just say we have an infinite sequence of ε-covers, and you can do this. So more formally, what I'm going to do is that for every v ∈ Q — I guess this is just a formal definition — let u_j be its nearest neighbor in C_j.
[00:36:59] By definition — because v has to be covered by C_j — the distance between v and u_j is less than ε_j; in other words, (1/√n)‖v − u_j‖₂ ≤ ε_j. And also, because ε_j goes to zero, we know that u_j goes to v as j goes to infinity. So that's why you can write this nested sum: you can write v as u₁ + (u₂ − u₁) + (u₃ − u₂) and so forth. And if you let u₀ be zero, then you can write this as (u₁ − u₀) + (u₂ − u₁) + ⋯ — this is just to make it look nicer — so that we can write it as a sum: v = Σ_{i=1}^∞ (u_i − u_{i−1}). And you can check the convergence, if you really want.
[00:38:20] Because if you look at the partial sum, Σ_{i=1}^m (u_i − u_{i−1}) is just u_m − u₀ = u_m, and this partial sum goes to v as m goes to infinity, so the sum converges to v. And technically, if you really want a careful proof, you don't actually have to use an infinite sum — I'm just trying to make it simpler. You can just choose an m that is big enough and then pay some small error at the end; that's also fine. [00:38:58] Okay, so once we do this, as we planned, we have these better and better approximations, and now let's bound each of these terms one by one.
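As a sanity check on this construction, here is a small numerical sketch (my own illustration, not code from the course): it builds greedy ε_j-covers of a random finite set Q under the normalized ℓ₂ metric and verifies that the telescoping partial sums approximate a chosen v ∈ Q to within ε_j at every level. The helper names `greedy_cover` and `nearest` are hypothetical; any valid ε_j-cover would work.

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedy eps-cover of the rows of `points` under the
    normalized l2 metric d(u, w) = ||u - w||_2 / sqrt(n)."""
    n = points.shape[1]
    cover = []
    for p in points:
        if all(np.linalg.norm(p - c) / np.sqrt(n) > eps for c in cover):
            cover.append(p)
    return np.array(cover)

def nearest(p, C):
    """Nearest neighbor of p among the rows of C."""
    return C[np.argmin(np.linalg.norm(C - p, axis=1))]

rng = np.random.default_rng(0)
n = 20
Q = rng.normal(size=(200, n))    # stand-in for {(f(z_1), ..., f(z_n)) : f in F}
eps0 = np.max(np.abs(Q))         # eps_0 = sup_f max_i |f(z_i)| for this finite Q

v = Q[-1]                        # the point we chain towards
u_prev = np.zeros(n)             # u_0 = 0
chain = np.zeros(n)              # running partial sum of (u_j - u_{j-1})
for j in range(1, 8):
    eps_j = eps0 / 2 ** j        # eps_j = 2^{-j} eps_0
    C_j = greedy_cover(Q, eps_j)
    u_j = nearest(v, C_j)
    chain += u_j - u_prev        # add the increment; the partial sum telescopes to u_j
    # the cover property guarantees the approximation error is at most eps_j
    assert np.linalg.norm(v - chain) / np.sqrt(n) <= eps_j
    u_prev = u_j
```

The assertion inside the loop is exactly the bound (1/√n)‖v − u_j‖₂ ≤ ε_j that drives the argument.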
[00:39:15] So what we have is that the expectation of the sup, E_σ sup_{v∈Q} (1/n)⟨σ, v⟩, becomes the expectation of the sup of the sum Σ_{i=1}^∞ (1/n)⟨σ, u_i − u_{i−1}⟩, and then you switch the sum with the sup, so you get that this is less than the sum of the expectations of the sups. [00:40:14] And here the constraint is that u_i needs to be in C_i and u_{i−1} needs to be in C_{i−1}. So in some sense each of these quantities is some kind of Rademacher complexity — but of a finite class, because u_i and u_{i−1} are not arbitrary vectors; they have to come from a finite set. And then we just have to see what the Rademacher complexity of this set is, and then continue with the derivation. [00:40:51] So, okay, let's try to deal with each of these terms. We're trying to use Massart's lemma, right? Massart's lemma deals with exactly this kind of term for a finite set.
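For reference, the switch of the sum and the sup written out:

```latex
\mathbb{E}_\sigma\Bigl[\sup_{v\in Q}\frac{1}{n}\langle\sigma,v\rangle\Bigr]
= \mathbb{E}_\sigma\Bigl[\sup_{v\in Q}\sum_{i=1}^{\infty}\frac{1}{n}\langle\sigma,\,u_i-u_{i-1}\rangle\Bigr]
\;\le\; \sum_{i=1}^{\infty}\mathbb{E}_\sigma\Bigl[\sup_{u_i\in C_i,\;u_{i-1}\in C_{i-1}}\frac{1}{n}\langle\sigma,\,u_i-u_{i-1}\rangle\Bigr].
```

Pushing the sup inside the sum can only increase each term, since the sup is then taken separately over each pair of levels.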
[00:41:09] So first of all, the pair of u_i and u_{i−1} are the variables, and they are in C_i × C_{i−1}, and the size of C_i × C_{i−1} equals |C_i| times |C_{i−1}|. So this is something you can compute, and we'll simplify it in a moment. And, by the way, for Massart's lemma — let's just go back real quick; I think we had this in the beginning — you have to check how large the vectors are. This matters: if all the vectors are super big, then your complexity will be big, and if all the vectors are extremely small, then your complexity will be small. So let's check what the value of M is here. The value M is the bound on the normalized two-norm of the vectors, so basically we need to check how large (1/√n)‖u_i − u_{i−1}‖₂ can be.
[00:42:11] So if you try to upper bound this — this is at most... let's just try a triangle inequality — wait, sorry, you cannot do a naive triangle inequality; that would defeat the purpose. So what I'm going to do is a slightly more careful triangle inequality, because you want to say u_i and u_{i−1} are close, but u_i and u_{i−1} themselves could each be big. If you look at the picture, u₁ and u₂ as vectors are probably big, but their differences get smaller and smaller as i gets bigger. And how do you control that? I think there's actually an easy way: you just rewrite this difference as (u_i − v) + (v − u_{i−1}), because you can always compare with v — that's something you know — and then use the triangle inequality.
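Written out, splitting through v and using the cover property at levels i and i − 1 (namely (1/√n)‖u_i − v‖₂ ≤ ε_i and (1/√n)‖u_{i−1} − v‖₂ ≤ ε_{i−1} = 2ε_i):

```latex
\frac{1}{\sqrt{n}}\|u_i - u_{i-1}\|_2
\;\le\; \frac{1}{\sqrt{n}}\|u_i - v\|_2 + \frac{1}{\sqrt{n}}\|v - u_{i-1}\|_2
\;\le\; \epsilon_i + \epsilon_{i-1}
\;=\; 3\epsilon_i .
```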
[00:43:11] Both u_i − v and u_{i−1} − v are somewhat small. And how small are they? You know that the first term, (1/√n)‖u_i − v‖₂, is less than ε_i, and the second term is less than ε_{i−1}; this is just by the definition of the ε-cover. And ε_i is 2^(−i) ε₀, so ε_i is smaller than ε_{i−1} by a factor of two, and therefore the whole thing is at most 3ε_i, just because ε_{i−1} is two times bigger than ε_i. [00:43:57] Okay, so with all of this preparation we can apply Massart's lemma. What you get is that the expected sup is less than the square root of 2 times M² times the log of the size of the set, all over n — and the M here is 3ε_i, so M² is (3ε_i)².
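That is, applying Massart's lemma to the finite set of differences {u_i − u_{i−1} : u_i ∈ C_i, u_{i−1} ∈ C_{i−1}}, whose normalized norms are at most M = 3ε_i, gives

```latex
\mathbb{E}_\sigma\Bigl[\sup_{u_i \in C_i,\; u_{i-1} \in C_{i-1}} \frac{1}{n}\langle \sigma,\, u_i - u_{i-1}\rangle\Bigr]
\;\le\; \sqrt{\frac{2\,(3\epsilon_i)^2 \,\log\bigl(|C_i|\,|C_{i-1}|\bigr)}{n}} .
```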
[00:44:38] The size of the set is |C_i| times |C_{i−1}|, so the bound is √(2·(3ε_i)²·(log|C_i| + log|C_{i−1}|)/n). So let's try to simplify this a little bit: you get 3ε_i outside, over √n, and you have √(2(log|C_i| + log|C_{i−1}|)). And then you say that this is less than — so |C_i| is always bigger than |C_{i−1}|, because C_i is a more fine-grained ε-cover, a finer discretization than C_{i−1}, and if you have a finer grain you should have more points; this is just by definition. So you just bound |C_{i−1}| by |C_i|, and you get 6ε_i/√n times √(log|C_i|), because we just replaced that term by log|C_i|. Okay — the constant doesn't really matter that much anyway. [00:45:50] All right, so now let's see what we have achieved: we have bounded each of these terms, and let's go back to the formula and plug it in.
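Before plugging in, the Massart step itself can be sanity-checked by Monte Carlo on an arbitrary finite set of vectors (my own illustration; the set here is random, not a cover): the empirical average of the sup should sit below M·√(2 log|A| / n).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 30                        # dimension n and |A| = k
A = rng.normal(size=(k, n))          # an arbitrary finite set of vectors
M = np.max(np.linalg.norm(A, axis=1)) / np.sqrt(n)   # M = max_a (1/sqrt(n))||a||_2

# Monte Carlo estimate of E_sigma sup_{a in A} (1/n) <sigma, a>
trials = 5000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))    # Rademacher signs
empirical = np.mean(np.max(sigma @ A.T, axis=1)) / n

# Massart's lemma: the expectation is at most M * sqrt(2 log|A| / n)
massart_bound = M * np.sqrt(2 * np.log(k) / n)
assert empirical <= massart_bound
```

The slack between `empirical` and `massart_bound` is typical: Massart's lemma is a worst-case bound over all sets of a given size and norm.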
[00:45:57] So what we got is that E_σ sup_{v∈Q} (1/n)⟨σ, v⟩ — this is our target — is less than the sum over i from 1 to infinity of 6ε_i/√n · √(log|C_i|). So this is still not really an integral, right? So how do you turn this into an integral? Well, this already has a little bit of the flavor of the integral — you have a lot of terms, in some sense. So how do we see this? Maybe let me first just write down the final formula we want to achieve. Recall that it is something like 12 times 1/√n times the integral of √(log N(ε, F, L₂(P_n))) dε — this is the final formula we want to achieve.
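One can also check numerically that a dyadic sum of this shape is dominated by the entropy integral. The sketch below is my own illustration: the model covering number log N(ε) = d·log(ε₀/ε) is an assumption chosen so that N(ε₀) = 1, not something from the lecture.

```python
import numpy as np

n, d = 100, 5        # hypothetical sample size and "dimension" of the class
eps0 = 1.0

def sqrt_log_N(eps):
    """Assumed model covering number: log N(eps) = d * log(eps0/eps),
    so that N(eps0) = 1 and N(eps) -> infinity as eps -> 0."""
    return np.sqrt(d * np.log(eps0 / eps))

# chaining sum: sum_{i>=1} 6 eps_i sqrt(log|C_i|) / sqrt(n), with eps_i = 2^{-i} eps0
eps_i = eps0 / 2.0 ** np.arange(1, 40)
chain_sum = np.sum(6 * eps_i * sqrt_log_N(eps_i)) / np.sqrt(n)

# entropy integral: (12 / sqrt(n)) * int_0^{eps0} sqrt(log N(eps)) d(eps), trapezoid rule
grid = np.linspace(1e-9, eps0, 200_001)
vals = sqrt_log_N(grid)
entropy_integral = 12 * np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)) / np.sqrt(n)

assert chain_sum <= entropy_integral   # the dyadic sum is dominated by the integral
```

This is exactly the rectangles-under-the-curve comparison the lecture is about to draw: each term of the sum is a constant multiple of a rectangle that fits under the √(log N) curve.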
this is [00:46:56] I have L2 p and the episode and this is the final formula we want to achieve [00:46:59] the final formula we want to achieve and by the way in some sense actually [00:47:01] and by the way in some sense actually you don't really have to get this [00:47:02] you don't really have to get this integration if you [00:47:05] integration if you if you don't if you just care about you [00:47:06] if you don't if you just care about you know applying this to some cases because [00:47:08] know applying this to some cases because this is enough for you to apply it uh [00:47:10] this is enough for you to apply it uh it's just like this integration looks so [00:47:12] it's just like this integration looks so nice and it's kind of like a it's a it's [00:47:14] nice and it's kind of like a it's a it's a good interface you know in in a [00:47:17] a good interface you know in in a mathematical sense but then so so how do [00:47:19] mathematical sense but then so so how do you see these two are almost the same [00:47:22] you see these two are almost the same um the way I see it is the following so [00:47:24] um the way I see it is the following so if you think about [00:47:26] if you think about what this integration is right so they [00:47:28] what this integration is right so they have up some of this [00:47:30] have up some of this on this Dimension and let's plot the [00:47:34] on this Dimension and let's plot the the carbon number the cover number will [00:47:36] the carbon number the cover number will be [00:47:37] be the log this is the law of carbon number [00:47:39] the log this is the law of carbon number log maybe let's say square root [00:47:42] log maybe let's say square root square root log [00:47:43] square root log and Epsilon f l2p [00:47:48] so you plug this and at some point this [00:47:51] so you plug this and at some point this this carbon number will be one and so [00:47:53] this carbon number will be one and so the log of the carbon number 
[00:47:51] So you plot this, and at some point the covering number will be one, and so the log of the covering number will be zero; this is just because when the radius is big enough, you can just use one point to cover everything, so the log covering number can be zero. In particular, in our notation, when the radius is ε₀, your covering number becomes one and the log covering number becomes zero, so the square root of that is also zero. And this covering number will go to infinity eventually as ε goes to zero, because you need more and more points in the covers as you have more and more fine-grained covers. So you have this sequence of points: for example, ε₁ is here, which is half of ε₀. But let's look at ε_i — so suppose this is ε_i.
[00:48:57] If this is ε_i, then half of it will be ε_{i+1}, by definition — so this is ε_{i+1}. And what is this value? This is the corresponding covering number: √(log|C_i|) — that's it in our notation. [00:49:19] And now let's compare these two quantities — this quantity and this quantity; this is what we are trying to link. The quantity below is just the area under this curve; that's the definition of the integral — okay, I guess I'm ignoring the 1/√n and the 12, which are easy. So this integral is just the area under the curve. And now, what is this sum?
[00:50:03] uh sorry it's another color this is a rectangle my back so the area of this [00:50:06] rectangle my back so the area of this rectangle then is uh the the error the [00:50:11] rectangle then is uh the the error the mass of the is this is Epsilon I minus [00:50:13] mass of the is this is Epsilon I minus Epsilon I plus one oh times the height [00:50:17] Epsilon I plus one oh times the height which is square root log CI [00:50:20] which is square root log CI and Epsilon I and Epsilon I plus 1 are [00:50:23] and Epsilon I and Epsilon I plus 1 are just the [00:50:24] just the this is just the [00:50:26] this is just the let me see what's the best I think this [00:50:28] let me see what's the best I think this is Epsilon I over 2 times [00:50:30] is Epsilon I over 2 times log of CI [00:50:34] and this is [00:50:36] and this is just the multiple of this term [00:50:39] just the multiple of this term right so so basically the final sum is [00:50:41] right so so basically the final sum is sometimes just dealing with all of these [00:50:43] sometimes just dealing with all of these rectangles and the integral is doing all [00:50:45] rectangles and the integral is doing all everything [00:50:46] everything so so that's why the sum of the [00:50:48] so so that's why the sum of the rectangles will be smaller than the [00:50:50] rectangles will be smaller than the integrals up to a constant Vector so [00:50:52] integrals up to a constant Vector so basically what you know is that you know [00:50:54] basically what you know is that you know Epsilon I over 2 [00:50:55] Epsilon I over 2 square root log c i because this is the [00:50:58] square root log c i because this is the this area [00:51:00] this area is less than the integral [00:51:03] is less than the integral of this part right if this is less than [00:51:05] of this part right if this is less than the integral of this part it's less than [00:51:07] the integral of this part it's less than integral from [00:51:08] 
integral from Epsilon I plus 1 to Epsilon I [00:51:12] Epsilon I plus 1 to Epsilon I and square root log and Epsilon [00:51:16] and square root log and Epsilon about 2PM the Absol [00:51:20] about 2PM the Absol okay so and with this we we can just [00:51:23] okay so and with this we we can just take sum over all eyes so it's [00:51:26] take sum over all eyes so it's so you have some of Epsilon I over 2 [00:51:28] so you have some of Epsilon I over 2 square root log c i [00:51:31] square root log c i it's less than [00:51:33] it's less than sum over I from [00:51:36] sum over I from one to infinity infinity [00:51:48] right [00:51:50] right uh sorry this is not right so and [00:51:54] uh sorry this is not right so and and now you can see that this you know [00:51:57] and now you can see that this you know each of these integral has the matching [00:51:58] each of these integral has the matching upper bound lower bound so you got this [00:52:00] upper bound lower bound so you got this is from zero to absolute zero [00:52:04] is from zero to absolute zero square root log and Epsilon [00:52:08] square root log and Epsilon 2 P.M [00:52:10] 2 P.M the abso [00:52:11] the abso and the upper bond is still not [00:52:14] and the upper bond is still not infinitive but that doesn't really [00:52:15] infinitive but that doesn't really matter because this really literally [00:52:17] matter because this really literally just equals to [00:52:18] just equals to you can extend it to Infinity because [00:52:20] you can extend it to Infinity because everything Beyond Epsilon zero bigger [00:52:23] everything Beyond Epsilon zero bigger than Epsilon zero will be [00:52:26] than Epsilon zero will be zero so [00:52:29] zero so that's what we have [00:52:35] okay so now you just if you just [00:52:38] okay so now you just if you just multiply [00:52:39] multiply this is the essential thing right so [00:52:41] this is the essential thing right so using the with this inequality you just 
[00:52:44] using the with this inequality you just link these two quantities you know so I [00:52:46] link these two quantities you know so I think you just have to work out the [00:52:47] think you just have to work out the constantly I think there's a constant [00:52:49] constantly I think there's a constant two there so that's why it gets from 6 [00:52:51] two there so that's why it gets from 6 to 12. so with this you get expectation [00:52:59] and this is actually the rather Market [00:53:01] and this is actually the rather Market complexity of f is equals to this [00:53:04] complexity of f is equals to this is less than [00:53:05] is less than basically what six Epsilon I over [00:53:07] basically what six Epsilon I over squared and [00:53:09] squared and square root log c i and this is less [00:53:13] square root log c i and this is less than 12 times this integral [00:53:27] Okay so [00:53:32] any questions [00:53:36] okay great so [00:53:40] okay so and and I think from this figure [00:53:42] okay so and and I think from this figure you can also kind of see that you know [00:53:43] you can also kind of see that you know in some sense the essence here is that [00:53:45] in some sense the essence here is that how fast [00:53:46] how fast Epsilon goes to Infinity that's what's [00:53:49] Epsilon goes to Infinity that's what's important here right because if Epsilon [00:53:50] important here right because if Epsilon goes to Infinity very fast then your [00:53:52] goes to Infinity very fast then your integration probably could be even [00:53:53] integration probably could be even Infinity so then you don't have any bot [00:53:56] Infinity so then you don't have any bot and if your uh if this thing goes to [00:53:59] and if your uh if this thing goes to Infinity [00:54:00] Infinity like here slower than you get available [00:54:21] yeah so the question is like oh I chose [00:54:24] yeah so the question is like oh I chose this uh level like a fact of two right 
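The rectangle picture can be sanity-checked numerically. The following is a sketch I'm adding, not from the lecture: the covering-number model log N(ε) = R·log(1/ε) and the values of R and ε₀ are illustrative assumptions.

```python
import math

# Illustrative (assumed) covering-number model: log N(eps) = R * log(1/eps)
# for eps < eps0, and log N(eps) = 0 once one ball covers everything.
R, eps0 = 5.0, 1.0

def sqrt_log_N(eps):
    return math.sqrt(R * math.log(1.0 / eps)) if eps < eps0 else 0.0

# Sum of rectangle areas: levels eps_i = 2^{-i} * eps0, width eps_i / 2,
# height sqrt(log N(eps_i)); the curve is decreasing, so each rectangle
# sits under it.
rect_sum = sum(eps0 * 2.0**-i / 2 * sqrt_log_N(eps0 * 2.0**-i) for i in range(60))

# Midpoint Riemann sum for the integral of sqrt(log N(eps)) over (0, eps0].
m = 200_000
integral = sum(sqrt_log_N((k + 0.5) * eps0 / m) for k in range(m)) * eps0 / m

assert 0 < rect_sum <= integral  # the rectangles never exceed the integral
```

With these numbers the sum comes out noticeably below the integral, consistent with the constant-factor slack in the lecture's bound.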
[00:54:21] Yeah, so the question is about the factor of two I chose for the levels, ε_j = 2^{−j}·ε₀: what if I changed that two to a three or something like that? I never tried that myself, but I think very likely you would just get a similar constant, maybe better than 12, maybe worse than 12; anyway, this constant is not that important for us. I think it's very unlikely you can gain anything more than a constant.

[00:54:57] Okay, so now let's try to interpret this theorem a little bit more, because in some sense this form is kind of hard to use, right? If I've got a log covering number bound, okay, what is the intended use of this theorem? The way to use it is that you get some log covering number bound, then you do this integral, and you get the Rademacher complexity bound. [00:55:23] But it seems hard to use, because before you do it you don't know how this translation works. Actually, the translation from covering number to Rademacher complexity is relatively simple, as I will show; you'll see that you don't even really need to carry out the integral. I never compute this integral myself, in some sense, not even once. [00:55:47] Okay, so here's how it works.

[00:55:58] Basically the question is: when is this thing finite? And when it is finite, what is the dependency on the parameters, and so forth. [00:56:11] So, when is it finite? [00:56:14] I think there are several cases; let's do a little case study. Of course it depends on what the log covering number is, so I have a few cases here.

(a) [00:56:28] Suppose the covering number is "exponential" in the sense that 1/ε is in the base: it's of the form (1/ε)^R, where R is just a placeholder variable. Then you can do this computation, and you get [00:57:02] something like O( (1/√n) ∫ √(R·log(1/ε)) dε ): you just take the log of the covering number and the square root, and you will see that the √(log(1/ε)) integrates to some constant.

[00:57:28] Oh, by the way, I forgot to mention a small thing: I don't want to always integrate from 0 to infinity, because sometimes that's actually annoying. [00:57:46] So let's assume that the f's are bounded, say between −1 and 1, so that this integral only has to be done up to a constant: ε₀ is then, let's say, 1, or some constant, and we only have to integrate between 0 and 1. [00:58:19] This is just because beyond that point the log covering number becomes zero.

[00:58:27] Now let's integrate. Going back to this, we integrate √(log(1/ε)) between 0 and 1, and you'll see that this √(log(1/ε)) actually integrates to exactly a constant. So this will be, maybe let me write it in big-O notation, just on the order of √(R/n): the ε-part integrates to a constant, so the dependency on ε is gone. [00:58:57] Okay, so that's good; you get this bound.
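As a numeric sanity check, which I'm adding here (it is not in the lecture): the ε-integral in case (a) really is a universal constant, in fact ∫₀¹ √(log(1/ε)) dε = √π/2 ≈ 0.886, so the whole bound scales as √(R/n).

```python
import math

# Midpoint Riemann sum for int_0^1 sqrt(log(1/eps)) d eps; substituting
# eps = e^{-t} turns it into Gamma(3/2) = sqrt(pi)/2.
m = 400_000
const = sum(math.sqrt(math.log(m / (k + 0.5))) for k in range(m)) / m
assert abs(const - math.sqrt(math.pi) / 2) < 1e-3

# For log N(eps) = R * log(1/eps), the Dudley integral is sqrt(R) * const,
# so the Rademacher bound is (12 * const) * sqrt(R / n): no eps left anywhere.
n, R = 10_000, 7.0  # illustrative values
bound = 12 * const * math.sqrt(R / n)
assert bound < 13 * math.sqrt(R / n)
```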
[00:59:04] Now let's look at another case. Case (a) is a case where the dependence on ε is very, very mild, because it's only logarithmic; that's why it's so mild. But sometimes you don't get that.

(b) [00:59:15] Suppose N(ε, F, L₂(Pₙ)) is of the form a^{R/ε}, so now the ε is in the exponent, and it's 1/ε that appears in the exponent. In this case, if you look at this quantity, (1/√n) times the integral of the square root of the log covering number, [00:59:49] it will be O( (1/√n) ∫ √((R/ε)·log a) dε ). [01:00:06] And this is still fine: the ε-dependence is √(1/ε), and this integrates to a constant between 0 and 1, a universal constant; I guess we don't care about the constant, so it's some constant. [01:00:26] So basically, if you ignore the log factor, this is again of the form Õ(√(R/n)). That's still good.

(c) [01:00:38] And now comes the trickiest thing, which is kind of on the boundary between what we can do and what we cannot do. Suppose this is of the form a^{R/ε²}, so now I have an even worse dependency on ε: it's in the exponent, and it's 1/ε² there, so it goes to infinity faster as ε goes to zero. [01:01:05] In this case things become a little bit tricky; and actually this is the most common case. If you really do the work (I don't really expect you to prove generalization bounds yourself), in many of the cases you get this kind of covering number. [01:01:22] And this is actually tricky, because if you integrate: [01:01:35] you take the log of this and take the square root, and what you get is √R·(1/ε)·√(log a). [01:01:55] So the bound is √R·√(log a) times (1/√n) times the integral of (1/ε) dε, and this is actually infinity. [01:02:08] How do you see this? Well, 1/ε integrates to a log, and the log at ε = 0 is infinite: the integrand goes to infinity too fast at zero, so it integrates to infinity. [01:02:23] So this is actually not good news for us.
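A quick numeric illustration, which I'm adding (not from the lecture), of why case (b) is fine while case (c) blows up: ∫₀¹ ε^{−1/2} dε is the finite constant 2, whereas cutting the 1/ε integrand off at α gives log(1/α), finite for every positive α but divergent, if only logarithmically, as α goes to zero.

```python
import math

m = 400_000

# Case (b): midpoint Riemann sum of eps^{-1/2} over (0, 1]; the exact value
# is 2, so the singularity at 0 is integrable and the bound stays finite.
case_b = sum((m / (k + 0.5)) ** 0.5 for k in range(m)) / m
assert abs(case_b - 2.0) < 1e-2

# Case (c): the integrand behaves like 1/eps; cutting off at alpha gives
# log(1/alpha), which blows up (logarithmically) as alpha -> 0.
def tail(alpha, steps=200_000):
    w = (1.0 - alpha) / steps  # midpoint Riemann sum over [alpha, 1]
    return sum(1.0 / (alpha + (k + 0.5) * w) for k in range(steps)) * w

for alpha in (1e-1, 1e-2, 1e-3):
    assert abs(tail(alpha) - math.log(1.0 / alpha)) < 1e-2
```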
[01:02:28] So how do we fix this? Actually, this can be fixed, [01:02:36] by an improved version of Dudley's theorem. This improved version, in some sense (I'm not going to prove it), is actually almost what you would expect. [01:02:53] Basically, the idea is that you don't do the discretization all the way down to zero; you only do it down to a certain level α, at which you can afford to pay the worst-case bound. [01:03:06] So you bound it by this. [01:03:14] (Actually, I'm not sure whether there's a factor of two here, but let me keep the two anyway, for safety; the constant is not very important.) [01:03:25] So when you do the integration, you are not integrating from zero to infinity; you integrate from α to infinity, and below α you just pay this α bound. [01:03:34] In some sense you can see that this is an interpolation of the two bounds we had. Recall that one bound was the brute-force discretization, where we pay this ε additively, [01:03:49] just because we have a worst-case bound for the ε-error; and in the other case, this integration, we don't pay anything in the worst case. This is basically saying that you do the nested, iterative discretization down to α, and then you pay a small error α at the end. [01:04:11] And why is this useful? Because it avoids the tricky regime where you are very, very close to zero. [01:04:22] I think this theorem you can probably prove yourself, so I'm not going to go through the proof. If you use it, you can take α to be something like 1/poly(n), something super, super small, so that the 4α term is negligible, [01:04:48] and on the right-hand side you no longer integrate down to zero, so the integral doesn't blow up.
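Collecting the pieces on the board into one formula, the improved statement reads roughly as follows; this is my transcription, with the 4α worst-case term mentioned above and the integral now starting at α:

```latex
R_S(\mathcal{F}) \;\le\; \inf_{\alpha \ge 0} \left( 4\alpha \;+\; \frac{12}{\sqrt{n}} \int_{\alpha}^{\infty} \sqrt{\log N\bigl(\varepsilon,\, \mathcal{F},\, L_2(P_n)\bigr)}\; d\varepsilon \right)
```

Taking α = 0 recovers the earlier statement, and the freedom to choose α is exactly what the case analysis that follows exploits.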
[01:04:55] So basically 4α is negligible; one question is what the integral looks like. [01:05:02] It's something like an inverse-poly term, which is negligible, plus √R·√(log a)·(1/√n) times the integral from α to 1 of (1/ε) dε. [01:05:17] And fortunately, even though this integral goes to infinity as α goes to zero, it depends on α very, very weakly. [01:05:34] (I'm not deriving this in detail; you know what my notation means, right? I think different calculus books use different notations for this, so sometimes I get confused.) [01:05:52] Now, this thing is really just log 1, which is zero, minus log α, so you get log(1/α); and since α is inverse-poly in n, this is logarithmic, so this is log n up to constants. [01:06:21] So eventually this is still Õ(√(R/n)), if you hide all the logarithmic factors.

[01:06:33] Okay, so in summary: a covering number of the form (1/ε)^R, or a^{R/ε}, or a^{R/ε²}, up to Õ, leads to a Rademacher complexity of this form, Õ(√(R/n)). [01:07:04] And these are pretty much the only cases I know of that lead to this.
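Here is a small numeric sketch of the point about the cutoff, which I'm adding (illustrative numbers; the √(R·log a) prefactor common to both terms is dropped): with α inverse-polynomial in n, the ε² case only pays a log, while the ε³ case discussed next pays a polynomial.

```python
import math

n = 10**6
alpha = 1.0 / n  # the inverse-poly cutoff suggested in the lecture

# eps^2-type integrand (~ 1/eps): the integral term is log(1/alpha) = log n,
# so the overall bound stays ~ log(n) / sqrt(n): a "free lunch".
bound_sq = 4 * alpha + math.log(1.0 / alpha) / math.sqrt(n)
assert bound_sq < 10 * math.log(n) / math.sqrt(n)

# eps^3-type integrand (~ eps^{-1.5}): the integral term is 2 (alpha^{-1/2} - 1),
# so the same cutoff pays a constant and the bound is useless.
bound_cube = 4 * alpha + 2 * (alpha ** -0.5 - 1) / math.sqrt(n)
assert bound_cube > 1.0

# Even the best trade-off for the eps^3 case only reaches ~ n^{-1/3},
# strictly worse than the ~ n^{-1/2} rate above.
best_cube = min(4 * a + 2 * (a ** -0.5 - 1) / math.sqrt(n)
                for a in (10 ** (-0.01 * j) for j in range(1, 600)))
assert best_cube > 3 * bound_sq
```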
[01:07:11] For example, suppose hypothetically your covering number is something like a^{R/ε³}. I think this will break, because if this is ε³, then the integrand here is going to be 1/ε^{1.5}. [01:07:34] Let's do a quick calculation: suppose the integrand is 1/ε^{1.5}; of course you still integrate from α, to try to avoid the blow-up, but it wouldn't be as effective, because the antiderivative of 1/ε^{1.5} is, I think, 1/√ε, [01:07:58] and evaluating it gives something like 1/√α minus one (yeah, there's a minus there, and maybe a factor of a half, something like that). [01:08:23] Anyway, let's ignore the constants, I don't know the exact constant; the problem is that this is not log(1/α), it's 1/√α. So now you cannot take α to be inverse-poly in n, because if it's inverse-poly you pay too much here. [01:08:35] Then it becomes a very tricky balance between this 4α term and this term, and at least I'm not aware of any cases where you can balance them in a nice way and still get a good bound; I think it's probably not even possible. [01:08:55] On the other hand, for the case where you have log(1/α) here, the balance is trivial; it's kind of a free lunch, you pay only log factors, so it's almost always possible. [01:09:09] So that's the difference. [01:09:15] And actually, most typically you're going to get a covering number of this ε² form; that's the most typical case.

[01:09:26] Okay, any questions? [01:09:32] Okay. So now I have 15 minutes left today. The rough plan for these 15 minutes and the next lecture is that we are going to talk about covering number bounds for linear models and deep nets, and those will imply Rademacher complexity bounds. I think today I'm going to talk about linear models, but for linear models I'm not going to give you
[01:09:56] That's because I think the proof is a little bit too technical, and in most cases you wouldn't need to prove it yourself — you just have to invoke it. So basically I'm just going to state some theorems and tell you that for linear models this is pretty much all done: we know essentially everything, and there are pretty much matching upper and lower bounds. [01:10:28] This is actually from a paper by Tong Zhang in — sorry — 2002.

So, for linear models. [01:10:52] Suppose x₁, …, xₙ ∈ ℝᵈ are n data points, and (p, q) is a so-called conjugate pair — I hope you have seen this kind of thing in Hölder's inequality — meaning 1/p + 1/q = 1. We also assume p is at least 2 and less than infinity, but in most places you can just think of p and q both being 2; that's the most important case. [01:11:29] Assume the p-norm of the data is bounded, ‖xᵢ‖_p ≤ C for every i, and consider the hypothesis class indexed by q,

F_q = { x ↦ ⟨w, x⟩ : ‖w‖_q ≤ B },

the family of linear models where the norm of the linear model is bounded by B. Recall that we have talked about these kinds of models before, with p = 2 and q = 2, or maybe p = 1 and q = ∞, that kind of thing. Back then we proved the Rademacher complexity bound directly; now we state the covering number bound, which will also give back a Rademacher complexity bound. And here ρ = L₂(Pₙ), the same metric as we defined before. [01:12:24] Then the log covering number satisfies

log₂ N(ε, F_q, ρ) ≤ ⌈B²C²/ε²⌉ · log₂(2d + 1).

The ceiling doesn't matter — it's just there to deal with corner cases where B²C²/ε² is close to zero or something like that. [01:13:03] And when p = 2 and q = 2, you can strengthen this slightly: you still get log N(ε, F₂, ρ) ≤ (B²C²/ε²) times a log factor — the base of the log isn't that important, since changing it only changes the constant; I keep it here just for the sake of preciseness — but the log₂(2d + 1) can be improved to something that depends on n rather than on d. At least for our purposes that doesn't matter much; it matters only if you care about a bound that absolutely doesn't depend on d. [01:14:14] Okay. And the way to remember this is that it gives back the same Rademacher complexity. This bound is of the form log N(ε, F, ρ) ≤ R/ε² with R = B²C², so by the conversion we did above, the Rademacher complexity is at most √(R/n), i.e. at most

BC/√n

up to logarithmic factors. And this is the same as what we had before: B was the norm bound on the classifier and C was the norm bound on the data.
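This conversion is easy to sanity-check numerically. Below is a minimal sketch (the sizes n, d and the bounds B, C are arbitrary choices, not from the lecture), using the fact that for the ℓ₂ ball the supremum inside the empirical Rademacher complexity has a closed form: sup over ‖w‖₂ ≤ B of ⟨w, v⟩ equals B·‖v‖₂.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
B, C = 2.0, 1.0

# Data points with ||x_i||_2 = C (the boundary case of the assumption).
X = rng.standard_normal((n, d))
X = C * X / np.linalg.norm(X, axis=1, keepdims=True)

# For F = {x -> <w, x> : ||w||_2 <= B}, the sup over w in the Rademacher
# complexity is attained in closed form: sup_{||w||<=B} <w, v> = B * ||v||_2.
sigmas = rng.choice([-1.0, 1.0], size=(2000, n))       # Rademacher sign vectors
sups = B * np.linalg.norm(sigmas @ X, axis=1) / n       # (B/n) * ||sum_i s_i x_i||_2
rademacher_est = sups.mean()                            # Monte-Carlo E_sigma[...]

bound = B * C / np.sqrt(n)                              # the BC / sqrt(n) bound
print(rademacher_est, bound)
assert 0.0 < rademacher_est <= bound
```

The estimate sits just below BC/√n (the gap is the Jensen gap between E‖v‖₂ and √(E‖v‖₂²)), which matches the claim that the covering number bound recovers the same BC/√n rate we derived directly before.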
[01:15:19] So you get their product divided by √n. There are some small differences in terms of the logarithmic factors, which let's ignore just for simplicity.

[01:15:33] Okay. You can also show this for multivariate linear models — sorry, multivariate linear functions. I'm showing this because it will be useful as a building block later: when you have neural networks, a multivariate linear model is the building block for one layer of the network. In some sense there's nothing really intelligent here, but I have to state it so that I can use it later. [01:16:15] First, a definition. Suppose W is a matrix, say of dimension m × n. Let's define the (2,1)-norm — this is not the operator norm, just some other norm —

‖W‖_{2,1} = the sum of the 2-norms of the columns of W.

So basically you first take the 2-norm of each column, and then you take the 1-norm to group them. [01:17:08] In this notation, if I transpose W inside the norm, ‖Wᵀ‖_{2,1} is then the sum of the 2-norms of the rows of W. This is just a definition, and we're going to use it in the statement. [01:17:27] So here is the theorem. Here I'm not going to do general p and q, just for simplicity — we do the 2-norm version, so p and q are both 2. Consider the multivariate function class with multiple outputs,

F = { x ↦ Wx : ‖Wᵀ‖_{2,1} ≤ B },

where W is of dimension m × d and we constrain the (2,1)-norm of Wᵀ to be at most B. And again let C bound the norms of the data, ‖xᵢ‖₂ ≤ C. [01:18:25] Then you get

log N(ε, F, L₂(Pₙ)) ≤ (C²B²/ε²) · log(2dm)

(I'm not tracking the exact constant inside the log). So it's kind of the same thing: the norm of the parameter times the norm of the data, over ε², where the norm of the parameter is measured by this (2,1)-norm of Wᵀ — sorry, I think I had a typo here: as I said, the (2,1)-norm of Wᵀ is the sum of the 2-norms of the rows of W. [01:19:20] So in some sense there's nothing surprising here; you just treat all output dimensions independently. For example, suppose you write W in rows as

W = [w₁ᵀ; …; w_mᵀ],

where the wᵢᵀ are row vectors; then Wx is really just (w₁ᵀx, …, w_mᵀx). So you can view this linear layer as m different one-dimensional linear functions, and the (2,1)-norm of Wᵀ is just the sum of the ‖wᵢ‖₂. In some sense you take the sum of the complexities across dimensions: ‖wᵢ‖₂ is the complexity measure of the linear function wᵢᵀx, and you take the sum. The proof is really just that — there's nothing to it. [01:20:31] I think I have five minutes; let me also mention another thing, which is useful preparation for deep nets. This is also related to how we deal with bounding the log covering number: there is a Lipschitz composition lemma, which is a useful tool for dealing with covering numbers.
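As a quick illustration of the definition (the helper name `norm_2_1` is mine, not the lecture's notation): the (2,1)-norm sums the column 2-norms, so applying it to Wᵀ sums the row 2-norms of W, which is exactly the Σᵢ ‖wᵢ‖₂ quantity used in the theorem.

```python
import numpy as np

def norm_2_1(W: np.ndarray) -> float:
    """(2,1)-norm: 2-norm of each column, then sum (i.e. 1-norm) over columns."""
    return float(np.linalg.norm(W, axis=0).sum())

W = np.array([[3.0, 0.0],
              [4.0, 2.0]])

# Columns are (3, 4) and (0, 2), with 2-norms 5 and 2, so ||W||_{2,1} = 7.
assert norm_2_1(W) == 7.0

# The theorem's quantity ||W^T||_{2,1} is the sum of the 2-norms of the rows of W.
row_norm_sum = sum(np.linalg.norm(r) for r in W)
assert np.isclose(norm_2_1(W.T), row_norm_sum)
```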
[01:20:53] Recall that we had Talagrand's lemma for Rademacher complexity, which said something like: the Rademacher complexity of φ composed with H is at most the Lipschitz constant of φ times the Rademacher complexity of H,

R_S(φ ∘ H) ≤ κ · R_S(H).

So that was the Lipschitz composition result for Rademacher complexity, and I didn't prove it for you — I just said it's a fact, it's a theorem — and actually proving it doesn't sound easy; as I mentioned, it's sometimes pretty complicated, I think a challenging theorem to prove. But for log covering numbers, the Lipschitz composition becomes trivial. [01:21:44] I think the fundamental intuition is the same in spirit; it's just that for covering numbers it somehow becomes super intuitive and explicit. So let me state the lemma — it's almost a trivial thing. [01:21:59] Suppose φ is κ-Lipschitz, and let ρ be the output L₂(Pₙ) metric as before. (I realize I've messed up the order of these arguments relative to my notes at some point; I'll have to fix that later.) Then the log covering number of the composed function class φ ∘ F satisfies

log N(ε, φ ∘ F, ρ) ≤ log N(ε/κ, F, ρ).

So it's at most the log covering number of the original class, but with a different radius, a different granularity: you basically have to cover the original class at granularity ε/κ so that you can turn that into an ε-cover of the composed class. [01:23:08] And this is pretty much trivial. Take an ε/κ-cover of F — suppose this is C. Then I claim φ ∘ C is an ε-cover of φ ∘ F. This is because for every φ ∘ f in this class, you can first find f′ in C such that ρ(f, f′) ≤ ε/κ — so f is covered in C — and then you just compose: I claim φ ∘ f′ is a neighbor of φ ∘ f. Because if you look at the distance between the two,

ρ(φ ∘ f′, φ ∘ f) = √( (1/n) Σᵢ ( φ(f′(zᵢ)) − φ(f(zᵢ)) )² ) ≤ √( (1/n) Σᵢ κ² ( f′(zᵢ) − f(zᵢ) )² ) = κ · ρ(f′, f),

using the Lipschitzness; and because f′ and f are ε/κ-close, this is at most κ · (ε/κ) = ε. So we're done. [01:24:58] Okay, I guess that's a good stopping point; we'll continue in the next lecture with deep nets. Cool — any questions? [01:25:16] (Student question.) Yeah — yes, that's right, the Lipschitzness... So far φ is a one-dimensional function: the model outputs a one-dimensional thing, and then φ is an ℝ → ℝ function, so there's no magic. But yes, if the output is a vector and your φ is a vector-to-vector function, then you have to make everything compatible: the Lipschitzness has to be with respect to a norm compatible with the metric. [01:26:11] (Student question.) Yes, yes... okay, sounds good. I guess I'll see you on Wednesday.
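The key inequality in that proof — composing with a κ-Lipschitz φ scales the empirical L₂(Pₙ) distance by at most κ — can be checked numerically. A minimal sketch (the sample size and the particular φ, a scaled tanh, are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
kappa = 3.0
phi = lambda t: kappa * np.tanh(t)   # kappa-Lipschitz, since tanh is 1-Lipschitz

# Represent two "functions" f, f' by their values on the sample z_1..z_n;
# the empirical metric is rho(f, f') = sqrt((1/n) * sum_i (f(z_i) - f'(z_i))^2).
f_vals = rng.standard_normal(n)
fprime_vals = f_vals + 0.1 * rng.standard_normal(n)

def rho(u, v):
    return np.sqrt(np.mean((u - v) ** 2))

# rho(phi∘f', phi∘f) <= kappa * rho(f', f): an (eps/kappa)-cover of F therefore
# pushes forward to an eps-cover of phi∘F, which is the whole lemma.
assert rho(phi(fprime_vals), phi(f_vals)) <= kappa * rho(fprime_vals, f_vals) + 1e-12
```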
================================================================================
LECTURE 010
================================================================================
Stanford CS229M - Lecture 10: Generalization bounds for deep nets
Source: https://www.youtube.com/watch?v=P5-VVI1qLxA
---
Transcript

[00:00:05] So last time we talked about covering numbers. The covering number is an upper bound on the Rademacher complexity, and our goal is to bound covering numbers, because this is a new tool for bounding Rademacher complexity. We discussed what the bounds are for linear models — I didn't show any of the proofs, but there are existing bounds, which are actually 20 years old — and we also talked about the Lipschitz composition lemma for covering numbers, which is much easier than the corresponding lemma for Rademacher complexity. So basically, if you know a function class has good covering number bounds and then you compose it with a Lipschitz function, then you still have reasonable covering number bounds. [00:00:56] So that's the general idea, and today we're going to talk about deep neural networks. We are going to use some of these tools, because you can see that a deep net is actually a composition of multiple linear models with some Lipschitz functions — the activations. So that's the goal of this lecture. Let me set up the — actually, sorry, give me one moment; I have to change my mask, because this one keeps fogging up. [00:02:02] Okay, let's continue. So we have a neural network. The setup is that we have some network, call it h_θ, where θ is used to denote the set of parameters, and we have r layers. The network looks like this: the last layer doesn't have any activation.
[00:02:26] And then you have some activation in each earlier layer — h_{r−1}, and so on down. So basically, if you read the formula in the order of matrix multiplication, you first multiply x with W₁, then pass through a nonlinearity σ, then multiply by W₂, and so on and so forth until you have all r layers:

h_θ(x) = W_r σ( W_{r−1} σ( ⋯ σ( W₁ x ) ⋯ ) ).

So this is an r-layer network, and the Wᵢ are the weights. [00:03:00] And the kind of bound we're going to talk about is the following theorem. Assume ‖xᵢ‖₂ ≤ C, and consider the family of networks h_θ with some norm control on the weights: we control the operator norm of the weights, ‖Wᵢ‖_op ≤ κᵢ, and we constrain the (2,1)-norm of the transpose, ‖Wᵢᵀ‖_{2,1} ≤ bᵢ. Then, if you constrain your function class like this, the Rademacher complexity is at most, up to a constant factor,

(C/√n) · ( ∏_{i=1}^{r} κᵢ ) · ( Σ_{i=1}^{r} (bᵢ/κᵢ)^{2/3} )^{3/2}.

This is a complex formula; let me explain it in a moment. [00:04:18] Alternatively — as a corollary, though I guess this is not necessarily that formal, because you have to say exactly what the statement is and include some failure probabilities — roughly speaking, you are saying that the generalization error is at most on the order of

(1/γ) · (1/√n) · C · ( ∏ᵢ κᵢ ) · ( Σᵢ (bᵢ/κᵢ)^{2/3} )^{3/2},

where γ is the margin. [00:05:24] Okay — anyway, so basically the important thing here is that the complexity depends on a few things. One is the operator norm of the weight matrices, and it depends on the operator norms as a product: the product of the operator norms of all the weights shows up in the complexity term. And also there's this other term, which you can basically think of as a polynomial in the κᵢ and the bᵢ, and which is not really important: as long as it's a polynomial, we're relatively happy with it, because the product of the operator norms will probably be the dominating term, and the polynomial in the bᵢ and κᵢ is comparatively small.
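To make the shape of the complexity term concrete, here is a small sketch that evaluates (C/√n) · ∏ᵢ κᵢ · (Σᵢ (bᵢ/κᵢ)^{2/3})^{3/2} from actual weight matrices, taking κᵢ = ‖Wᵢ‖_op and bᵢ = ‖Wᵢᵀ‖_{2,1}. The layer sizes, sample size n, and data bound C are arbitrary choices for illustration, and the constant factor is not tracked, matching the "up to constants" statement of the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)

def op_norm(W):
    """Operator (spectral) norm = largest singular value of W."""
    return np.linalg.norm(W, 2)

def norm_2_1_T(W):
    """||W^T||_{2,1} = sum of the 2-norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

# A toy 3-layer network's weight matrices (hypothetical sizes).
weights = [rng.standard_normal((8, 5)),
           rng.standard_normal((8, 8)),
           rng.standard_normal((1, 8))]
n, C = 1000, 1.0            # sample size and data-norm bound, chosen arbitrarily

kappas = [op_norm(W) for W in weights]      # kappa_i
bs = [norm_2_1_T(W) for W in weights]       # b_i

complexity = (C / np.sqrt(n)) * np.prod(kappas) \
    * sum((b / k) ** (2.0 / 3.0) for b, k in zip(bs, kappas)) ** 1.5
print(complexity)
```

Note how the product of the three operator norms dominates: doubling every κᵢ multiplies the bound by 8 through the product, while the polynomial term only changes mildly.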
basically this is the the [00:06:32] um and so so basically this is the the important term and this term [00:06:34] important term and this term um if you look at the bond it comes from [00:06:36] um if you look at the bond it comes from the lipsyness of the model so so copper [00:06:39] the lipsyness of the model so so copper I is the bound on the ellipselessness of [00:06:41] I is the bound on the ellipselessness of a single layer and the product of copper [00:06:44] a single layer and the product of copper I is the bound on ellipsislessness of [00:06:46] I is the bound on ellipsislessness of the product of all layers [00:06:48] the product of all layers so [00:06:50] so um uh so so so so so so so so so [00:06:53] um uh so so so so so so so so so without you know any details I think the [00:06:55] without you know any details I think the this term you know you can imagine this [00:06:58] this term you know you can imagine this comes from some ellipses composition [00:06:59] comes from some ellipses composition some use of Lupus composition level [00:07:05] we're going to see [00:07:08] we're going to see this this is assumption sorry [00:07:13] this is I [00:07:14] this is I thought oh you assume it's true for [00:07:16] thought oh you assume it's true for every eye [00:07:17] every eye I think this can be relaxed a little bit [00:07:18] I think this can be relaxed a little bit again it's not very important so you can [00:07:21] again it's not very important so you can maybe relax it to be the average of x i [00:07:22] maybe relax it to be the average of x i is less than C it's not super important [00:07:27] one oh right so that's a yeah sorry so [00:07:30] one oh right so that's a yeah sorry so the operator Norm is the [00:07:33] the operator Norm is the um I guess maybe I didn't so this is the [00:07:35] um I guess maybe I didn't so this is the also the spectral Norm also the largest [00:07:38] also the spectral Norm also the largest single so [00:07:40] single 
so this is the spectral Norm [00:07:44] oh the largest [00:07:47] oh the largest singular value [00:07:50] singular value if any of this makes sense to you [00:07:52] if any of this makes sense to you and if uh and also the formal definition [00:07:54] and if uh and also the formal definition is just that [00:07:57] is just that the max over X [00:08:04] foreign [00:08:08] objects because this is the if you think [00:08:12] objects because this is the if you think about WS operator then this is saying [00:08:13] about WS operator then this is saying that how does this operator change your [00:08:16] that how does this operator change your Norm right so if you give it a two Norm [00:08:18] Norm right so if you give it a two Norm Vector then how does it uh change the [00:08:21] Vector then how does it uh change the norm [00:08:22] norm um yeah so [00:08:26] cool so and you can kind of see that [00:08:28] cool so and you can kind of see that this is kind of like about Ellipsis this [00:08:31] this is kind of like about Ellipsis this is also maybe I should expand this a [00:08:33] is also maybe I should expand this a little bit so this is also about the [00:08:35] little bit so this is also about the luxiousness [00:08:37] foreign [00:08:39] foreign model WX right because if you carry the [00:08:42] model WX right because if you carry the lepsis you know what you have to verify [00:08:44] lepsis you know what you have to verify you have to verify that w x minus Wy [00:08:48] you have to verify that w x minus Wy is less than some constant times x minus [00:08:51] is less than some constant times x minus y [00:08:52] y and what that constant should be right [00:08:54] and what that constant should be right so [00:08:55] so um if you prove an inequality then [00:08:57] um if you prove an inequality then you're gonna get [00:08:58] you're gonna get the operator Norm [00:09:01] the operator Norm or spectral Norm of w there so that's [00:09:04] or spectral Norm of w there so 
that's why this is the corresponds to the [00:09:06] why this is the corresponds to the Ellipsis needs of a linear model [00:09:09] Ellipsis needs of a linear model foreign [00:09:18] Okay cool so by the way I have I haven't [00:09:22] Okay cool so by the way I have I haven't got any questions from Zoom for a long [00:09:23] got any questions from Zoom for a long time so you should feel free to ask [00:09:25] time so you should feel free to ask questions just um [00:09:27] questions just um you don't have to but of course so [00:09:30] you don't have to but of course so um feel free to animate yourself [00:09:33] um feel free to animate yourself um and [00:09:35] um and okay so how do we prove this so um [00:09:39] okay so how do we prove this so um so uh the fundamental idea yeah so in [00:09:43] so uh the fundamental idea yeah so in the next 30 minutes we're going to talk [00:09:44] the next 30 minutes we're going to talk about this proof [00:09:46] about this proof um the the fundamental idea is that you [00:09:51] um the the fundamental idea is that you um somewhat kind of cover your function [00:09:53] um somewhat kind of cover your function set iteratively [00:10:00] so cover this set of functions so I have [00:10:02] so cover this set of functions so I have iteratively [00:10:06] and iteratively means that you you cover [00:10:08] and iteratively means that you you cover more and more layers [00:10:10] more and more layers gradually [00:10:17] and [00:10:19] and um and and how do you do this [00:10:20] um and and how do you do this iteratively you have to use the [00:10:22] iteratively you have to use the ellipsislessness [00:10:23] ellipsislessness and sometimes the love just composition [00:10:26] and sometimes the love just composition the amount that we have discussed and [00:10:29] the amount that we have discussed and also you want to also control [00:10:33] and also controlling [00:10:36] and also controlling how the error propagates [00:10:45] right 
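As a quick numerical sketch of the two facts used above (not from the lecture — the diagonal matrices are made-up examples, chosen so the operator norm is simply the largest absolute diagonal entry), we can check that ||W||_op is a Lipschitz constant for x -> Wx and that composing two layers picks up the product of the per-layer constants:

```python
import math
import random

# Sketch (illustrative numbers only): for a diagonal W, ||W||_op is the
# largest |diagonal entry|, and x -> Wx is ||W||_op-Lipschitz.  Composing
# two such layers is Lipschitz with constant at most the product of the
# per-layer operator norms, mirroring the kappa_1 * ... * kappa_r bound.

def apply_diag(d, v):
    # multiply the diagonal matrix diag(d) by the vector v
    return [di * vi for di, vi in zip(d, v)]

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

d1 = [3.0, -1.0, 0.5]   # layer 1: kappa_1 = ||W1||_op = 3
d2 = [0.25, 2.0, -0.5]  # layer 2: kappa_2 = ||W2||_op = 2
k1 = max(abs(x) for x in d1)
k2 = max(abs(x) for x in d2)

random.seed(0)
ok = True
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(3)]
    y = [random.uniform(-1, 1) for _ in range(3)]
    fx = apply_diag(d2, apply_diag(d1, x))  # two-layer "network" on x
    fy = apply_diag(d2, apply_diag(d1, y))
    lhs = norm2([a - b for a, b in zip(fx, fy)])
    rhs = k1 * k2 * norm2([a - b for a, b in zip(x, y)])
    ok = ok and (lhs <= rhs + 1e-9)

print(ok, k1 * k2)  # -> True 6.0
```

The diagonal case is just a convenience so no linear-algebra library is needed; for a general W the Lipschitz constant is the largest singular value, exactly as in the definition above.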
So that's a high-level [00:10:48] summary. It's kind of abstract, but let me [00:10:54] dive into the details. So let's try to abstract this: [00:11:01] each layer of f is an f_i— so basically f_i corresponds to [00:11:13] a matrix multiplication plus an activation layer. [00:11:22] Right, so this is one layer, and then you can consider f as [00:11:27] the composition of f_r with f_{r-1}, [00:11:32] and so on and so forth. So basically, for [00:11:36] every layer we have certain choices— you can choose your weight matrix— and then [00:11:39] you compose all of these function classes by this composition. I guess we have used [00:11:43] this notation multiple times; it really just means that you're looking at [00:11:50] f_r composed with f_{r-1}, composed with ... f_1, where each f_i is from [00:11:57] the family capital F_i. Okay, so [00:12:00] this abstraction will allow us to have much cleaner [00:12:08] notation, but fundamentally you can usually just think of each of the F_i's [00:12:14] as a single layer. And what we know— so suppose, let's say, [00:12:25] for the sake of preparation, suppose [00:12:32] for every f_i in capital F_i, [00:12:38] f_i is kappa_i-Lipschitz. This is actually the case for us, because we [00:12:42] restricted the spectral norm— the operator norm— of each of the W's to [00:12:45] be less than kappa_i; that means [00:12:46] every layer is kappa_i-Lipschitz, right? And [00:12:52] the ReLU is 1-Lipschitz, so even if you compose with the activation it's [00:12:56] still kappa_i-Lipschitz. So suppose each of these functions is kappa_i-Lipschitz; then you know that [00:13:05] ||f_i(x) − f_i(y)||_2— these are just some [00:13:07] preparations— [00:13:10] is less than kappa_i ||x − y||_2. And maybe, just for simplicity, [00:13:21] suppose f_i(0) equals zero. This is also [00:13:26] the case in the real setting we care about, when you have a ReLU [00:13:27] network. And also let's suppose [00:13:33] that ||x_i|| is less than C— this is also our [00:13:38] assumption. So then, with all of this, we know a bunch of [00:13:43] basic things. For example, you can bound: [00:13:52] what is the composition of layers applied to x_i— what's the bound on the norm? [00:14:00] The bound on the norm: each time you apply a layer you pick up at most [00:14:06] a kappa factor, so you get kappa_i times [00:14:08] kappa_{i-1} [00:14:11] times kappa_{i-2}, and so forth, [00:14:14] times kappa_1 times C. [00:14:17] So we call this [00:14:23] C_i— let's define this to be equal [00:14:26] to C_i. So basically this is [00:14:31] some preparation: under this abstraction, you know some [00:14:33] bound on the output of each layer, and you know each [00:14:37] layer is Lipschitz. [00:14:39] And [00:14:41] what we're going to do is two things— [00:14:45] two steps. [00:14:48] So first, you control [00:14:52] the covering number of each layer. [00:15:04] And second, [00:15:06] you have a combination lemma: you combine— you compose— them [00:15:14] together. [00:15:17] Like, you have a lemma that turns [00:15:21] a single-layer covering number [00:15:25] bound [00:15:26] into a multi-layer one. [00:15:31] (Answering a question:) not really. [00:15:39] So, [00:15:42] I think number one is kind of easy, because number one [00:15:44] is just a linear model composed [00:15:46] with a 1-Lipschitz activation, right? With a [00:15:49] 1-Lipschitz activation you can just invoke [00:15:50] what we have discussed last time, [00:15:54] right? [00:15:57] So basically the important thing is: how do you turn a single-layer [00:15:59] covering number bound into a multi-layer [00:16:01] covering number bound?
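The counting idea behind step two can be previewed with toy arithmetic (the cover sizes below are invented for illustration, not from the lecture): covering the composition by a union of per-center covers makes the cover sizes multiply, so the log covering numbers add.

```python
import math

# Step-2 bookkeeping sketch (made-up sizes, not from the lecture):
# C1 covers layer 1; for each center f1' in C1 we build a cover of
# {f2 o f1' : f2 in F2} of size at most size_C2.  The union of these
# per-center covers handles every two-layer output, so the total cover
# size is at most size_C1 * size_C2 -- i.e. log covering numbers add.
size_C1 = 8
size_C2 = 16
size_composed = size_C1 * size_C2
print(size_composed)  # -> 128
print(math.isclose(math.log(size_composed),
                   math.log(size_C1) + math.log(size_C2)))  # -> True
```

This is only the size bookkeeping; the other half of the mechanism, tracking how the cover radius degrades through later layers, is what the lemma below handles.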
That's basically [00:16:03] the main thing I'm going to discuss. [00:16:05] So there's a lemma that [00:16:08] does this. So, [00:16:12] under the [00:16:15] setup [00:16:20] above— the relatively [00:16:27] abstract setup above— assume that, [00:16:29] for every input [00:16:37] whose L2 norm [00:16:47] is less than C_{i-1}— these inputs are used to [00:16:55] define [00:16:56] the metric L2(P_n), right; these are the input values on [00:17:05] which we are evaluating your covering [00:17:06] number, right? To define a covering [00:17:08] number you have to define the metric, [00:17:10] define which empirical inputs you're [00:17:12] evaluating on, right? So I'm assuming that, [00:17:14] for every input with this norm constraint, [00:17:16] you have a covering number bound: [00:17:24] so you know that [00:17:25] the log covering number log N(epsilon_i, [00:17:29] F_i, [00:17:31] L2(P_n)) [00:17:33] is less than some function of this epsilon_i and C_{i-1}—
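In symbols, the single-layer assumption being described can be written as follows (a hedged reconstruction from the spoken description; g is the generic bounding function the lecture names):

```latex
% Single-layer covering assumption: for empirical inputs satisfying
% \|x_i\|_2 \le C_{i-1},
\log N\bigl(\epsilon_i,\ \mathcal{F}_i,\ L_2(P_n)\bigr)
  \;\le\; g(\epsilon_i,\, C_{i-1}).
% For example, instantiated for a linear class this behaves roughly like
% g(\epsilon, C) \approx C^2 / \epsilon^2.
```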
some function of the norm and some function of the target [00:17:40] radius. [00:17:41] So this is just an assumption— this is your [00:17:43] single-layer bound; basically, this is assuming that you have a single-layer bound. [00:17:54] So suppose you have a single-layer [00:17:57] covering number bound of this form— and you do have this bound, it's just that I [00:18:01] didn't give you the exact formula, right? So if you instantiate it on a linear model [00:18:05] you're going to get something like C_{i-1} [00:18:08] squared over epsilon squared— the [00:18:11] norm of the input squared over epsilon [00:18:13] squared, right? So that's what [00:18:15] happens when you have linear models. But suppose you have the single-layer covering [00:18:19] number bound; then [00:18:22] the conclusion is that you can turn this [00:18:24] into a multi-layer covering number bound. [00:18:26] And the form of this [00:18:28] translation is not very clean, but [00:18:32] it's like this: [00:18:33] there exists an epsilon-cover [00:18:37] of F_r composed [00:18:41] with ... F_1, [00:18:46] for epsilon equal to [00:18:49] the following expression. [00:19:07] Right— oh, one more— let me finish. Okay. [00:19:25] I'm assuming a generic g here, but [00:19:29] actually you can [00:19:30] just follow the abstraction, right? [00:19:32] When you really use it for [00:19:34] linear models it's going to be something [00:19:36] like C_{i-1} squared over epsilon squared. [00:19:39] So [00:19:43] this is g. [00:19:45] And then— so there exist [00:19:48] epsilon-covers of this size, such that [00:19:53] the log size [00:19:58] of this cover [00:20:01] is bounded by [00:20:03] the sum of [00:20:05] g(epsilon_i, C_{i-1}), [00:20:09] summing i from 1 to r. [00:20:12] So basically, if you have a log covering [00:20:15] number bound of this form for every [00:20:17] layer, then you have a log covering [00:20:19] number bound for this composition, [00:20:22] and the bound is: the log covering [00:20:24] numbers just add up. [00:20:27] But the tricky thing is that [00:20:31] the cover radius also grows— the radii also kind of add up in [00:20:37] some way, which is a little bit [00:20:39] complicated: [00:20:40] the radius is kind of [00:20:42] added up in some way where you also multiply some of these [00:20:47] Lipschitz factors, [00:20:50] in some sense, and your log covering [00:20:54] number is also added up in some way. [00:20:56] So this is the fundamental mechanism for [00:20:58] us to turn a single-layer bound into [00:21:00] multi-layer bounds. Of course, I'm going to use this in some [00:21:04] way at the end so that we get the final [00:21:06] result, because you have to choose what the [00:21:08] epsilon_i's are, right? So eventually [00:21:10] what you do is you're going to [00:21:12] choose the epsilon_i's so that you get the [00:21:15] desired target radius, and you work [00:21:17] out what exactly this formula should be [00:21:19] for that particular choice of [00:21:21] epsilon_i's. [00:21:25] Does that make sense so far? Right. So [00:21:28] before doing that, I'm going to [00:21:30] first prove this lemma, and then I'm [00:21:32] going to do the derivations, right? So [00:21:35] after you have this— this is the core— [00:21:36] after that, it's just choosing [00:21:38] parameters: you choose the epsilon_i's in [00:21:40] some way that is in your favor, and [00:21:43] work out what the final bound is. [00:21:48] (Question from a student.) [00:21:57] I mean, sometimes the interpretation of [00:21:59] this lemma is that you somehow [00:22:01] can add up the covering number [00:22:03] bounds— the log covering number bounds— in this [00:22:05] way, [00:22:06] as long as you pay some additional [00:22:09] radius. [00:22:11] Okay, so this proof is, in some sense— [00:22:17] in some sense it's
actually pretty [00:22:21] simple, but the exposition [00:22:24] requires some— it's a little bit challenging. [00:22:27] So the fundamental idea is the [00:22:29] following. [00:22:31] We start with the data points— [00:22:34] we start with this [00:22:36] concatenation of the n data points, right? You have n data points, and you map these [00:22:40] n data points to [00:22:42] a set of points, right— this is the set Q [00:22:45] that we talked about. [00:22:48] I think I need to draw this in a [00:22:50] good way so that [00:22:55] I have more space. So let's [00:22:57] start: [00:22:59] you start with n points, and you map [00:23:01] these n points into a vector of [00:23:04] dimension n— [00:23:05] or, actually, a matrix— so [00:23:10] you map this to [00:23:14] some space, and each point here [00:23:18] is the concatenation of f(x_1) [00:23:21] up to f(x_n), [00:23:24] and this is the so-called Q— the set Q, [00:23:27] right, that we have to cover. [00:23:29] And you can use multiple [00:23:31] functions f, right? You can use [00:23:33] any function f_1 in F_1 to map to [00:23:36] a different point— [00:23:38] if you choose different f_1's, you [00:23:41] map to different points. And if you just have one layer, what [00:23:43] we're going to do is [00:23:44] cover this set Q, right? That's [00:23:46] what we do for the covering number for one [00:23:48] family of functions F_1, [00:23:52] right. So [00:23:56] then what you do is— [00:23:58] I'm just basically reviewing what [00:24:01] we have done for covering numbers for one [00:24:02] family of functions— you create these [00:24:06] balls, right, that cover it. [00:24:10] So basically you create [00:24:12] the centers, [00:24:15] and these are points that are in C— [00:24:18] right, maybe let's call it C_1. So let's [00:24:20] say C_1 [00:24:23] is an epsilon_1-cover [00:24:27] of F_1— right, that's what that [00:24:30] means. So now we are going [00:24:34] to see: how do we turn this into a cover [00:24:37] for F_2 composed with F_1? [00:24:39] So that's kind of the job we are [00:24:41] trying to do. And what's really [00:24:43] going on here is that, for every point [00:24:46] here [00:24:48] in the output [00:24:51] space— so maybe [00:24:53] let's call this Q_1, which is [00:24:55] the output space of— [00:24:59] maybe let's call this thing capital X— [00:25:02] so then [00:25:04] Q_1 is the family of outputs [00:25:07] where the function has to be chosen from [00:25:11] F_1, [00:25:12] right. So what about [00:25:16] composing: if you have F_2, another layer, what happens is [00:25:18] that, for every point in Q_1, you can [00:25:21] apply multiple different functions, right? [00:25:24] For any function little f_2 in capital [00:25:26] F_2, you can apply it to map to a new [00:25:28] point in the new space, right? [00:25:31] So for [00:25:34] every point here you get a bunch of [00:25:35] possible outputs, and for [00:25:37] every point here you get another bunch [00:25:39] of possible outputs, right? So each of [00:25:42] these new points could be your image [00:25:44] after applying [00:25:48] two layers. [00:25:50] Right. So now we're [00:25:52] trying to cover this new set of [00:25:55] outputs— Q_2, let's say. And how do we cover it? So the [00:25:57] approach that we are going to take is [00:25:59] somewhat, [00:26:03] in some sense, pretty brute [00:26:05] force. What you do is you say you want to [00:26:07] leverage the existing cover for capital [00:26:09] F_1 in some way. So what you do is you say: [00:26:12] you look at a center here in C_1, [00:26:16] and you look at what's the image of this [00:26:19] point [00:26:23] after applying the second layer. So you get [00:26:25] something like this— [00:26:26] right, so this is the set of the images of [00:26:32] this point. So maybe, let's say, suppose [00:26:37] this point is called— [00:26:42] let's call it f_1'(X), [00:26:45] which is in C_1. [00:26:48] And then you look at all the outputs [00:26:50] from f_1'(X), so you get this [00:26:54] family of points
where you apply F2 [00:26:58] family of points where you apply F2 on F1 Prime X [00:27:02] on F1 Prime X and where F2 can be chosen arbitrarily [00:27:04] and where F2 can be chosen arbitrarily from F2 [00:27:07] from F2 okay so and now what we do is that we [00:27:11] okay so and now what we do is that we cover [00:27:12] cover this set [00:27:14] this set by a new Epson cover so what you do is [00:27:17] by a new Epson cover so what you do is you say [00:27:18] you say I'm going to cover this [00:27:21] I'm going to cover this with a with a bunch of things [00:27:25] and what does that mean that really [00:27:27] and what does that mean that really means that you choose a subset of of [00:27:29] means that you choose a subset of of capital F2 [00:27:30] capital F2 and and cover because here you are [00:27:33] and and cover because here you are ranging over all possible functions in [00:27:35] ranging over all possible functions in F2 right so if you want to choose the [00:27:37] F2 right so if you want to choose the cover for it you just say I'm going to [00:27:39] cover for it you just say I'm going to drop some of them I choose a subset a [00:27:41] drop some of them I choose a subset a discretization of capital of capital F2 [00:27:45] discretization of capital of capital F2 so that's basically [00:27:47] so that's basically uh the approach so and [00:27:51] uh the approach so and and you you do this and you and then you [00:27:53] and you you do this and you and then you do this for every possible points in C1 [00:27:56] do this for every possible points in C1 and cover them so basically suppose you [00:27:58] and cover them so basically suppose you have another point in C1 [00:28:01] have another point in C1 here and then you look at all of like [00:28:04] here and then you look at all of like as image and you do another set of cover [00:28:10] and you do this for every possible point [00:28:12] and you do this for every possible point in C1 and every 
possible points in C1 [00:28:14] in C1 and every possible points in C1 have induced a set and that side can [00:28:17] have induced a set and that side can induce a cover and then you take the [00:28:19] induce a cover and then you take the unit of all of these right balls into [00:28:23] unit of all of these right balls into um uh and that the unity of all of its [00:28:26] um uh and that the unity of all of its right balls becomes a cover for the for [00:28:28] right balls becomes a cover for the for the Q2 right for example suppose you [00:28:30] the Q2 right for example suppose you have [00:28:31] have let's say [00:28:33] let's say F1 prime prime X here [00:28:36] F1 prime prime X here right then maybe I should use a [00:28:38] right then maybe I should use a consistent color [00:28:41] let's say you have a [00:28:45] F1 prime prime X and this is mapped to [00:28:48] F1 prime prime X and this is mapped to the set of points [00:28:51] the set of points here this is the set of all F2 F1 prime [00:28:54] here this is the set of all F2 F1 prime prime X [00:28:56] prime X where F2 is in capital F2 and then you [00:28:59] where F2 is in capital F2 and then you create a cover [00:29:01] create a cover for this set [00:29:02] for this set so that you can discretize F2 again [00:29:06] so that you can discretize F2 again and then you take the unit of all of [00:29:08] and then you take the unit of all of this red cover of this white balls as [00:29:11] this red cover of this white balls as your cover for Q2 [00:29:13] your cover for Q2 so so any questions so far [00:29:16] so so any questions so far so formally what we should do is the [00:29:18] so formally what we should do is the following so [00:29:22] um [00:29:26] so Epson one and Epsilon are [00:29:29] so Epson one and Epsilon are are the [00:29:32] are the radius [00:29:34] radius uh on each layer [00:29:37] uh on each layer these are just pre and which are TBD you [00:29:40] these are just pre and which are 
[00:29:42] Well, at this level they're not really TBD — they're just given to you; eventually you'll choose something, and I'll choose some numbers for them. Then what you do is this: C1 is an ε1-cover of F1 — that's easy. And then you say: for every F1' in C1, construct C2,F1', a cover in the second space — this cover depends on F1' — which ε2-covers the set F2 ∘ F1', which is what I wrote above: {f2 ∘ F1' : f2 in capital F2}. Each such set is literally one of the blue blobs I drew here, and for it I choose a cover, denoted C2,F1' because it depends on F1'. And then I take the union: I define C2 to be the union of all these C2,F1' over F1' in C1. So that's how I construct the cover for the second layer — this C2 is supposed to be a cover for capital F2 composed with capital F1. [00:31:49] Okay, any questions so far? So there are several questions. One question is: how good is this cover — what's its radius? And the other is: how large is this cover? The size of the cover is relatively easy to compute, because you're basically blowing up the size multiplicatively: for every element of C1 you create one of these covers, so you just multiply the covering numbers together, in some sense.
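The two-step construction — cover the first layer, then cover each induced second-layer set and take the union — can be sketched with a toy greedy cover. Everything below (the point clouds standing in for the sets {f2 ∘ F1'}, the function names) is a hypothetical illustration I'm supplying, not the lecture's notation:

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedily pick centers so every point lies within `eps`
    (Euclidean distance) of some chosen center: an eps-cover."""
    centers = []
    for p in points:
        if not any(np.linalg.norm(p - c) <= eps for c in centers):
            centers.append(p)
    return centers

rng = np.random.default_rng(0)
# Stand-in for C1: a few first-layer cover elements, each drawn as a point.
f1_cover = [rng.normal(size=3) for _ in range(4)]
# For each F1' in C1, mimic the induced set {f2(F1'(X)) : f2 in F2} by a cloud.
clouds = [[f1p + 0.1 * rng.normal(size=3) for _ in range(50)] for f1p in f1_cover]

eps2 = 0.15
per_f1_covers = [greedy_cover(cloud, eps2) for cloud in clouds]  # one C2,F1' each
c2 = [c for cover in per_f1_covers for c in cover]               # C2 = union
```

By construction every second-layer point is within eps2 of some element of c2, and |C2| is at most the sum of the per-F1' cover sizes — which is the multiplicative size bound discussed next.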
[00:32:29] Concretely, you can say that log |C2,F1'| is bounded by G(ε2, c1). This is my assumption: as long as your input is bounded by c1 and your covering radius is ε2, you have this bound. That means the size of C2 is bounded by the size of C1 times exp(G(ε2, c1)), because for every point in C1 you have a bound for that corresponding set's cover, and then you just multiply by |C1|. And that means log |C2| ≤ log |C1| + G(ε2, c1), which equals G(ε1, c0) + G(ε2, c1). [00:33:37] Oh — I guess I forgot to define c0. Just for convenience, define c0 to be the bound on the input. So the c_i's are the bounds on the activation layers, and c0 is the bound on the inputs. Okay. So basically the log sizes just add up — that's easy — and we'll deal with how the covering radius works at the end. But before computing the covering radius, let's define how you proceed with more layers. Similarly: given C_k, supposing you have covered k layers, you now construct a cover for the (k+1)-th layer. What you do is say: for any F_k' ∘ F_{k-1}' ∘ … ∘ F_1' in C_k, you construct some C_{k+1}, which is a function of F_k' up to F_1', that ε_{k+1}-covers the set F_{k+1} ∘ F_k' ∘ … ∘ F_1'. So I define C_{k+1}, the final cover, to be the union of all these sets. And similarly you can prove that log |C_{k+1}| is at most the sum of all the single-layer covers: G(ε_{k+1}, c_k) + … + G(ε1, c0). [00:36:11] All right, so I've shown you how to cover it — it's just an iterative cover, pretty brute-force in some sense. And now the question is why this is a good cover — what's the radius? So basically we need to answer the question: for every f_R ∘ … ∘ f_1 belonging to F_R ∘ … ∘ F_1 — that's the set we're covering — you pick a function in this set, and you want to say it can be approximated by something in the cover, with some small distance. So how does that work?
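The size bookkeeping above is just multiplication turning into addition of logs. A minimal sketch with made-up per-layer cover counts n_i, where log n_i plays the role of G(ε_i, c_{i-1}):

```python
import math

# Hypothetical per-layer cover sizes for a 3-layer composition.
n = [20, 12, 5]

sizes = [n[0]]                        # |C_1| <= n_1
for n_next in n[1:]:
    sizes.append(sizes[-1] * n_next)  # |C_{k+1}| <= |C_k| * n_{k+1}

log_total = math.log(sizes[-1])       # log|C_R| <= sum_i log n_i
```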
[00:36:52] So you first say: there exists an F1' in C1 such that ρ(f1, F1') ≤ ε1 — that's something you know, because C1 is an ε1-cover of capital F1. Now let's say you try to pick something in C2 that can cover f2 ∘ f1. How? You basically use the construction. Maybe I should draw this a little more. [00:37:34] So the first thing: suppose you have a point here, which is f1(X), and you cover it by this point. Now suppose you have a point in the second layer, somewhere here, which is computed from that f1(X) — you apply some f2 to it. What you do is: you first look at the neighbor in the first layer — you've got this point — and then you look at its image in the second layer, maybe something here. Here you're applying f2, so let's use the same f2 on the cover point, and you get this point. And after you've got this point, you look at its neighbor in the cover C2,F1'. So you've got this one — and basically this will be the cover element for the purple point. I'm not sure whether this makes sense — sounds good?
[00:39:10] Okay, so in other words, more formally: you want to say there exists a function in this cover C2,F1'. Let me write it as F2' ∘ F1', just to make it look like what my cover is doing: the cover first applies F1', then uses the discretization of F2 on top. So you have this function in the cover, of the form F2' ∘ F1' — where this F2' actually implicitly depends on F1' as well, but let's ignore that in the notation — such that ρ(F2' ∘ F1', f2 ∘ F1') ≤ ε2. But this point f2 ∘ F1' is not what you really want to cover, because you want to cover f2 ∘ f1. So what you care about is ρ(F2' ∘ F1', f2 ∘ f1) — the difference between the cover element and f2 ∘ f1. And you can see there are still some differences, because here we have F1' and not f1. That's why you do a triangle inequality: the target is at most ρ(F2' ∘ F1', f2 ∘ F1') + ρ(f2 ∘ F1', f2 ∘ f1) — you use f2 ∘ F1' as the intermediate term. And the first piece is ≤ ε2.
[00:42:17] And you are left with the term where you only differ in the first layer. That difference gets propagated, in some sense: if you look at this figure — the figure is a little tricky — this is the difference in the first layer, but once you apply this f2, you get a bigger difference in the second layer. The difference can blow up a little, because even though you apply the same function, it may stretch the distance. That's why you have to use Lipschitzness to say that this is at most κ2 · ρ(F1', f1), and so the total is at most ε2 + κ2 · ε1. That's how you bound the covering radius for the second layer. [00:43:27] Any questions? And you can do all of this similarly for layer k: there exists a function F_k', which depends on F_1' up to F_{k-1}', in this set C_k — let's write the cover element as F_k' ∘ F_{k-1}' ∘ … ∘ F_1' — such that the distance ρ(F_k' ∘ F_{k-1}' ∘ … ∘ F_1', f_k ∘ F_{k-1}' ∘ … ∘ F_1') ≤ ε_k. That's the definition of the cover. And then you have to see why this is a good thing for the original function — recall that this is not actually what you care about: you care about f_k ∘ f_{k-1} ∘ … ∘ f_1, without the primes.
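The ε2 + κ2·ε1 bound for the second layer can be checked numerically. Here ρ is the sup distance over sample inputs, the particular f1, F1', and f2 are arbitrary stand-ins of my choosing (not the lecture's), and f2 is κ2-Lipschitz because tanh is 1-Lipschitz:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)             # sample inputs defining the metric

def rho(u, v):
    return float(np.max(np.abs(u - v)))     # sup distance on the sample

f1_out  = np.sin(3 * x)                     # f1(X)
f1p_out = f1_out + 0.05 * np.cos(5 * x)     # F1'(X), so rho(f1, F1') = eps1
eps1 = rho(f1_out, f1p_out)

kappa2 = 2.0
f2 = lambda u: kappa2 * np.tanh(u)          # a kappa2-Lipschitz second layer

eps2 = 0.03
# An element of C2,F1' within eps2 of f2 ∘ F1' (perturbation kept below eps2):
f2p_of_f1p = f2(f1p_out) + 0.9 * eps2 * np.sin(7 * x)

# Triangle inequality + Lipschitzness give the layer-2 radius bound:
total = rho(f2p_of_f1p, f2(f1_out))
bound = eps2 + kappa2 * eps1
```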
[00:44:41] So this is the thing you really care about: you want to show that ρ(F_k' ∘ … ∘ F_1', f_k ∘ … ∘ f_1) is small. And how do you do this? You expand it into multiple terms. [00:45:08] You first compare with f_k ∘ F_{k-1}' ∘ … ∘ F_1' — this kind of telescoping sum is actually pretty useful in many cases — then you compare with f_k ∘ f_{k-1} ∘ F_{k-2}' ∘ … ∘ F_1', and you just gradually peel off more and more primes, until finally everything has no prime, and eventually you get exactly what you care about, with no primes at all. So this is just the triangle inequality, and now you bound each of these terms. The first term, by definition of the cover, is ≤ ε_k. For the second term, you see these two agree in f_k, and this lower part is also the same — the only difference comes from F_{k-1}' versus f_{k-1}. Because of the cover, that gives you ε_{k-1}, and then you also have to blow it up a little because of the f_k composed on top of it, so you also pay the Lipschitzness of f_k, which is a κ_k factor: κ_k ε_{k-1}. (Sorry — lowercase κ and capital K look almost the same.) Then you have ε_{k-2} · κ_{k-1} · κ_k, and so forth, until in the last term the only difference comes from the first layer: you pay ε_1, because F_1' is from the ε1-cover, and then you pay a lot of Lipschitz factors: κ_k · κ_{k-1} ⋯ κ_2.
[00:47:32] And if you take k to be R, you get the eventual statement: the radius for the final covering is ε_R + κ_R ε_{R-1} + κ_R κ_{R-1} ε_{R-2} + … + κ_R ⋯ κ_2 ε_1 — that is, Σ_i ε_i · κ_{i+1} ⋯ κ_R. That's eventually what you get. Any questions? [00:48:50] So the question is: why do you have to pay all of these terms — why isn't requiring just the first-layer term enough? Let me try whether this works: suppose your function class were F_1 composed with fixed functions — a fixed f_2, composed with a fixed f_3, and so on up to f_R, all fixed. Then you'd only have that one term, κ_R ⋯ κ_2 ε_1. But here you also have to cover the possibilities for the second layer, the third layer, and so forth — that's why you have to pay the other terms.
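The final radius can be computed either from the closed form Σ_i ε_i κ_{i+1}⋯κ_R or by iterating the per-layer step r_k = ε_k + κ_k r_{k−1}; a quick sketch confirming the two agree (the numbers are arbitrary):

```python
import math

def radius_closed_form(eps, kappa):
    """sum_i eps_i * prod_{j > i} kappa_j, layers listed first-to-last."""
    R = len(eps)
    return sum(eps[i] * math.prod(kappa[i + 1:]) for i in range(R))

def radius_recursive(eps, kappa):
    """Per-layer step: covering layer k costs eps_k, plus the previous
    radius blown up by the Lipschitz constant kappa_k."""
    r = 0.0
    for e, k in zip(eps, kappa):
        r = e + k * r
    return r

eps_list   = [0.1, 0.2, 0.05]
kappa_list = [3.0, 2.0, 4.0]   # kappa_1 never multiplies anything
```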
[00:49:49] Okay, cool — so now we are done with this lemma, and let's go back to the proof of the theorem. The proof of the theorem, as I alluded to before, is pretty much just a kind of annoying calculation in some sense. There is a way to do the calculation in a simpler way, but I'm going to first show you a "zero-knowledge" proof: I'm basically just going to tell you that I choose my ε_i to be this, and it just works out — and then I'll show you some way to motivate it. At least that's what I would do if I wrote a paper: show the first proof, which is just choosing some ε_i. So let's start with that. [00:50:31] Basically, everything is about choosing the ε_i. First, you know that G(ε_i, c_{i-1}) = Õ(c_{i-1}² B_i² / ε_i²), because each F_i is a linear model composed with a fixed 1-Lipschitz function. [00:51:00] Recall that each F_i is a linear model composed with one fixed 1-Lipschitz function, and for the linear model the log covering number is supposed to be something like (norm of the input)² times (norm of the parameter)² over the radius squared, where B_i = ‖W_i^⊤‖_{2,1} is the norm of the parameter. This is what we had last time — we didn't prove it, but it's the lemma from last time about the log covering number of linear models. So we plug this in, and then basically you have two quantities: one is the log cover size, which is Σ_i c_{i-1}² B_i² / ε_i², and the other is the radius, which is Σ_i ε_i · κ_{i+1} ⋯ κ_R. [00:52:18] So you basically have these two things to trade off: you want to find the best balance of dependencies between them — you want the log cover size to depend on the radius as well as possible. So you just choose some ε_i — you care about the best trade-off of the dependencies. [00:52:42] So if I give you a zero-knowledge proof, I'm going to choose ε_i = (c_{i-1}² B_i² / (κ_{i+1} ⋯ κ_R))^{1/3} × ε / [Σ_j (c_{j-1} B_j)^{2/3} (κ_{j+1} ⋯ κ_R)^{2/3}].
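This balancing choice can be sketched directly. One caveat: the exact normalizer is garbled in the transcript, so the version below is a reconstruction — the standard Lagrange-balanced choice ε_i ∝ (c_{i-1}² B_i² / Π_{j>i} κ_j)^{1/3}, rescaled so the radius comes out to exactly ε — not a verbatim formula from the lecture:

```python
import math

def choose_eps(c_prev, b, kappa, eps):
    """Per-layer radii eps_i ∝ (c_{i-1}^2 B_i^2 / prod_{j>i} kappa_j)^(1/3),
    scaled so that sum_i eps_i * prod_{j>i} kappa_j == eps.
    c_prev[i] plays c_{i-1}, b[i] plays B_i, kappa[i] plays kappa_i (0-indexed)."""
    R = len(b)
    lam = [math.prod(kappa[i + 1:]) for i in range(R)]   # prod_{j>i} kappa_j
    raw = [(c_prev[i] ** 2 * b[i] ** 2 / lam[i]) ** (1 / 3) for i in range(R)]
    z = sum(r * l for r, l in zip(raw, lam))             # normalizer
    return [eps * r / z for r in raw]

# Arbitrary illustrative constants for a 3-layer network.
eps_i = choose_eps([1.0, 2.0, 1.5], [3.0, 1.0, 2.0], [2.0, 3.0, 1.5], eps=0.7)
```

Plugging these ε_i back into Σ_i c_{i-1}² B_i² / ε_i² collapses the log cover size to (Σ_i (c_{i-1} B_i κ_{i+1}⋯κ_R)^{2/3})³ / ε², which is the shape of bound the trade-off is aiming for.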
[00:53:29] All right. So if I choose ε_i to be this, then I claim that Σ_i ε_i κ_{i+1}···κ_r will indeed equal ε. And why is that? I'm going to do this tedious derivation for you — you don't really need to verify it on the fly, and you don't necessarily have to verify it later either, but just for the sake of completeness let me do the calculation. [00:54:04] You just plug ε_i in here, and each term gives you a c_{i-1}^{2/3} b_i^{2/3} — that comes from these two factors — times (κ_{i+1}···κ_r)^{2/3}, times ε/Z, where I'm treating the normalizing sum Z in the denominator as a constant for the moment in this derivation.

[00:55:07] And then one thing to note is that c_{i-1} is also a function of the κ's, because recall that c_{i-1} is the norm bound for the first i−1 layers, which is the product κ_1···κ_{i-1}. (Question: is the i here the same i? — Oh, sorry, you're right: one should really use a different index j inside the sum just for the sake of completeness; but after doing the sum and the product the inner index is gone anyway.) [00:56:13] So let's put b_i in front and plug in c_{i-1} = κ_1···κ_{i-1}: each term becomes

  b_i^{2/3} · (κ_1···κ_{i-1})^{2/3} · (κ_{i+1}···κ_r)^{2/3} · ε/Z
  = (b_i/κ_i)^{2/3} · (Π_{l=1}^{r} κ_l^{2/3}) · ε/Z.

[00:57:00] You can see that the only missing factor is κ_i, which is why you divide and multiply by κ_i^{2/3}. Now sum over i from 1 to r: the sum of (b_i/κ_i)^{2/3} times the product of the κ_l^{2/3} cancels exactly against the Z in the denominator, so the whole thing really equals ε.

[00:57:31] And the log covering size is Σ_i c_{i-1}² b_i² / ε_i². You plug the same ε_i in here — maybe let's call that gigantic normalizing constant Z — so you get (Z²/ε²) times Σ_i c_{i-1}² b_i² · (c_{i-1}² b_i² / (κ_{i+1}···κ_r))^{−2/3}, which is (Z²/ε²) Σ_i (c_{i-1} b_i)^{2/3} (κ_{i+1}···κ_r)^{2/3}. [00:59:42] (I think I jumped a step in my notes here, sorry.) You plug in the definition of c_{i-1} again, and each term is (b_i/κ_i)^{2/3} times the product of the κ_l^{2/3}, so the sum is Z again, and eventually you get

  log covering size ≤ Z³/ε² = (Π_{i=1}^{r} κ_i²) · (Σ_{i=1}^{r} (b_i/κ_i)^{2/3})³ / ε².

[01:00:38] Okay — so I guess maybe this is a good demonstration of why I shouldn't do this live: even having verified it against my notes, which have almost all the steps, it's kind of tricky. But anyway, before talking about how to do this better, let's first agree that this is done: the log covering size is bounded by something over ε², and that's what we wanted to have. Then you apply the Rademacher-complexity tool from covering numbers — recall the lemma from last time.
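The covering-to-Rademacher "small tool" recalled here is, in its standard form, Dudley's entropy integral; a sketch of why a log covering number of R/ε² gives roughly √(R/n) (the constants 4 and 12 are from one common statement of the chaining bound, not from this lecture):

```latex
% Dudley's entropy integral: for a class F with empirical L2 diameter D,
\[
  \hat{R}_S(\mathcal{F}) \;\le\; \inf_{\alpha > 0}
  \left( 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{D}
  \sqrt{\log N(\varepsilon, \mathcal{F})} \, d\varepsilon \right).
\]
% Plugging in \log N(\varepsilon, \mathcal{F}) \le R/\varepsilon^2:
\[
  \int_{\alpha}^{D} \frac{\sqrt{R}}{\varepsilon}\, d\varepsilon
  = \sqrt{R}\,\log\frac{D}{\alpha},
  \qquad\text{so}\qquad
  \hat{R}_S(\mathcal{F}) \;\lesssim\; 4\alpha
  + \frac{12\sqrt{R}}{\sqrt{n}}\log\frac{D}{\alpha},
\]
% and taking \alpha = D/\sqrt{n} gives, up to log factors,
% \hat{R}_S(\mathcal{F}) = \tilde{O}\!\big(\sqrt{R/n}\big).
```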
[01:01:22] If you have log covering number R/ε², then this means the Rademacher complexity is something like √(R/n) — this is what we discussed last time. And if you apply this small tool, the Rademacher complexity will be the square root of this quantity over n, and then you are done.

[01:01:46] Okay, so we are done — but I want to share how to do this a little more easily, without going through all of that pain. It's a small trick, a purely mathematical trick; I don't know how many of you know it — maybe you all know it, maybe you all don't — but anyway, let's talk about it. [01:02:06] So basically the question is: you care about the trade-off between these two quantities. What you can do is abstract it. Abstractly speaking, this is the trade-off between something like — let's use different letters —

  Σ_i α_i²/ε_i²   versus   Σ_i β_i ε_i.

That's kind of the game you're dealing with. And how do you do the trade-off? [01:03:05] You use the so-called Hölder inequality. There are multiple ways to write it: for example, the inner product of a with b is at most the p-norm of a times the q-norm of b, when p and q satisfy 1/p + 1/q = 1. Or you can write it as

  Σ_i a_i b_i ≤ (Σ_i a_i^p)^{1/p} · (Σ_i b_i^q)^{1/q}.

It's exactly the same thing, and by the way, when p = 2 this is the Cauchy–Schwarz inequality. We need something slightly different: p = 3, where q = 3/2, which gives

  Σ_i a_i b_i ≤ (Σ_i a_i³)^{1/3} · (Σ_i b_i^{3/2})^{2/3}.

In some sense all of these inequalities deal with trade-offs of this form. [01:04:28] I guess — which one should I do first? I'm not sure whether I've lost you, so let me just give you an overview of what I eventually want to do: I want to say that this quantity, multiplied by (Σ_i β_i ε_i)², is at least (Σ_i (α_i β_i)^{2/3})³. If you're able to do this, then you cancel out the ε_i's. So — sorry — maybe let me do this first.
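A quick numeric check of the p = 3, q = 3/2 Hölder inequality and of the cubed trade-off form it implies (random illustrative values, just to confirm the exponents line up):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.1, 2.0, 8)
b = rng.uniform(0.1, 2.0, 8)

# Hölder with p = 3, q = 3/2:
#   sum a_i b_i <= (sum a_i^3)^(1/3) * (sum b_i^(3/2))^(2/3)
lhs = np.sum(a * b)
rhs = np.sum(a**3) ** (1 / 3) * np.sum(b**1.5) ** (2 / 3)
print(lhs <= rhs + 1e-12)    # True

# The trade-off form: substituting a_i^3 = alpha_i^2/eps_i^2 and
# b_i^(3/2) = beta_i * eps_i makes a_i b_i = (alpha_i beta_i)^(2/3), so
#   (sum alpha_i^2/eps_i^2) * (sum beta_i eps_i)^2
#       >= (sum (alpha_i beta_i)^(2/3))^3
alpha = rng.uniform(0.1, 2.0, 8)
beta = rng.uniform(0.1, 2.0, 8)
eps_i = rng.uniform(0.1, 2.0, 8)
lhs2 = np.sum(alpha**2 / eps_i**2) * np.sum(beta * eps_i) ** 2
rhs2 = np.sum((alpha * beta) ** (2 / 3)) ** 3
print(lhs2 >= rhs2 - 1e-12)  # True
```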
[01:05:17] So we care about Σ_i α_i²/ε_i² versus Σ_i β_i ε_i — the second is your ε, and the first is your covering. What you can do is say that the first, times the square of the second,

  (Σ_i α_i²/ε_i²) · (Σ_i β_i ε_i)²,

is larger than (Σ_i (α_i β_i)^{2/3})³. This is essentially Hölder's inequality — let me justify it in a moment. [01:05:55] But suppose you believe me on this, and suppose you also believe that the inequality is achievable with equality, which I will also justify in a moment. If equality is achievable, it means there exist ε_i's such that

  Σ_i α_i²/ε_i² = (Σ_i (α_i β_i)^{2/3})³ / (Σ_i β_i ε_i)².

And recall that Σ_i β_i ε_i is your ε and the left-hand side is your log covering number. So you get that you can choose the ε_i's such that the log covering number is at most this quantity over ε² — and that quantity, (Σ_i (α_i β_i)^{2/3})³, is exactly the R you're looking for. [01:07:00] And you don't have to do any of that verification: you basically just have to plug in α and β, and that's it. Does it make sense? So basically you cancel all the ε_i's: you find the best ε_i by proving the best inequality you can, and you also want the inequality to be achievable.

[01:07:21] For example, another situation where this is used: you've probably seen this kind of form where you have a parameter η, you have η + b/η, and you can choose η arbitrarily. How do you do it? Many people tell you to find the minimizing η by taking a gradient; that's fine. But my way to do it is to just prove that η + b/η ≥ 2√b — this is Cauchy–Schwarz, or AM-GM, whatever you call it — and this inequality is achievable: you can attain equality. So the best you can get for this is 2√b. [01:08:08] Basically, if you know the inequality is attainable as an equality, then you know there exists an η such that η + b/η = 2√b.
[01:08:25] Then you get rid of the η and you get the best bound you want. It's the same logic here as well: you prove an inequality so that you can cancel out the parameter you want to choose, and if that inequality can be attained as an equality, then you know you are getting the best parameter — and you don't even necessarily have to compute what it is. [01:08:46] Of course, if you write a paper you probably still want to compute the ε_i's and do the "zero-knowledge proof" — that's why papers show you these kinds of things, even though it takes a lot more argument. But in your mind you probably should do this latter version; at least this is what I do in my head when I do research like this, because it's so fast that you can quickly get an estimate of what the bound can be.

[01:09:21] And in some sense this is useful in many cases, because one way to make your theoretical research faster is to have a lot of modularized small steps, each of which you can do very, very fast. One way I've found that people get stuck is in very messy calculation: if you prove something hard, you have to use a lot of pages — your eventual product is something like a 30-page proof, or sometimes there are 70-page or 100-page proofs. [01:09:59] At least when I do those kinds of proofs, if I change one part of it I never have to redo the 100-page calculation to know the final outcome. After a certain point I already know that, say, these two pages are the most important thing, and I know how that page translates to the final outcome; I have a kind of very fast mental data structure, so that if this part can be improved by a factor of two, I know what that means for my final outcome. That part is already abstract enough that you have this very fast conversion, and then you can iterate very fast. [01:10:47] The flip side — the opposite model — is that you change your proof in one part and you have to redo all the other parts, and that would be much slower. So this is one of those tricks I've realized: if you can do these small abstract things very fast, you can iterate faster in your research.

[01:11:18] Anyway — so far does it make sense? And if you really care about why this inequality is true: I was trying to justify it, and it's true because you can just use Hölder. If you apply Hölder you get something like this — and actually this is exactly like this, because you can choose your a_i³ to map to the first term and your b_i^{3/2} to map to the second; if you just want to verify it, you just have to match terms. [01:11:57] But again, if you have to verify it by matching terms, that's still too slow for me.
memorize other different versions of this holder inequality so [01:12:08] versions of this holder inequality so that I can do it faster I think the [01:12:10] that I can do it faster I think the version I memorized in my mind is that [01:12:13] version I memorized in my mind is that at least one version of the holding [01:12:14] at least one version of the holding recording in my mind the mmrs is this [01:12:18] recording in my mind the mmrs is this which is [01:12:20] which is something like [01:12:22] something like sum of UI Square [01:12:25] sum of UI Square times 1 3 [01:12:27] times 1 3 of the sum of VI [01:12:30] of the sum of VI 2 3 [01:12:31] 2 3 is rather than some UI [01:12:35] is rather than some UI v i 2 3. [01:12:40] something like this which is even closer [01:12:42] something like this which is even closer to here right because in some sense in [01:12:45] to here right because in some sense in sometimes the way you to memorize is [01:12:47] sometimes the way you to memorize is that if you have a bigger component [01:12:49] that if you have a bigger component exponent here too then these two will go [01:12:52] exponent here too then these two will go to the vi [01:12:57] how do I say this so basically you put [01:12:59] how do I say this so basically you put the [01:13:01] the select so why this is UI to the power to [01:13:03] select so why this is UI to the power to three right so why this is UI to the 2 3 [01:13:05] three right so why this is UI to the 2 3 this is because here is you have a [01:13:07] this is because here is you have a square and then you you have a one-third [01:13:09] square and then you you have a one-third outside and the reason why here you have [01:13:12] outside and the reason why here you have VI to the power 2 3 is because you [01:13:14] VI to the power 2 3 is because you inside you have VI the linear term and [01:13:16] inside you have VI the linear term and then outside you have the two thirds so [01:13:18] then outside 
you have the two thirds so so if you know that then you know that [01:13:21] so if you know that then you know that if you have a square here then you can [01:13:23] if you have a square here then you can cancel this Epson y because Epsilon y [01:13:25] cancel this Epson y because Epsilon y will be squared and here you have [01:13:26] will be squared and here you have Epsilon Square so they can cancel each [01:13:29] Epsilon Square so they can cancel each other I'm not sure why this makes any [01:13:31] other I'm not sure why this makes any sense like it it takes some probably [01:13:33] sense like it it takes some probably some practice if you see this enough [01:13:35] some practice if you see this enough times you know what kind of inequalities [01:13:37] times you know what kind of inequalities can prove [01:13:39] can prove um [01:13:40] um anyway I guess I probably should wrap up [01:13:42] anyway I guess I probably should wrap up this discussion it's um [01:13:45] this discussion it's um um any questions [01:13:54] okay I think let's see [01:13:59] 10 minutes [01:14:07] foreign [01:14:20] this inequality because the equality can [01:14:22] this inequality because the equality can be achieved so that's why [01:14:24] be achieved so that's why you know is the best choice oh you need [01:14:26] you know is the best choice oh you need a final one okay yeah maybe let me [01:14:29] a final one okay yeah maybe let me discuss that I'll answer that in the [01:14:31] discuss that I'll answer that in the next 10 minutes [01:14:35] right okay so so basically um now next [01:14:38] right okay so so basically um now next we're gonna do something more [01:14:42] we're gonna do something more um [01:14:43] um better than this so and and actually it [01:14:46] better than this so and and actually it turns out the proof is actually cleaner [01:14:47] turns out the proof is actually cleaner uh to some extent [01:14:50] uh to some extent um because in some sense because it's 
[01:14:52] because in some sense it's capturing the right quantity. So next we're going to have generalization bounds that depend on the actual Lipschitzness. [01:15:14] And I'm going to argue that the Lipschitzness we had before was only an upper bound, right? Before, we had a bound where you have essentially a dominant term times other terms, which are just polynomial in the norm and not very important; and that dominant term is only an upper bound [01:15:47] on the Lipschitzness. [01:15:51] Right, and it's a pretty worst-case upper bound, because if you really want your network to achieve this Lipschitzness, you have to actually construct something that is somewhat special. And even if this worst-case upper bound can be achieved in certain cases, still you
want to find a network which is probably better than that empirically. [01:16:21] So that's why we're going to replace it. Basically, the high-level goal is that we want to replace this product of spectral norms by something that is more accurate, and there are several motivations to do this. [01:16:35] One thing, and this relates to the limitations of this bound, is that this operator norm of W_i has to be larger than one, or you can arguably say even larger than the square root of two, to make sure f(x) is not too small. [01:17:00] Why is this the case? Because if you look at every layer, let's say h_i is the i-th layer, then the 2-norm of h_{i+1} is the 2-norm of this: you apply the next layer. And if you do a heuristic, you say:
[01:17:26] suppose you believe that this ReLU activation kills half of the coordinates, right, so it zeroes out half of the coordinates. Suppose you have that; then it means that after the ReLU, your norm will be reduced by a factor of one over the square root of two, because you killed half of the coordinates. [01:17:49] Of course this is very heuristic; this is just a belief, an assumption. But suppose this is the case. Then you can say that this is at most one over the square root of two, times the operator norm of W_i, times the 2-norm of h_i. [01:18:06] So then you can see that each time, you can only grow the norm of h_i by this factor. So if the operator norm of W_i is less than the square root of two, then you are shrinking the norm over the layers: the norm at every layer will become smaller and smaller, and eventually it will converge to zero, so your output will be very small.
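The "kills half the coordinates" heuristic is easy to check numerically. A minimal sketch (the dimension and seed are my choices, and modeling the pre-activation as an i.i.d. Gaussian vector is an assumption): ReLU zeroes roughly half the coordinates of a symmetric vector, so about half the squared mass survives and the 2-norm shrinks by roughly 1/√2:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4000
# z stands in for the pre-activation W_i @ h_i; assume its coordinates are
# roughly symmetric around zero, as for a random layer.
z = rng.standard_normal(d)
relu_z = np.maximum(z, 0.0)
ratio = np.linalg.norm(relu_z) / np.linalg.norm(z)
print(ratio)  # close to 1/sqrt(2) ~ 0.707: about half the squared mass is kept
```

So, under this heuristic, the per-layer norm growth factor is about ||W_i||_op / √2, matching the inequality above.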
[01:18:30] So that's why you have to make sure that the operator norm of W_i is somewhat big; it cannot be too small. In the most optimistic case, you want the operator norm to be larger than one, but in the more typical case you need it to be even larger than the square root of two, as we argued. [01:18:48] So this means that, in some sense, the product will be big. [01:19:05] And another thing, motivation two, is something I mentioned: this is only a worst-case upper bound on the Lipschitzness, and it's a very worst-case one. [01:19:24] In practice, the Lipschitzness on the data points, the Lipschitzness on x_1 up to x_n, might be better; or the Lipschitzness on x from P, from the
population distribution, could be better. [01:20:00] Right, and this bound doesn't capture that. [01:20:04] And another thing, and it turns out we will discuss this in later lectures, is that SGD prefers flat local minima. This is something widely believed, and in certain cases we can prove it. And a flat local minimum, roughly speaking, and we will justify this in later lectures, is like the Lipschitzness of the model on the empirical data. [01:20:43] So you can see that this is not the worst-case Lipschitzness over all data points; it's the Lipschitzness on the empirical data. Which further justifies that I probably want to have a bound that depends on the Lipschitzness on the empirical data, and not the Lipschitzness in the worst case. [01:21:04] And in some sense, also,
another remark is that it's okay to have a generalization bound that depends on the empirical data. [01:21:16] It's okay to make the generalization bound depend on the empirical data x_1 up to x_n. Because in some sense this is actually nice: suppose the generalization gap is less than some function of the classifier and x_1 up to x_n. This is still useful, because you can still use it as an explicit regularizer. [01:21:57] So there's no problem for our generalization bound to depend on the empirical data. You probably don't want your generalization bound to depend on the population data, because then you don't know how to regularize it anymore; but if it depends on the empirical data, it's fine. [01:22:14] So, basically, concretely, in the next lecture I guess we'll prove that the test error,
L of theta, is less than some function of the Lipschitzness of f_theta on x_1 up to x_n, and the norms of theta, and this function is a polynomial function which doesn't have anything exponential in it. [01:22:57] Okay, I guess I'll stop here. Any questions? [01:23:06] And interestingly, the proof for the next lecture is actually easier than today's. I don't know how you think about today's proof; it's pretty brute-force, so in that sense it's actually not very hard, but it's pretty messy. [01:23:26] Okay, I will see you next week.

================================================================================
LECTURE 011
================================================================================
Stanford CS229M - Lecture 11: All-layer margin
Source: https://www.youtube.com/watch?v=GeXBfyrKfM4
---
Transcript

[00:00:05] So last time we talked about generalization bounds, and today we are going to talk about some better generalization bounds for these networks. [00:00:14] So recall that last time, what we did was
that we showed something like: the Rademacher complexity is bounded by something like this, times a polynomial [00:00:36] of the norms of the weights. [00:00:47] And we said that this comes from a kind of worst-case bound on the Lipschitzness of the model, [00:01:17] worst case over the entire input space. And this is because when we do the covering number argument, we have to use this Lipschitz composition lemma, and there you have to use the Lipschitzness over the entire set. [00:01:41] (This is a little bit distracting in the live lecture, because I'm sharing the screen using my laptop so that I can charge my iPad.) [00:01:50] Okay, so we have discussed a few motivations for us to improve upon this theorem. I guess we discussed four of them; I'll just briefly mention them in words. One of
them is that this bound is exponential [00:02:08] in depth, which is bad because typically you have a lot of layers. And another thing is that this is worst-case Lipschitzness. And another thing is that typically you want to have something like: [00:02:25] SGD prefers Lipschitz models, and that's good, but where the Lipschitzness is on the empirical data. [00:02:41] Because if you think about an algorithm, an algorithm can only do something on the empirical data, right? We'll show this more formally later in the course, but even if you think about it on a high level, the algorithm can only prefer something about the empirical data, not about the entire space. [00:03:02] Right. And also, we said that for a tighter bound we're going to have something data-dependent: something that depends on the Lipschitzness on the empirical data.
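The exponential-in-depth motivation is worth seeing concretely. A tiny sketch (the depth and the per-layer norm value are my choices, not from the lecture): even if every layer has spectral norm only 1.5, the product of spectral norms, which is the worst-case Lipschitz factor appearing in the bound, is already in the thousands at depth 20:

```python
import numpy as np

depth = 20
spectral_norms = np.full(depth, 1.5)    # suppose each ||W_i||_op = 1.5
worst_case_factor = np.prod(spectral_norms)
print(worst_case_factor)  # 1.5 ** 20 ~ 3325.3: exponential growth in depth
```

Combined with the previous observation that each ||W_i||_op typically needs to exceed √2 just to keep the output from vanishing, this product term is essentially forced to blow up with depth.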
[00:03:11] So concretely, what we're going to do today is to show something like this: the generalization of parameter theta is a function [00:03:22] of the Lipschitzness of f_theta on the empirical data x_1 up to x_n, and also the norm of theta, and this function is a polynomial, [00:03:42] so that there is no exponential dependency anywhere. [00:03:48] So that's the goal of this lecture. And we have to define, I'll introduce, some new machinery to achieve this kind of thing. The reason is that this is a different type of bound than what we have done before, because you can see that on the right-hand side you have a function of the training data. [00:04:12] Typically, on the right-hand side, in the so-called, let's call it classical, uniform convergence, [00:04:19] I guess you
know, what uniform convergence really means is slightly debatable, because it depends on how you scope it. But at least in what we have discussed in this lecture, all the bounds are doing something like this. [00:04:31] The bounds before all look like: for every f in some hypothesis class F, [00:04:39] the population loss minus the empirical loss is less than something like a complexity measure of capital F, over the square root of n, something like this, [00:04:49] maybe with high probability. And alternatively, [00:04:58] we can also achieve this kind of thing, I think we implicitly discussed this: for every f, L of f is less than a complexity measure of little f, over the square root of n. So here, in the first type, this is capital F. [00:05:16] The first type is what we get exactly from Rademacher complexity, because you just
apply Rademacher complexity on it, and this is in some sense the right Rademacher complexity. And the second type, you can also get it by doing a little bit more on top of the first type. [00:05:38] You can get the second type, I guess this is a remark, by considering F to be something like all the functions where the complexity [00:05:50] of little f is less than capital C. Think of the complexity as, for example, the norm of the weights: you first define a hypothesis class where the norm of the weights is less than capital C, and then you apply [00:06:04] type one on capital F, on this hypothesis class, and then you take a union bound [00:06:20] over all C. Right, so for every capital C, it defines a hypothesis class, and you can probably write it as F sub C. And for this F sub C, you can do the standard Rademacher complexity, and then you can say
[00:06:35] that I'm going to enumerate over all possible capital C, and then do another layer of union bound on top of it. We never did this formally, but this is just one parameter; you can just discretize it however you want. [00:06:45] So in some sense this is how you get from the type-one bound to the type-two bound. But the thing is that for either of these bounds, the right-hand side doesn't depend on the empirical data; it's always a property either of the model or of the function class. [00:07:02] So the question is, if you want to get something like this, like our goal today, you have to introduce some new techniques. [00:07:15] So our goal is to get something like, I think we call this a data-dependent generalization bound. [00:07:25] This term might be a little bit overused in certain cases.
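In symbols, the two bound types and the discretization trick just described look roughly like this (my transcription; the Comp notation, the dyadic grid over C, and the √n denominator are assumed, not the lecture's exact board):

```latex
% Type 1: complexity of the (fixed) class, from Rademacher complexity directly.
\text{w.h.p. } \forall f \in \mathcal{F}:\quad
  L(f) \;\le\; \hat{L}(f) + \frac{\mathrm{Comp}(\mathcal{F})}{\sqrt{n}}

% Type 2: complexity of the individual function.
\text{w.h.p. } \forall f:\quad
  L(f) \;\le\; \hat{L}(f) + \frac{\mathrm{Comp}(f)}{\sqrt{n}}

% Reduction from Type 1 to Type 2: slice by complexity level, then union bound.
\mathcal{F}_C \;=\; \{\, f : \mathrm{Comp}(f) \le C \,\},\qquad
  \text{apply Type 1 to each } \mathcal{F}_{2^k},\ k = 0,1,2,\dots,
  \text{ and union bound over } k.
```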
[00:07:29] But what I mean here is that you want to have a bound such that, with high probability, for every f, [00:07:36] your population loss is less than some complexity of f and the empirical data. [00:07:48] So the right-hand side is also a random variable that depends on the empirical data. Of course, you're asking for this with high probability, right; you're asking that, with high probability over the choice of the empirical data, this inequality is true for all f. [00:08:05] And this is useful, still useful, in the sense that you can regularize the right-hand side: you can add [00:08:20] the RHS as a regularizer. [00:08:26] So not only is this an explanation in some sense, but it can also be used actively as a regularizer, because the right-hand side is something you can optimize. [00:08:38] So this is kind of the goal that we are trying to achieve.
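To make the "use the RHS as a regularizer" point concrete, here is a minimal sketch (entirely my construction, not from the lecture: the tiny network, the finite-difference Lipschitz estimate, and the weight lambda are all assumptions). It forms a training objective of the form empirical loss plus a data-dependent complexity term evaluated only on the training inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.5 * rng.standard_normal((16, 4))   # toy two-layer ReLU net f_theta
W2 = 0.5 * rng.standard_normal((1, 16))

def f(x):
    return float(W2 @ np.maximum(W1 @ x, 0.0))

def empirical_lipschitz(xs, eps=1e-4):
    # crude data-dependent Lipschitz estimate: max gradient norm over the
    # *training* inputs (finite differences), not a worst case over all x
    grads = []
    for x in xs:
        g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])
        grads.append(np.linalg.norm(g))
    return max(grads)

xs = [rng.standard_normal(4) for _ in range(8)]
ys = rng.standard_normal(8)

mse = np.mean([(f(x) - y) ** 2 for x, y in zip(xs, ys)])
lam = 0.1  # regularization weight (assumed)
objective = mse + lam * empirical_lipschitz(xs)  # RHS-style penalty added to the loss
print(objective)
```

With an autodiff framework one would backpropagate through the penalty as well; here it is only evaluated, to show the shape of the objective.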
[00:08:44] And in some sense, I used to have a little argument about why this is actually the right thing to do. It's kind of cheeky, because these days there's still no consensus on exactly what kind of generalization bound you are looking for. I believe that this is one thing that is good to have, but there could be other forms of generalization bounds. [00:09:04] In some sense, you can argue that this is the best you can achieve, in the sense that you cannot have a stronger one on the right-hand side. Because, for example, you cannot replace this empirical data by the population distribution. Right, if you replace that, then: suppose you allow the complexity measure to depend on the population distribution. Suppose
that I can have complexity [00:09:30] you allow that I can have complexity of F and the population distribution T [00:09:32] of F and the population distribution T then why not just Define this to be LP [00:09:35] then why not just Define this to be LP of f like sorry why not to Define this [00:09:38] of f like sorry why not to Define this to be the population risk what if you [00:09:41] to be the population risk what if you allow this why not just Define to be [00:09:43] allow this why not just Define to be something like tax from p [00:09:46] something like tax from p f x right so the population risk would [00:09:50] f x right so the population risk would be a good complex measure [00:09:52] be a good complex measure then it sometimes you lose the the the [00:09:55] then it sometimes you lose the the the gist here in some sense it becomes too [00:09:57] gist here in some sense it becomes too trivial and in some sense they suggests [00:09:59] trivial and in some sense they suggests that you're cheating in some sense by [00:10:00] that you're cheating in some sense by allowing the complex measure to depend [00:10:03] allowing the complex measure to depend on P so in some sense the kind of the [00:10:04] on P so in some sense the kind of the fundamental question we are facing about [00:10:07] fundamental question we are facing about this we are facing about in the [00:10:09] this we are facing about in the generalization bound is that you don't [00:10:10] generalization bound is that you don't have access to the population [00:10:12] have access to the population distribution you want to have an [00:10:14] distribution you want to have an empirical measure for your complexity so [00:10:17] empirical measure for your complexity so that you can use that for regularization [00:10:20] that you can use that for regularization by the way but you know it this argument [00:10:23] by the way but you know it this argument is kind of like I know a debateful so so [00:10:25] is 
kind of debatable. So for now, we just say that this is one of the reasonable goals. [00:10:30] Okay. And why is doing this challenging? [00:10:34] I think the first thing is that this is challenging because you cannot do the simple reduction as we have done before. [00:10:44] So the reduction between type-one and type-two bounds doesn't work anymore. [00:11:01] For example, let's define capital F to be all the f such that the complexity [00:11:10] of f and X [00:11:19] is, suppose you say, less than C. [00:11:23] Suppose you define this, right; this is your hypothesis class. And let's say we attempt to use the Rademacher complexity of capital F. [00:11:39] What's the issue? Why can't we do this? [00:11:45] The reason is that if your complexity measure depends on the data, then your hypothesis class also depends on the data. Before, when your complexity
[00:11:55] complexity measure didn't depend on the data, your hypothesis class was just a fixed hypothesis class. But now it's a hypothesis class that depends on the data, so capital F is also a random variable depending on the data ("data" here means the empirical data). You might then want to use the Rademacher complexity theorem, but the generalization theorem via Rademacher complexity requires capital F to be a fixed hypothesis class, one that is fixed before you draw the random data. So that's the challenge. [00:12:40] Okay, and how do we address this? In some sense, the high-level way to address it is to redefine things: you have to have a refined way to think about uniform convergence, some refined uniform convergence.
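The obstruction can be restated compactly in the notation used so far (my transcription, not board-verbatim):

```latex
% The data-dependent class
\[
  \widehat{\mathcal{F}}
  \;=\;
  \bigl\{\, f \in \mathcal{F} \;:\; \mathrm{Comp}(f;\, x_1, \dots, x_n) \le C \,\bigr\}
\]
% is a random set (it depends on the sample), while the Rademacher-complexity
% generalization theorem requires a hypothesis class fixed before the data
% are drawn.
```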
[00:13:09] This is not going to be exactly what we do eventually, because what we do eventually will be something very clean, without any of these rough edges, but this is roughly how to think about it. So maybe let's make an assumption: suppose the complexity measure is separable, in the sense that the complexity of f on the empirical examples is of the form sum over i of G(f, x_i); it's really some function of f and x_i, and you take the sum over the examples. In this special case, essentially what we are doing is considering an augmented loss: you can define L-tilde(f) to be something like L(f) times the indicator that its complexity is less than C. So in some sense what you are doing here is changing the loss function in some way so that it's
[00:14:35] easier for you to use the existing bounds. For example, the mental picture I have in mind is something like this: you have a loss function, say the empirical loss, and you have some region, the region where you have low complexity. But this region is a random region, because the definition of low complexity depends on the data. That's why you cannot use uniform convergence only on this low-complexity region: you cannot say "I'm only going to apply my uniform convergence on this region", even though that's your goal, because you cannot apply the Rademacher complexity theory there. So what this augmented loss fundamentally does is change the geometry outside the low-complexity region:
[00:15:38] for example, you just define the new loss function to be zero out there, and to be the same as it was in the low-complexity region. So now we have a globally defined loss function, and the class you are taking the union bound over, the hypothesis class, is still the same; you only changed the loss function. If you do this, then you can hope to apply existing tools to L-tilde(f). L-tilde(f) is a kind of filtering thing that keeps only the low-complexity part, but you don't do the filtering explicitly: technically you are just changing the loss function, that's the only thing you do, but the effect is the same as changing the hypothesis class.
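As a toy illustration (not code from any paper), the augmented loss described above can be sketched like this, where the scalar complexity value and threshold C are hypothetical stand-ins:

```python
# A minimal sketch of the augmented-loss idea from the lecture.
# `augmented_loss` and the threshold C are illustrative names, not from
# any specific paper's code.

def augmented_loss(loss_value, complexity_value, C):
    """L-tilde(f) = L(f) * 1{Complexity(f) <= C}: keep the loss inside
    the low-complexity region, zero it out everywhere else."""
    return loss_value if complexity_value <= C else 0.0
```

The class stays fixed; only the loss changes, which is exactly the point made in the lecture.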
[00:16:40] So I think this was the first attempt, something we did in one of our papers when we tried to address this, and it is actually the fundamental idea in some sense: you change your loss function so that you can deal with different quantities on different regions of the hypothesis class. This is one of the papers we had in, I think, 2019, and we got some results. If you do exactly this indicator thing, where you change the loss like this, you can already get something, but the results are messy. So then, in some sense even more broadly: all this is doing is changing the loss function, so you are trying to have a surrogate loss. And surrogate losses we are not actually unfamiliar with: we have used a surrogate loss in the margin case; it's just that the surrogate loss there is
[00:17:35] the simplest possible surrogate loss. So basically what I'm going to talk about today, in the main part, is the so-called all-layer margin, which is a different kind of, well, a surrogate margin. Once you have this kind of "fake" margin, it in some sense defines a new loss function for you, and once you have this new loss function you can do everything in a super clean way, and then you can apply the existing tools in some sense. [00:18:16] Okay, so that was a sketchy, vague introduction; are there any questions so far? [Student question] Oh sorry, yes: "all-layer margin" is the name of the thing we are going to introduce. We are going to introduce a new margin, which we call the all-layer margin. I probably should define it formally. [00:18:40] Okay, so basically the main
[00:18:45] point I'm making here is that we have now defined a surrogate loss, and the point of the surrogate loss is to change the original loss so that you can focus on the important part of the space. This surrogate loss will be basically boring on the high-complexity part: it's not doing anything there, it's basically zeroing things out, in some sense. So that's the general intuition. [00:19:06] Okay, so now let's see how we do that exactly. We are going to start with a generalization of margin. So let f be a classification model. The typical margin, the classic, standard margin, is just defined as y times f(x), where y is in {+1, -1}; that's what we used before. And now
[00:20:04] I'm going to define a so-called generalized margin. We say g_f(x, y) is a generalized margin if it satisfies the following two properties. First, g_f(x, y) is zero if f classifies (x, y) wrongly (I think I had a typo here, sorry: it should be zero if you classify wrongly), and g_f(x, y) is larger than zero if (x, y) is classified correctly. Let me mark this important typo. [00:21:19] Okay, and you can see that this is trying to imitate the standard margin: the standard margin is bigger than zero if you classify correctly, and otherwise, here, you zero it out. Also, this generalized margin is really only defined for correct classifications; in some sense you can extend it to incorrect classifications just by extending it to zero there.
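As a minimal sanity check, assuming binary labels in {+1, -1}: the standard margin, extended by zero on mistakes, is the simplest function satisfying the two defining properties (the function names here are mine, not from the lecture):

```python
# Hypothetical check that the standard margin, extended by zero on
# mistakes, satisfies the two defining properties of a generalized margin.

def standard_margin(fx, y):
    # classic margin y * f(x) for binary labels y in {+1, -1}
    return y * fx

def generalized_margin(fx, y):
    # g_f(x, y) = y * f(x) if (x, y) is classified correctly, else 0
    m = standard_margin(fx, y)
    return m if m > 0 else 0.0
```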
[00:21:53] And there's another small thing, which is that we have to define the so-called L-infinity covering number. So N_infinity(epsilon, F) is defined to be the minimum cover size with respect to the metric rho, where rho is defined via the L-infinity norm. This is a small technical extension of the L2 covering number; it's not that important in most cases, it just makes the definition cleaner in some cases, and in some cases it makes the proofs a little easier. Basically, you look at the entire input space of f, you look at the difference between f(x) and f'(x), and you take the sup: rho(f, f') = sup_x |f(x) - f'(x)|, which is just the L-infinity norm of f minus f'.
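The metric rho can be sketched numerically; here the supremum over the whole input space is approximated by a maximum over a finite grid `xs`, which is an assumption of this sketch (the true definition takes the sup over all inputs):

```python
# Sketch of rho(f, f') = sup_x |f(x) - f'(x)|, approximated on a finite
# grid `xs`; a crude stand-in for the sup over the whole input space.

def linf_dist(f, f_prime, xs):
    return max(abs(f(x) - f_prime(x)) for x in xs)
```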
[00:23:22] Okay, so given these two, our lemma will be that you can have a theory analogous to the margin theory, where you use this generalized margin and also the L-infinity covering number. (Actually, you can even do it with the standard covering number; it's just easier to state with the L-infinity covering number.) And before doing that, let me also make another remark, which is that this L-infinity covering number is larger than the standard L2 covering number. This is just because it is the more demanding notion: you are demanding that f and f' are close at every possible input, whereas before you were demanding that f and f' are close on the empirical data. In other words, the metric that we used before was smaller than the metric used in the L-infinity case.
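The remark can be recorded as an inequality: since the empirical L2 distance is dominated by the sup distance, any L-infinity cover is also an L2(P_n) cover at the same radius (my transcription):

```latex
% \|f - f'\|_{L_2(P_n)} \le \sup_x |f(x) - f'(x)|, so an L_\infty cover
% is an L_2(P_n) cover at the same radius, giving
\[
  N\bigl(\epsilon,\, \mathcal{F},\, L_2(P_n)\bigr)
  \;\le\;
  N_\infty\bigl(\epsilon,\, \mathcal{F}\bigr).
\]
```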
[00:24:43] Okay, so with this small extension, what we're going to say is that we can actually have an analogous margin theory with the generalized margin. So the lemma is: suppose g_f is a generalized margin, and let capital G be the family of the g_f, where f ranges over capital F. Recall that this is, in some sense, just a slightly more complex version of your model hypothesis class: if you just use y f(x), then g_f would just be y times f(x), and that gives the class G; this is a little more general than that. And suppose, for some R, the log L-infinity covering number of G is less than R^2 over epsilon^2, for any epsilon > 0. So suppose you have this kind of 1 over epsilon^2 decay in the log covering number; recall that this is one of the regimes that
[00:26:12] is good. Actually, it's the worst regime we can tolerate when we do the Rademacher complexity argument. So suppose you have this assumption. Then, with probability larger than 1 minus delta over the randomness of the training data (delta is the failure probability, which will be hidden in the logarithmic factors), for every f in capital F that correctly predicts all the training examples (in margin theory we always consider functions that correctly predict all the examples), the zero-one error is less than O-tilde of 1 over square root of n, times R over the minimum generalized margin, plus O-tilde of 1 over square root of n. (Sorry, there's an R here.) So, to recall: before, what we had here was the standard margin, the minimum margin over the entire data set, and here R was
[00:27:42] the complexity of the model hypothesis class; all the other terms are the same. Now the change is that here you replace the margin by the generalized margin, and R becomes the complexity of the hypothesis class of the generalized margins g_f. And the complexity is measured slightly differently: we are using the covering number, but actually you can also use Rademacher complexity here; it's the same, I'm just stating it this way so that it's easier for the later part. [00:28:19] And this bound is actually not very tight; you can improve it in some ways, but this is the simplest version.
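For reference, the lemma's conclusion as written on the board (my transcription of the spoken statement):

```latex
% With probability at least 1 - \delta, for every f \in \mathcal{F} that
% classifies all n training examples correctly:
\[
  L_{0\text{-}1}(f)
  \;\le\;
  \widetilde{O}\!\left(
    \frac{R}{\sqrt{n}\,\cdot\, \min_{i \in [n]} g_f(x_i, y_i)}
  \right)
  \;+\;
  \widetilde{O}\!\left(\frac{1}{\sqrt{n}}\right),
\]
% assuming \log N_\infty(\epsilon, \mathcal{G}) \le R^2/\epsilon^2 for all
% \epsilon > 0, where \mathcal{G} = \{\, g_f : f \in \mathcal{F} \,\}.
```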
[00:28:42] And the proof of this basically just reuses everything we have done with margin theory; everything seems to transfer exactly. In some sense the proof is just: replace f by g_f in the margin theory. I'll do it step by step, but that's the short version. So technically, what you do is: let's still use the ramp loss. Recall that the ramp loss is the loss function that looks like this [sketch on the board]: this width is gamma, and this height is 1, something like that. And recall that, once we have this ramp loss, we define the surrogate loss L-hat_gamma(theta). Before, we just applied it to the model's margin, but now we use the generalized margin: before, the argument here was just f_theta, but now it becomes g_{f_theta}. And we can also define the surrogate population loss, which is just the expectation of the empirical loss.
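A sketch of the ramp loss and the empirical surrogate loss as described (function names are mine): the loss is 1 for non-positive margin, decreases linearly to 0 at margin gamma, and is 0 beyond that.

```python
# Ramp loss l_gamma as drawn in the lecture: width gamma, height 1.

def ramp_loss(margin, gamma):
    if margin <= 0:
        return 1.0
    if margin >= gamma:
        return 0.0
    return 1.0 - margin / gamma

def empirical_surrogate_loss(margins, gamma):
    # L-hat_gamma: average ramp loss over the (generalized) margins
    # g_{f_theta}(x_i, y_i) of the training examples
    return sum(ramp_loss(m, gamma) for m in margins) / len(margins)
```

Note that if every training margin is at least gamma, the empirical surrogate loss is exactly zero, which is what the final bound below uses.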
[00:30:11] Okay. And before, what we did is use the Rademacher complexity to control the difference between these two loss functions: we showed that L_gamma(theta) minus L-hat_gamma(theta) is less than the empirical Rademacher complexity. Before, it was the empirical Rademacher complexity of l_gamma composed with F, and now it's l_gamma composed with G, because the function class is different, plus O-tilde of 1 over square root of n. [00:30:54] [Brief interruption while the laptop charger and plug are sorted out.] [00:31:29] Okay, good, thanks. Cool. So now we have to use the Rademacher complexity bound, and the Rademacher complexity is less than the covering
[00:31:50] number, right. So let's still do that, let's bound the covering number. Some preparation: we assumed the L-infinity covering number bound, but actually, okay, let's see. The standard covering number of l_gamma composed with G, in the L2(P_n) metric, is less than the standard covering number of G at radius gamma times epsilon; you remove the l_gamma by using the Lipschitzness of l_gamma. It's actually 1-over-gamma Lipschitz, so this step uses the Lipschitzness in the covering number. And next, you say this is also bounded by the L-infinity version. And for the L-infinity version we have the assumption: for every epsilon, this is less than R^2 over (epsilon^2 gamma^2). The last step is by assumption.
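The chain of inequalities just sketched, in one line (my transcription):

```latex
% Step 1: \ell_\gamma is (1/\gamma)-Lipschitz, so a (\gamma\epsilon)-cover
% of \mathcal{G} yields an \epsilon-cover of \ell_\gamma \circ \mathcal{G}.
% Step 2: an L_\infty cover is also an L_2(P_n) cover.
% Step 3: the assumed bound on the log L_\infty covering number.
\[
  \log N\bigl(\epsilon,\, \ell_\gamma \circ \mathcal{G},\, L_2(P_n)\bigr)
  \;\le\;
  \log N\bigl(\gamma\epsilon,\, \mathcal{G},\, L_2(P_n)\bigr)
  \;\le\;
  \log N_\infty\bigl(\gamma\epsilon,\, \mathcal{G}\bigr)
  \;\le\;
  \frac{R^2}{\gamma^2 \epsilon^2}.
\]
```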
[00:33:18] Okay, so you can see that actually, even if you assume something about the L2 covering number of G directly, it's also fine; you don't have to literally use the L-infinity norm. Okay. And then, because this log covering number is less than this, and we have this kind of translation, so that if you translate a log covering number bound into Rademacher complexity, you get that R_S of l_gamma composed with G is less than O-tilde of R over (gamma times square root of n). This is by chaining, Dudley's theorem and its consequences, because we have discussed which covering number decay implies which Rademacher complexity. [00:34:09] Okay, so then, same as before, I guess: take gamma to be gamma_min, which is the min over i of g_f(x_i, y_i).
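Plugging this entropy bound into the chaining (Dudley) machinery discussed earlier gives, up to logarithmic factors:

```latex
% Dudley's entropy-integral bound with
% \log N(\epsilon) \le R^2/(\gamma^2\epsilon^2) yields
\[
  R_S\bigl(\ell_\gamma \circ \mathcal{G}\bigr)
  \;\le\;
  \widetilde{O}\!\left(\frac{R}{\gamma\sqrt{n}}\right).
\]
```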
[00:34:35] there this step is not form this some some there's some caveat here [00:34:38] some some there's some caveat here because gamma is a random variable you [00:34:40] because gamma is a random variable you have to do Union Bond eventually [00:34:49] but let me not get into it I guess we [00:34:51] but let me not get into it I guess we had this issue before as well [00:34:54] had this issue before as well um but it's only one number you can [00:34:56] um but it's only one number you can discretize and doing over gamma but [00:34:58] discretize and doing over gamma but suppose let's say we just take down what [00:34:59] suppose let's say we just take down what we Gamma mean and then [00:35:02] we Gamma mean and then foreign [00:35:05] so then you got L 0 1 Theta then zero [00:35:09] so then you got L 0 1 Theta then zero plus O2 off [00:35:11] plus O2 off R over square root n times gamma mean [00:35:15] R over square root n times gamma mean plus some altitude of Y squared [00:35:20] okay [00:35:22] so this proof is not 100 formal just [00:35:25] so this proof is not 100 formal just because the technical I'm not allowed to [00:35:27] because the technical I'm not allowed to take gamma to be anything that depends [00:35:29] take gamma to be anything that depends on the data right so I have to really [00:35:31] on the data right so I have to really show it for every gamma [00:35:33] show it for every gamma um and that requires another inbound [00:35:35] um and that requires another inbound overcome [00:35:37] overcome foreign [00:35:44] so maybe let's let's see what we have [00:35:46] so maybe let's let's see what we have achieved with this level right what we [00:35:47] achieved with this level right what we achieve this this Lemma is that now if [00:35:49] achieve this this Lemma is that now if you define your [00:35:51] you define your basically you can fold everything you [00:35:52] basically you can fold everything you can try to fold everything in this 
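The chain of bounds used in this step can be reconstructed in display form; this is a sketch under the assumption (consistent with the discussion above) that the log covering number bound has the form log N(ε, G) ≤ R/ε², with constants and logarithmic factors suppressed:

```latex
% Assumed covering bound:  \log N(\epsilon, G) \le R/\epsilon^2.
% Dudley's chaining then bounds the Rademacher complexity of the
% \gamma-margin loss composed with G:
\[
  \mathrm{Rad}_n(\ell_\gamma \circ G)
  \;\le\; \widetilde{O}\!\left(\frac{\sqrt{R}}{\gamma\sqrt{n}}\right).
\]
% Evaluating the standard Rademacher generalization bound at
% \gamma = \gamma_{\min} = \min_i g_F(x_i, y_i) (so the empirical margin
% loss vanishes) gives
\[
  L_{0\text{-}1}(\theta)
  \;\le\; 0
  \;+\; \widetilde{O}\!\left(\frac{\sqrt{R}}{\gamma_{\min}\sqrt{n}}\right)
  \;+\; O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right).
\]
```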
[00:35:54] This generalized margin is, in some sense, a way to twist your model output: you can stretch the model output for certain f and squeeze it for certain other f. And in some sense this is what we will actually do — you stretch the function according to where you are. So basically everything is folded into this generalized margin, and the question now is: for what g_F can you bound the covering number of G, and hence the generalization error? And you also want this g_F to be something meaningful.

[00:36:51] Suppose you just take g_F to be the standard one, y·f(x). Then the covering number of this G will be the same as the covering number of F, and the Rademacher complexity will be something like a covering number bound that depends on the product of the layer norms. But we are trying to do better than this.

[00:37:21] Okay, so how do we do this? Now we define this so-called all-layer margin. This is a special instance of g_F — a concrete definition of g_F for which we can bound the Rademacher complexity, or the covering number. To define this all-layer margin, this generalized margin, we have to introduce some notation: we will consider a perturbed model. Actually, I think it is useful to have some motivation before I define it — I forgot to add this. Our motivation is the following: think about the linear model, where the margin is defined to be the standard, normalized margin.
[00:38:27] The normalized margin is defined to be something like y times f(x) over the norm: say your model is f(x) = wᵀx; then your margin is defined to be y times the model output over the two-norm of w, that is, y·wᵀx / ‖w‖₂. This is the normalized margin, which is the quantity that governs the generalization performance. And the question is how you normalize: if you have a deep model, you can try to normalize by something — maybe the product of the Lipschitz constants of the layers of the network, or maybe something else. That is the natural attempt, and in some sense all the previous work is doing this: you are normalizing the margin based on the worst-case Lipschitzness.

[00:39:28] So what we do is different: we don't want to normalize by a constant that depends only on the function class. We take a different approach — we reinterpret the standard margin as something else. Our interpretation is that you can view the margin as

[00:39:57] min ‖δ‖₂ such that y · wᵀ(x + δ) < 0.

So you are trying to find the minimum perturbation of your data point such that, after perturbing, you cross the boundary. Intuitively this is right, because the margin is the distance to the boundary, so it is the same thing as how much you have to perturb the point to cross the boundary. This is the perspective we take to generalize the margin to deep models.
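This reinterpretation can be checked numerically. The sketch below (my own, not from the lecture) verifies that for a linear model the normalized margin y·wᵀx/‖w‖₂ equals the length of the smallest perturbation δ with y·wᵀ(x+δ) ≤ 0:

```python
import numpy as np

# For f(x) = w.x, the closest point of the decision boundary {z : w.z = 0}
# is reached by moving straight along -y*w, so the minimal perturbation
# has length exactly y*(w.x)/||w||, the normalized margin.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 1.0 if w @ x > 0 else -1.0          # label the point on its own side

margin = y * (w @ x) / np.linalg.norm(w)

# Optimal perturbation: project x onto the boundary.
delta = -y * w * (y * (w @ x)) / (w @ w)

print(margin, np.linalg.norm(delta))     # the two quantities coincide
```

Moving any shorter distance, in any direction, leaves y·wᵀ(x+δ) positive; that is the geometric content of the claim.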
[00:40:52] If you do all the exact math, there is some small caveat — maybe it doesn't match exactly — but this is still the rough intuition. So how do we do this exactly? For deep models we still take this perturbation-based perspective, but it turns out we have to perturb all the layers, not only the input. The first attempt we tried was to perturb just the input: you try to see what the smallest perturbation of the input is such that you change the decision of your model. But that just technically doesn't work — it doesn't seem to capture the fundamental complexity. So we have to consider this perturbed model that perturbs all the layers.
[00:41:47] So what we do is: we have a perturbation δ, which is a sequence of perturbations δ₁ up to δ_r, and each δᵢ is a vector. The way you perturb is the following — and you have to work out the normalization in the right way. You first perturb the first layer: the first layer used to be W₁x in the deep net, and you perturb it by adding δ₁, which is a vector, times the two-norm of x. Then you perturb the second layer. How do you perturb the second layer? You first apply W₂ to the perturbed version of the first layer, and then you perturb it further with δ₂. And what is the scaling in front of δ₂? δ₂ is a vector, and the scaling is the norm of the perturbed first layer.

[00:42:53] How exactly to design this perturbation is a little bit tricky — we tried various versions in our research, and it turns out this one makes everything work out nicely. You can do this for multiple layers, and eventually you have h_r, the r-th perturbed layer: you first apply the matrix multiplication and the nonlinearity to your previous perturbed layer, and then you perturb it by the vector δ_r scaled by the norm of the previous layer, so h_r(δ) = σ(W_r h_{r−1}(δ)) + δ_r · ‖h_{r−1}(δ)‖₂.

[00:43:32] After you define this perturbation, you can ask: what is the smallest perturbation that changes my decision? That is the definition of the all-layer margin, which we call m_F(x, y): it is defined to be the minimum perturbation, where you measure the size of the perturbation by the two-norms of the per-layer perturbations combined, √(Σᵢ ‖δᵢ‖₂²).
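A minimal NumPy sketch of this perturbed forward pass; the function name and the exact placement of the ReLU (hidden layers only, linear output) are my assumptions, since the spoken description leaves the nonlinearity placement informal:

```python
import numpy as np

def perturbed_forward(Ws, x, deltas):
    """All-layer perturbed pass (sketch):
    h_i(d) = relu(W_i @ h_{i-1}(d)) + d_i * ||h_{i-1}(d)||_2,
    with no ReLU on the final layer."""
    h = x
    for i, (W, d) in enumerate(zip(Ws, deltas)):
        pre = W @ h
        if i < len(Ws) - 1:                   # hidden layers get the nonlinearity
            pre = np.maximum(pre, 0.0)
        h = pre + d * np.linalg.norm(h)       # perturbation scaled by previous layer's norm
    return h

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
x = rng.normal(size=3)

# With delta = 0 the perturbed pass reduces to the ordinary forward pass.
unperturbed = perturbed_forward(Ws, x, [np.zeros(4), np.zeros(1)])
```

The all-layer margin is then the smallest √(Σᵢ‖δᵢ‖₂²) for which the sign of the output flips; computing it exactly requires solving that minimization, which this sketch does not attempt.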
[00:44:01] And your constraint is that, after perturbing — let's call the output of the perturbed model f(x, δ) — you have f(x, δ) · y < 0, an incorrect prediction. You can also do this for multi-class labels, but it is essentially the same, so I am doing binary labels.

[00:44:28] Okay, so this is the definition of the all-layer margin. You can see that the definition becomes much more complicated, but then the proof will be easy. And I guess you can also interpret this intuitively: m_F(x, y) is, in some sense, big if it is hard to perturb — if it is hard to change the decision of the network. And how could it be hard? I think there are two ways to make it hard to perturb. One is that the model f is very Lipschitz — its Lipschitz constant is small — so you have to perturb a lot to make a big change in your model output. Another possibility is that your f(x) is just large — in some sense, your standard margin is large. If the standard margin is large, you also have to change a lot, because before, you were outputting something positive where f(x) is very big, and now you have to move it to the other side of the boundary, so you have to perturb a lot. Or maybe I should say: y·f(x) is large. I typically talk about the case y = 1, so a positive output means you are very confident about your prediction, and if you are very confident, then you have to perturb a lot before the model changes its mind.
[00:46:18] And here, technically, this is Lipschitzness in the intermediate variables — the intermediate layers — because you are measuring how robust the model is to perturbations, and the perturbations are applied to the intermediate layers. But it turns out that Lipschitzness in the intermediate layers is actually close to Lipschitzness with respect to the parameters; I will discuss that in a moment.

[00:46:49] Okay, so once you have all of this, you have the following theorem. This is saying that, with high probability, L₀₋₁(f) — the zero-one error of f — is less than Õ of the following: you have a 1/√n factor, then the sum over layers of ‖Wᵢ‖₍₁,₁₎ — the so-called (1,1)-norm of W, which I am going to define in a moment — divided by minᵢ m_F(xᵢ, yᵢ), plus a lower-order Õ term. Here ‖W‖₍₁,₁₎ is the sum of the absolute values of the entries of W.

[00:48:00] I guess, in some sense, we are in a regime where anything polynomial in the norm doesn't really matter — doesn't matter that much — so you can just consider that a polynomial factor. Of course, you can also ask whether this (1,1)-norm is the right choice of norm; in some sense it is not the best norm we can hope for, so there is still some room for improvement here. But suppose you ignore anything polynomial in the norm. Then what is important here is the all-layer margin: basically, this is saying that if the all-layer margin is always big, then your generalization is good, and if the all-layer margin is small, then your generalization is bad. And the all-layer margin is about your robustness to perturbations of the intermediate layers.
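The theorem as spoken can be reconstructed in display form; the exact shape of the lower-order term is my guess from the audio:

```latex
% With probability at least 1-\delta over n i.i.d. training examples,
\[
  L_{0\text{-}1}(f)
  \;\le\; \widetilde{O}\!\left(
    \frac{1}{\sqrt{n}} \cdot
    \frac{\sum_{i=1}^{r} \|W_i\|_{1,1}}
         {\min_{j \le n} m_F(x_j, y_j)}
  \right)
  \;+\; \text{lower-order terms},
  \qquad
  \|W\|_{1,1} \;=\; \sum_{j,k} |W_{jk}|.
\]
```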
[00:48:51] So this is saying that if you are robust to perturbations in the intermediate layers, then that implies you have good generalization.

[00:49:17] You can also compare this with the bound that we got before, and you can pretty much argue that this is strictly better than before. I will discuss the comparison with the prior results later, when I make remarks about this theorem, but you can show that this is better than the previous one, mostly because of this m_F(x, y). Roughly speaking, in the worst case, 1/m_F(x, y) is smaller than the Lipschitz constant divided by y·f(x), because to flip the decision you have to change your output from f(x) to zero, and the Lipschitzness controls how much the output moves under a perturbation, so you have to make a big movement to change f(x) from something positive (or negative) to zero. (Wait — my bad, sorry; I wrote the ratio the wrong way around at first; it should be this.)

[00:50:58] And that is why this is better than the previous bound: the previous bound didn't consider the different Lipschitzness at different data points, but here you are really saying that if you are Lipschitz at the data points you have seen, then you can generalize well.

[00:51:17] But maybe let me discuss this more later — let me have a more thorough discussion about this later. I just wanted to show a little bit of it now, so that you don't feel this is a vacuous bound.
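The comparison just sketched can be written out explicitly; here κ(x) denotes the Lipschitz constant of the perturbation-to-output map at x (my notation, reconstructed from the spoken argument):

```latex
% A perturbation of combined size \|\delta\| moves the output by at most
% \kappa(x)\,\|\delta\|, and flipping the decision requires moving the output
% from f(x) across 0, hence
\[
  m_F(x, y) \;\ge\; \frac{y\, f(x)}{\kappa(x)},
  \qquad\text{equivalently}\qquad
  \frac{1}{m_F(x, y)} \;\le\; \frac{\kappa(x)}{y\, f(x)}.
\]
% Substituting the right-hand side into the theorem recovers a bound of the
% worst-case normalized-margin type, except that \kappa is evaluated at the
% observed data points rather than over the whole input space.
```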
But bear with me and just assume this is useful, and then we can discuss all the interpretations. Any questions so far?

[00:51:58] Okay, okay — so we have 30 minutes. Let's just dive into the proof. The proof requires a few small steps. First of all, it suffices to bound the log covering number of this class by this quantity. By the way, I think I have some typos here: I think this should be this, and this should be this — I'll double-check it later. Because it is always a polynomial factor, I didn't really pay too much attention, but I think this is a typo, so I think you only have to show this. Sorry — I got a clarifying question about this — I don't know exactly where the square is applied, inside or outside.
have to show some Bond like this right so let's assume this is [00:53:28] like this right so let's assume this is the correct amount and then you [00:53:29] the correct amount and then you basically have to show something like [00:53:30] basically have to show something like this [00:53:31] this um because if you have this then you can [00:53:33] um because if you have this then you can use dilemma before and then on the [00:53:35] use dilemma before and then on the challenge margin you get this random [00:53:37] challenge margin you get this random marker this standardization ball all [00:53:39] marker this standardization ball all right so essentially we just have to [00:53:40] right so essentially we just have to bounce the cover number of t and it [00:53:43] bounce the cover number of t and it turns out that the carbon number of G [00:53:45] turns out that the carbon number of G you have uh this very nice decomposition [00:53:48] you have uh this very nice decomposition name on so let's say let f i [00:53:51] name on so let's say let f i Define each layer [00:53:54] Define each layer the hypothesis cost for every layer [00:53:58] the hypothesis cost for every layer right and we also constrained that wi [00:54:01] right and we also constrained that wi one comma one Norm is less than beta I [00:54:05] one comma one Norm is less than beta I okay so then your your if is really f r [00:54:09] okay so then your your if is really f r composed with f r minus 1 up to F1 this [00:54:13] composed with f r minus 1 up to F1 this is the notation we have used and recall [00:54:15] is the notation we have used and recall that we had a [00:54:16] that we had a kind of a decomposition limit before [00:54:18] kind of a decomposition limit before which was kind of complicated right so [00:54:20] which was kind of complicated right so you have all of these dependencies and [00:54:22] you have all of these dependencies and how the error propagates but now dilemma [00:54:24] how the 
error propagates, but now the lemma is pretty simple. [00:54:32] Cool, so let M∘F denote the family of all-layer margins of functions in F. [00:54:51] Then the log of the ℓ∞ covering number of M∘F, where the radius is simply a quadratic average of the radii for each layer, is bounded by the sum of the per-layer log covering numbers:

log N_∞( sqrt(ε_1² + … + ε_r²), M∘F ) ≤ Σ_{i=1}^r log N_∞(ε_i, F_i).

[00:55:20] So in some sense this is saying that you only have to deal with the covering number for every layer, and then you get a covering number for the composed function class. But you don't get the covering number of the composed function class exactly — you get the covering number of the all-layer margin of the composed function class. [00:55:42] And here N_∞(ε_i, F_i) is defined with respect to the input domain, which is the unit ℓ2-norm ball {x : ||x||₂ ≤ 1}. [00:56:08] And one of the important things is that this is an M here — this is the all-layer margin.

[00:56:33] A corollary is that if for each layer you can bound the log covering number by something like c_i² / ε_i² — let's use the letter c here — then taking

ε_i = ε · c_i / sqrt( Σ_j c_j² ),

we have that the log of the ℓ∞ covering number of the composed model at radius ε is at most (up to a factor depending on r, which we'll ignore)

log N_∞(ε, M∘F) ≲ ( Σ_{i=1}^r c_i² ) / ε².

[00:57:25] Which means that, if you believe c_i is a complexity measure for each of the layers, then you get the complexity for the all-layer margin of the composed model: the complexity will be just the sum of the c_i².
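As a quick numerical sanity check of the corollary's choice of radii (the c_i values below are made up for illustration): with ε_i = ε·c_i / √(Σ_j c_j²), the combined radius √(Σ ε_i²) comes out to exactly ε, and each per-layer term c_i²/ε_i² collapses to (Σ_j c_j²)/ε².

```python
import math

# Hypothetical per-layer complexity measures c_i and a target radius eps.
cs = [1.5, 2.0, 0.5]
eps = 0.1

S = sum(c * c for c in cs)
eps_i = [eps * c / math.sqrt(S) for c in cs]

# The quadratic average of the per-layer radii is exactly eps ...
assert math.isclose(math.sqrt(sum(e * e for e in eps_i)), eps)

# ... and each per-layer bound c_i^2 / eps_i^2 equals (sum_j c_j^2) / eps^2,
# so summing over the r layers gives r * (sum_j c_j^2) / eps^2.
for c, e in zip(cs, eps_i):
    assert math.isclose(c * c / (e * e), S / (eps * eps))
```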
[00:57:45] I thought I had an error here, but I think this is indeed correct — sorry. [00:58:00] Okay, cool. And c_i will be something like the ||W_i||_{1,1} norm, and that's how you go through all of these things. [00:58:16] So basically we can show — and this is all for one layer, so you can basically invoke a theorem for linear models — that indeed, for linear models, you get something like log N_∞(ε, F_i) ≲ β_i² / ε², where β_i is the bound on the (1,1)-norm of W_i. And this will imply the main theorem.

[00:58:59] Okay, so I hope I've convinced you that as long as you prove this decomposition lemma, you are done: for the right-hand side you invoke something about linear models, you plug it into the lemma to get the covering number bounds for the all-layer margin, and then you get the original theorem using the lemma I showed before about generalized margins. [00:59:30] Any questions so far?

[00:59:57] So, for the concrete f_i — which is z ↦ σ(W_i z) — you can state the lemma in a more general form, and you can prove it in a more general form, but I'm only going to prove it for this particular family of f_i. [01:00:15] And there are two steps. Step one: we show that M_F(x, y) is 1-Lipschitz in F, and what 1-Lipschitz in F means is the following: for every F and F',

M_F(x, y) − M_{F'}(x, y) ≤ sqrt( Σ_{i=1}^r max_{||x||₂ ≤ 1} ||f_i(x) − f_i'(x)||₂² ),

where F = f_r ∘ f_{r−1} ∘ … ∘ f_1 and F' = f_r' ∘ f_{r−1}' ∘ … ∘ f_1'. [01:01:34] So basically the Lipschitzness of this all-layer margin is something that has no extra scale, in some sense, because you are looking at the raw differences with no scale factor, and it only depends on — it's basically the sum of the differences between the f_i and the f_i'. There's no multiplier here: you are not multiplying by the Lipschitz constant of F. So it's really literally this — it's very clean. [01:02:06] We'll prove step one in a moment, but suppose you have step one; then what you can get is that in step two you can use step one to get the theorem
[01:02:18] relatively easily. What you do is construct a cover, and the construction I'll also show you. You let U_1, …, U_r be ε_1-, …, ε_r-covers of F_1, …, F_r, respectively. [01:02:47] And recall, if you still remember, that last time the covering construction was very complicated: you iteratively construct covers. But now we just individually construct covers for every F_i, [01:03:00] and we choose U_i such that |U_i| = N_∞(ε_i, F_i), the ℓ∞-norm covering number. [01:03:17] So by definition, for every f_i in (capital) F_i there exists some function u_i in (capital) U_i such that f_i − u_i is small — and we are using this ℓ∞-norm metric:

max_{||x||₂ ≤ 1} ||f_i(x) − u_i(x)||₂ ≤ ε_i,

which is true by definition. [01:03:55] And now we're going to turn this into a cover for the composed family. The cover is: we just take U to be the family of compositions of all of these,

U = { u_r ∘ u_{r−1} ∘ … ∘ u_1 : u_i ∈ U_i },

and this will be shown to be our cover for M∘F. [01:04:36] Why is that the case? Suppose we are given F = f_r ∘ … ∘ f_1 in (capital) F. Let u_r, …, u_1 be the nearest neighbors of f_r, …, f_1 in the respective covers, and let u = u_r ∘ u_{r−1} ∘ … ∘ u_1. [01:05:09] Then, as you can see, using this Lipschitzness — using our step one — M_F minus M_u is less than the square root of the sum over layers of the worst-case squared difference between f_i and u_i
over the unit-norm ball. [01:05:45] And because the f_i and u_i are close — that's how we constructed the cover — we get that this is at most

sqrt( Σ_{i=1}^r ε_i² ).

[01:06:04] Okay. So basically, once you have such a nice Lipschitzness property, you can just cover everything individually, and you don't have to think too much about the composition — the composition is dealt with by step one. [01:06:23] So now, why does the Lipschitzness hold? Let's prove step one. [01:06:32] We only prove the upper bound on one side; by symmetry — because F and F' play the same role — you only have to prove one side, and you can flip them to get the other side. [01:06:51] And the way to prove it: each of the two margins is defined by some optimization problem, and they are the optimal values of those optimization problems.
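The composed cover U above is the Cartesian product of the per-layer covers, so its log-size is the sum of the per-layer log-sizes — which is exactly where the additivity of the log covering numbers comes from. A tiny counting sketch (the cover elements here are stand-in labels, not actual functions):

```python
import itertools
import math

# Hypothetical per-layer cover sizes |U_1|, ..., |U_r|.
layer_cover_sizes = [4, 8, 3]
covers = [[f"u{i}_{j}" for j in range(n)]
          for i, n in enumerate(layer_cover_sizes)]

# The cover of the composed class is all compositions u_r ∘ ... ∘ u_1,
# i.e. the Cartesian product of the per-layer covers.
composed = list(itertools.product(*covers))

assert len(composed) == math.prod(layer_cover_sizes)

# So log |U| = sum_i log |U_i|: log covering numbers add across layers.
assert math.isclose(math.log(len(composed)),
                    sum(math.log(n) for n in layer_cover_sizes))
```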
[01:07:03] Basically, you're trying to show that the two optimization problems are doing similar stuff. And how do you do that? Typically, you take the optimal solution of one optimization problem and turn it into a feasible solution of the other optimization problem — that's how you relate two optimization problems. [01:07:17] So let δ_1*, …, δ_r* be the optimal choice of the δ's in defining M_F(x, y). [01:07:43] Our goal is to turn this into δ̂_1, …, δ̂_r, a feasible solution for M_{F'}(x, y). Because if you have a feasible solution, you get

M_{F'}(x, y) ≤ sqrt( Σ_i ||δ̂_i||₂² ),

and then you can relate this to the δ_i* using the construction. [01:08:22] Okay, so that's the rough idea. And how do we construct the δ̂_1, …, δ̂_r? [01:08:28] We want them to be feasible — that's the key part. So basically, the way we do it is that we want to make the perturbation δ̂_1, …, δ̂_r on F' do the same thing as the perturbation δ_1*, …, δ_r* on F. [01:09:09] Then we know it will be a feasible solution, because what is feasibility? Feasibility is about whether the perturbation flips the prediction to the other side. So if one perturbation can flip the prediction to the other side, then the other one also flips the prediction, because they are doing the same thing. That's the principle.
[01:09:30] And how do you do that? It's pretty much just algebra. F has parameters W_1, …, W_r, and F' has parameters W_1', …, W_r'. [01:09:47] Let's consider the computation. In the computation defining M_F(x, y) we have

h_1 = σ(W_1 x) + δ_1* ||x||₂,
h_2 = σ(W_2 h_1) + δ_2* ||h_1||₂,

and so forth, up to

h_r = σ(W_r h_{r−1}) + δ_r* ||h_{r−1}||₂.

[01:10:29] And I want to imitate this computation by perturbing F' in some way. How do we imitate it? The imitation is kind of trivial. [01:10:44] For F', what happens is h_1' = σ(W_1' x) + (something) · ||x||₂. And suppose you just picked δ_1* — that wouldn't help, because W_1' is different from W_1. [01:11:01] You have to perturb by something in addition to make this computation the same as before, and the way to do that is to perturb, in addition, by (σ(W_1 x) − σ(W_1' x)) / ||x||₂; then these two computations are literally exactly the same. So you declare this to be your new perturbation:

δ̂_1 = δ_1* + (σ(W_1 x) − σ(W_1' x)) / ||x||₂.

[01:11:26] So basically, you compensate for the difference between W_1 and W_1' by adding this additional perturbation. [01:11:44] And you do the same thing for every layer. For the next layer, you want h_2' to be equal to the same h_2 as above, but you are only allowed to perturb based on W_2', not on W_2. [01:12:03] So you first apply the original perturbation, the one that gives δ_2* ||h_1||₂, and then you compensate for the difference by perturbing even more. You declare this entire thing to be δ̂_2 ||h_1||₂, which means

δ̂_2 = δ_2* + (σ(W_2 h_1) − σ(W_2' h_1)) / ||h_1||₂.

[01:12:43] You do the same thing for every layer, and in general you take

δ̂_i = δ_i* + (σ(W_i h_{i−1}) − σ(W_i' h_{i−1})) / ||h_{i−1}||₂.

[01:13:06] And now we've reached our goal: basically, δ̂_1, …, δ̂_r on F' are doing the same thing as δ_1*, …, δ_r* on F. (I'm using this shorthand to save some time on writing.)
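The compensation algebra above can be sanity-checked numerically. This is a toy sketch (random ReLU layers and made-up dimensions, not anything from the lecture): running the perturbed forward pass of F' with the δ̂_i reproduces, layer by layer, the perturbed forward pass of F with the δ_i*.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy 3-layer nets F (weights Ws) and F' (weights Wps); sizes are arbitrary.
dims = [4, 5, 5, 3]
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
Wps = [W + 0.1 * rng.normal(size=W.shape) for W in Ws]
x = rng.normal(size=dims[0])

def forward(weights, deltas, x):
    """Perturbed forward pass: h_i = relu(W_i h_{i-1}) + delta_i * ||h_{i-1}||_2."""
    h, hs = x, []
    for W, d in zip(weights, deltas):
        h = relu(W @ h) + d * np.linalg.norm(h)
        hs.append(h)
    return hs

# An arbitrary perturbation delta_star applied to F.
delta_star = [0.05 * rng.normal(size=dims[i + 1]) for i in range(3)]
hs = forward(Ws, delta_star, x)

# Compensating perturbation for F':
# delta_hat_i = delta_star_i + (relu(W_i h_{i-1}) - relu(W'_i h_{i-1})) / ||h_{i-1}||_2
prev = [x] + hs[:-1]
delta_hat = [d + (relu(W @ h) - relu(Wp @ h)) / np.linalg.norm(h)
             for d, W, Wp, h in zip(delta_star, Ws, Wps, prev)]

# The perturbed pass of F' now matches the perturbed pass of F, layer by layer.
hps = forward(Wps, delta_hat, x)
assert all(np.allclose(a, b) for a, b in zip(hs, hps))
```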
[01:13:29] So the δ̂_1, …, δ̂_r applied to F' produce exactly the same functionality — the same prediction — as the δ_i* applied to F. That means this is a feasible solution for M_{F'}, and that's why

M_{F'}(x, y) ≤ sqrt( Σ_i ||δ̂_i||₂² ).

[01:13:55] And now I'm going to bound this by the square root of the sum of the ||δ_i*||₂² plus the square root of the sum of the squared differences between them:

sqrt( Σ_i ||δ̂_i||₂² ) ≤ sqrt( Σ_i ||δ_i*||₂² ) + sqrt( Σ_i ||δ̂_i − δ_i*||₂² ).

[01:14:17] This is using the so-called — I always think of it as Cauchy–Schwarz, but I think the technical name is the Minkowski inequality. [01:14:39] What it says is the following:

sqrt( Σ_i ||a_i + b_i||₂² ) ≤ sqrt( Σ_i ||a_i||₂² ) + sqrt( Σ_i ||b_i||₂² ),

and you can actually prove this inequality by Cauchy–Schwarz: you take the square on both sides, cancel a bunch of terms, and it becomes Cauchy–Schwarz. [01:15:13] So we apply this where a_i is δ_i* and b_i is the difference term. (I think five percent of battery is enough for me — about one minute per percent.) [01:15:38] Okay, so now let's see. The first term is M_F(x, y), and the other one you can bound by

sqrt( Σ_{i=1}^r max_{||x||₂ ≤ 1} ||σ(W_i x) − σ(W_i' x)||₂² ),

just because this whole thing is homogeneous, so dividing by the two-norm
is the same as restricting the norm to be one. [01:16:23] Right. So then this is equal to

M_F(x, y) + sqrt( Σ_{i=1}^r max_{||x||₂ ≤ 1} ||f_i(x) − f_i'(x)||₂² ),

and this is what we wanted for step one. [01:17:06] (Question.) W is the parameter for F and W' is the parameter for F', and at least in this context they don't have any relationship, because I'm just trying to show this step one: I'm taking two arbitrary F and F', and I want to say that the difference in all-layer margin is bounded by the difference in each of the layers. So it doesn't matter what they are. [01:17:45] Yes — F' involves all the W_i' and F involves all the W_i. [01:17:54] Okay, cool. Any other questions?
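The Minkowski inequality used above is just the triangle inequality applied to the stacked vectors (a_1, …, a_r) and (b_1, …, b_r); a quick random check:

```python
import numpy as np

rng = np.random.default_rng(1)

r, d = 5, 7  # number of "layers" and vector dimension (arbitrary)
for _ in range(1000):
    a = rng.normal(size=(r, d))
    b = rng.normal(size=(r, d))

    # sqrt(sum_i ||a_i + b_i||^2) is the 2-norm of the stacked vector (a+b),
    # so the inequality is the triangle inequality in dimension r*d.
    lhs = np.sqrt(np.sum(np.linalg.norm(a + b, axis=1) ** 2))
    rhs = (np.sqrt(np.sum(np.linalg.norm(a, axis=1) ** 2))
           + np.sqrt(np.sum(np.linalg.norm(b, axis=1) ** 2)))
    assert lhs <= rhs + 1e-12
```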
[01:18:16] If I'm guessing the question correctly: I think all of this depends on the definition of M_F — yes, of course. [01:18:25] And actually, when we did the research, we were trying to meet in the middle: you have to change the definition in a way that makes the analysis work out. [01:18:35] But in some sense, because the proof is simple and clean, somehow I feel good about the definition, to some extent. [01:18:45] So I guess I'll use the next few minutes — and the four percent of battery — to talk about some comparisons, interpretations, and possible next extensions.
you can compare with [01:19:23] compare with [01:19:25] compare with bar trade at all [01:19:29] 17 is the paper that we discussed last [01:19:32] 17 is the paper that we discussed last time so you can formally do this where [01:19:34] time so you can formally do this where you can formally say that the perturbed [01:19:37] you can formally say that the perturbed model [01:19:37] model if you look at the difference between [01:19:39] if you look at the difference between the preferred model and the original [01:19:40] the preferred model and the original model [01:19:42] model the difference is something [01:19:43] the difference is something like if you do a some kind of like [01:19:46] like if you do a some kind of like telescoping [01:19:48] telescoping thing [01:19:49] thing this is supposed to be not super hard so [01:19:52] this is supposed to be not super hard so you can basically kind of Imagine That [01:19:55] you can basically kind of Imagine That for every layer you you perturb [01:19:57] for every layer you you perturb something so you pay something like that [01:19:58] something so you pay something like that and then you have a you also have to pay [01:20:00] and then you have a you also have to pay the blowing up Factor because of the the [01:20:03] the blowing up Factor because of the the other things right so you can prove this [01:20:11] all right [01:20:27] so if you kind of like you know ignore [01:20:30] so if you kind of like you know ignore some ignoring some Minor Details which [01:20:33] some ignoring some Minor Details which you know allows me to have a cleaner [01:20:36] you know allows me to have a cleaner extra position [01:20:39] extra position um so for example you ignore dependency [01:20:40] um so for example you ignore dependency on r [01:20:46] then you can basically say that you know [01:20:48] then you can basically say that you know if maybe let's also supposed Y is bigger [01:20:52] if maybe let's also supposed Y is bigger 
than zero; for simplicity, say y = 1. Then if you want f(x) > 0 but f_{θ+δ}(x) < 0, that's the situation where you perturb your model so that it predicts the wrong thing, and this means that your δ basically needs to satisfy something like ‖δ‖ · ∏_i ‖W_i‖ ≳ f(x); in other words, ‖δ‖ needs to be at least f(x) divided by the product of the spectral norms. At least, because that's how you can make enough of a difference: if your δ is too small, then the left-hand side is too small, so you don't make a big enough difference. [01:21:45] So that's saying that, basically, m_F(x, y), the new margin, versus y · f(x), the old margin: the ratio is something like 1 / ∏_i ‖W_i‖.
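The flip condition above can be sanity-checked numerically. This is my own illustrative sketch (not code from the lecture), using a three-layer linear network where the bound is exact: perturbing only the middle layer, no perturbation with spectral norm below |f(x)| divided by the product of the other layers' norms can flip the sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy three-layer *linear* network f(x) = w3 @ W2 @ W1 @ x (scalar output).
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((4, 4))
w3 = rng.standard_normal(4)
x = rng.standard_normal(5)

def f(W2_pert):
    return w3 @ W2_pert @ W1 @ x

fx = f(W2)

# Perturbing only W2 by Delta changes the output by w3 @ Delta @ (W1 @ x),
# whose magnitude is at most ||Delta||_2 * ||w3|| * ||W1 x||.  So to flip
# the sign, Delta's spectral norm must be at least |f(x)| / (||w3|| ||W1 x||).
threshold = abs(fx) / (np.linalg.norm(w3) * np.linalg.norm(W1 @ x))

# Any perturbation strictly below the threshold cannot flip the prediction.
for _ in range(200):
    Delta = rng.standard_normal((4, 4))
    Delta *= 0.99 * threshold / np.linalg.norm(Delta, ord=2)
    assert np.sign(f(W2 + Delta)) == np.sign(fx)
```

For ReLU networks the same product-of-spectral-norms factor appears, but only as an upper bound, via the telescoping argument sketched above.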
[01:22:05] I'm writing this in a somewhat informal way; I'm ignoring constants and some minor details. For instance, this product probably shouldn't range exactly from 1 to r, it should be missing some terms in the middle, but those are not super important. This is basically saying that the inverse all-layer margin is something like 1/f(x) times the product of the spectral norms. [01:22:33] So this is indeed a better bound than before: our new bound depends on the inverse all-layer margin, while the old bound depends on 1/(y · f(x)) times the product of the spectral norms on the right-hand side. So this is a better bound, at least in this aspect.
[01:22:59] But why, or how much better, is it when compared to the previous one? That's a question mark. Is it true that your all-layer margin now becomes
polynomial instead of exponential? There are some indicators that this is a much better bound, empirically and conceptually. Empirically, we did verify it, and it seems to be much better: the number becomes smaller, basically because your Lipschitzness on the data is better than the worst case. [01:23:31] Another reason why you can somewhat hope that empirically this is better is something we'll show later. I think I've said this once before, but let me write it down again: SGD prefers Lipschitz solutions, and in some sense, Lipschitz on the data points. [01:23:57] In some sense this is saying that your algorithm is implicitly minimizing the Lipschitzness on the data points; that's why your Lipschitzness on the data points is probably better than the worst-case Lipschitzness over the entire domain, and that's probably why the gap between these two bounds is significant.
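The point that a data-dependent Lipschitz quantity can only be smaller than the worst-case, norm-product bound can be illustrated with a toy sketch (my own example, not from the lecture). For a one-hidden-layer ReLU network, the input-gradient norm at any data point is provably at most ‖W‖₂ · ‖a‖:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU net f(x) = a @ relu(W x).
d, m = 2, 32
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def grad_x(x):
    # Gradient of f w.r.t. the *input*: W^T (a * 1[W x > 0]).
    return W.T @ (a * (W @ x > 0))

# Data-dependent Lipschitzness: largest gradient norm over sample points.
X = rng.standard_normal((50, d))
data_lip = max(np.linalg.norm(grad_x(x)) for x in X)

# Worst-case bound from layer norms, valid for *every* input.
worst_case = np.linalg.norm(W, ord=2) * np.linalg.norm(a)

assert data_lip <= worst_case  # the data-dependent quantity is never larger
```

In practice (e.g. after training) the gap between `data_lip` and `worst_case` can be large, which is the lecture's point about why the data-dependent bound helps.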
[01:24:15] So in some sense you are, very implicitly, minimizing the Lipschitzness, that is, maximizing the all-layer margins. [01:24:33] But of course, this is only approximate, because what SGD prefers, in terms of the form we'll see, is similar but won't exactly match the same form. So we don't have a fully coherent theory yet, but conceptually they all seem to roughly match. [01:25:00] Another thing is that there's something people actually use in practice, which is called SAM, sharpness-aware minimization. This is something that can get you better performance empirically on many data sets. And what they are doing is also a perturbation: we are doing a perturbation, but they perturb the parameter θ.
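A SAM-style update is simple to sketch: ascend to a worst-case nearby point in parameter space, then descend using the gradient taken there. Below is a minimal illustrative version on a least-squares toy loss; the loss, radius `rho`, and step size are my own hypothetical choices, not the lecture's or the SAM paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares loss L(theta) = 0.5 * ||A theta - b||^2.
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

def loss(theta):
    return 0.5 * np.sum((A @ theta - b) ** 2)

def grad(theta):
    return A.T @ (A @ theta - b)

theta = np.zeros(3)
rho, lr = 0.05, 0.02
for _ in range(1000):
    g = grad(theta)
    # SAM-style step: first perturb the *parameters* in the normalized
    # gradient direction (the locally worst-case direction), then descend
    # using the gradient evaluated at the perturbed point.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    theta = theta - lr * grad(theta + eps)

assert loss(theta) < loss(np.zeros(3))  # perturbed-gradient steps still reduce the loss
```

The perturbation biases the iterates toward flat regions of the loss, which is the connection to Lipschitzness in θ that the lecture draws next.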
[01:25:28] So they are trying to make the model more Lipschitz in the parameter θ, instead of more Lipschitz in the hidden variables, the intermediate variables h_i. But actually these two are very related. So here is a fact. If you look at the gradient of the loss with respect to the parameter W_i (and this is now the loss on a single example), it equals the gradient of the loss with respect to the hidden variable in the layer above, times the hidden variable transpose: ∇_{W_i} ℓ = (∇_{h_i} ℓ) h_{i−1}^T. [01:26:19] This is just by differentiation; actually, this rule even has a name, and in neuroscience there's an actual term for this kind of thing. But it is literally just what you get when you differentiate.
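This outer-product fact can be checked numerically. Here is a small sketch (my own toy setup, simplified to a single ReLU layer with a squared loss) that verifies the identity against finite differences, and also the norm factorization that ties the two notions of Lipschitzness together:

```python
import numpy as np

rng = np.random.default_rng(1)

# One layer z = W x feeding a toy loss L = 0.5 * ||relu(z) - y||^2.
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)   # plays the role of the hidden variable below W
y = rng.standard_normal(3)

def loss(W_):
    return 0.5 * np.sum((np.maximum(W_ @ x, 0.0) - y) ** 2)

# Backprop gradient w.r.t. the hidden variable in the layer above (z = W x) ...
z = W @ x
dL_dz = (np.maximum(z, 0.0) - y) * (z > 0)
# ... and the claimed identity: dL/dW = (dL/dz) x^T, a rank-one outer product.
grad_W = np.outer(dL_dz, x)

# Check against central finite differences.
num = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)
assert np.allclose(grad_W, num, atol=1e-5)

# Because grad_W is rank one, its norm factorizes, which is exactly why the
# parameter-gradient norm is tied to the hidden-variable-gradient norm:
assert np.isclose(np.linalg.norm(grad_W),
                  np.linalg.norm(dL_dz) * np.linalg.norm(x))
```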
[01:26:34] To get the gradient with respect to W_i, you just perturb one entry and you get this form. So here, this factor is the gradient with respect to the hidden variable, and this is the hidden variable itself. That's why, if you look at the norm of the gradient with respect to the parameter, it's quite related to the norm of the gradient with respect to the hidden variable; this one is a vector and this one is a matrix, and that's why this is true. So Lipschitzness in the parameter is similar, or at least somewhat related, to Lipschitzness in the hidden variables.
[01:27:22] Okay, so the last thing; I guess I'm running out of time, sorry. This is a more general version, where you don't have to care about the minimum margin over the entire data set. You can prove something like: the test error is at most 1/√n times an average-margin quantity.
Instead of the worst case, the minimum margin over the data set, you look at the average inverse margin, (1/n) Σ_i 1/m_F(x_i, y_i), and then times the sum of the complexities of each layer, plus a lower-order term.
[01:28:22] Oops, oh really? Of course. Okay, five percent is not enough, but this is literally the last thing I want to say. [01:28:43] So basically the last thing I want to say is that instead of having the minimum all-layer margin there, you can have the average all-layer margin instead.
[01:28:55] Okay, any questions? Okay, I guess then see you next Monday, or Wednesday, in two days. Okay, wait, today's Monday, right? Okay, okay, bye, thanks.

================================================================================
LECTURE 012
================================================================================
Stanford CS229M - Lecture 13: Neural Tangent Kernel
Source: https://www.youtube.com/watch?v=btphvvnad0A
---
Transcript

[00:00:05] Okay, I guess let's get started. [00:00:10] I think last week I spent some time
some time reading the [00:00:15] um I I spent some time reading the feedback from the survey [00:00:18] feedback from the survey um I've been going through all of them [00:00:20] um I've been going through all of them so I guess I'm not going to discuss [00:00:22] so I guess I'm not going to discuss every um points there like all the [00:00:24] every um points there like all the points are well taken and thanks for all [00:00:26] points are well taken and thanks for all the for helpful feedback [00:00:29] the for helpful feedback um um and for some of those I'm going to [00:00:32] um um and for some of those I'm going to improve I guess [00:00:33] improve I guess um there are also some other [00:00:35] um there are also some other um conflictory requests uh you know [00:00:38] um conflictory requests uh you know which you know still are very [00:00:40] which you know still are very understandable because different people [00:00:41] understandable because different people have different preferences that's [00:00:43] have different preferences that's completely fun [00:00:44] completely fun um um but I guess I'm just saying that [00:00:46] um um but I guess I'm just saying that it's not like all I can address all the [00:00:49] it's not like all I can address all the possible requests just because sometimes [00:00:51] possible requests just because sometimes there are some constraints [00:00:54] there are some constraints um um but of course I you know sometimes [00:00:56] um um but of course I you know sometimes I think even conflictory requests can be [00:00:59] I think even conflictory requests can be addressed uh if you are creative you [00:01:02] addressed uh if you are creative you know I will try to do that as well [00:01:05] know I will try to do that as well um [00:01:05] um um I guess one [00:01:08] um I guess one um [00:01:08] um I I guess there's one thing I want to [00:01:10] I I guess there's one thing I want to discuss a little bit which I think might 
be useful for you. I'm not trying to find any excuses for the lecture, but some people mentioned that it's a little bit hard to follow and take notes later in the lecture. I can completely understand that: I wrote pretty fast, and I'm going to slow down a little bit, at least to make the layout and format a little cleaner and easier to read. [00:01:36] But in my opinion, and of course I'm not saying that you have to follow my way of taking courses, I typically don't take a lot of notes. At least for this course, I tried to design it so that you don't have to take all the notes yourself, because we're going to have scribe notes later, and some of the scribe notes are already there. [00:01:56] When I listen to a theoretical lecture, I try to think more, so that I can remember things in my head a little bit,
because I feel that, at least for me, it takes too much energy to take all the notes. [00:02:13] I'm not sure whether this is useful for everyone, and I don't think it will be useful for everyone, but maybe you can try it a little bit, just to see whether it's easier if you take even fewer notes and try to remember a little more. [00:02:26] Anyway, I'm going to slow down a little bit, at least in terms of the writing, and probably also in terms of the overall pace, given some of the feedback saying that some of the lectures are a little bit too fast.
[00:02:44] Another thing is the homework questions. Indeed, for some of the questions I probably made the mistake that a few sub-questions are
a little bit too difficult. They were bonus questions in the past offerings, and this quarter I thought that, since you have a team of three people, maybe I can put them in as regular points. But still, they are probably a little too difficult; they require some tricks, as you probably noticed, some combinatorics tricks. [00:03:14] But I checked the last homework, and I think there's nothing like that; most of the questions shouldn't require any super-special combinatorics tricks. [00:03:28] And I guess another thing is that if you want to get some bonus points, there are other ways, for example scribing notes, or improving existing lecture notes. If you don't care about an A+, a bonus point is always the same as a regular
point, in some sense, if you look at the grading policy; at least from your perspective it's worth the same as the regular points. [00:03:55] Basically, the grading policy is that we first decide the cutoffs before the bonus points, and then the bonus points can only give you a better letter grade.
[00:04:07] Okay, anyway. There is other very nice feedback, which I'm going to incorporate into the lectures as well; I'm not going to discuss all of it, just to save some time. So maybe let's get into the technical part, if there are no other questions or discussions.
[00:04:33] So, last Wednesday I was sick and we skipped the lecture; we asked you to watch the video online. Roughly speaking, what we did in the video is that we
talked about non-convex optimization. [00:05:01] I think the main point there was that if you have the so-called property that all local minima are global, then you can find a global minimum. Of course there are technical things; there are the so-called strict saddle points, which we discussed in the video, and other things that are a little bit subtle, but this is the main part. [00:05:24] So basically, you only have to show that this property is true, and then you can find a global minimum of the non-convex loss; this allows a search, from a broader point of view, to be successful. In some sense, what I'm going to discuss next is another example of this. [00:05:43] However, there are some special subtleties.
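As a one-dimensional toy illustration of that recap (my own example, not from the video): f(x) = (x^2 - 1)^2 is non-convex, but its only local minima, x = +1 and x = -1, are both global, so plain gradient descent from almost any starting point reaches a global minimum.

```python
def f(x):
    # Non-convex, but every local minimum (x = +1 or x = -1) is global.
    return (x ** 2 - 1) ** 2

def grad(x):
    return 4 * x * (x ** 2 - 1)

x = 0.3  # x = 0 is a bad critical point (here a local maximum), so start away from it
for _ in range(5000):
    x -= 0.01 * grad(x)

assert abs(f(x)) < 1e-8       # gradient descent reached a global minimum
assert abs(abs(x) - 1) < 1e-4
```

The measure-zero set of bad starting points (here x = 0) is what the strict-saddle machinery from the video handles in higher dimensions.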
[00:05:50] Basically, what we saw last time is that the statement is really true globally: "all local minima are global" is a true statement over the entire space. [00:06:03] Today, what we're going to discuss is that you only look at a special part of the space. So in some sense, the function we're going to discuss today looks something like this: you have some complex part of the function which you don't know how to characterize, but you identify a small part where this property is true. You look at a special region where all local minima are global, and there is actually a good global minimum there, so then you just work only in that region. [00:06:32] That's kind of the connection to the previous lecture. There are other issues with this kind of approach, which I guess we discussed a little bit in one of the overview lectures.
[00:06:44] The limitation would be that you identify this region where everything is nice, the landscape is so nice, but is this the region you really care about? If you really care about finding a global minimum of the training loss, then yes, this has to be the region, because you find the global minimum of the training loss there. But if you care about other properties, like generalization performance, then it might not be the right region to focus on. [00:07:10] But for today's lecture we don't care about that: we're just going to go through how this works, and then we'll talk about limitations, and in future lectures we're going to talk about ways to, in some sense, improve upon this and fix the issues of this kind of approach.
[00:07:31] Okay, so that's a very rough, high-level overview. Also, by the way, if you haven't seen my notes or announcement on Ed: there are actually two videos that we asked you to watch to make up for the last lecture. One of them is a full-length lecture; the other one is about 50 minutes. [00:07:53] They are about this non-convex optimization, "all local minima are global" kind of phenomenon. And this does relate to one of the homework questions. The question itself is still, in some sense, self-contained, but I think it's useful for you to know the basic idea, even the basic proof ideas, in those two videos, so that you can better see how to do the homework question.
[00:08:18] Okay, so today let's talk about this special-region thing. This is all called the neural tangent kernel (NTK) approach.
[00:08:34] I guess the name doesn't really mean much so far; just think of it as a placeholder, and I'm going to explain why this is called the neural tangent kernel. [00:08:41] So the basic idea is that you look at some special place, a neighborhood of your initialization, and you do a Taylor expansion. This works for any non-linear function. Suppose you have a non-linear (or even linear, but non-linear would be the most interesting case) model f_θ(x), and you do a Taylor expansion around the initialization θ_0. [00:09:18] When you Taylor-expand the model at the initialization, you expand with respect to the parameters, not the input: the input is fixed and the parameter is the variable, and θ_0 is the reference point. Then you look at the gradient with respect to θ, evaluated at θ_0.
[00:09:41] times (θ − θ⁰): f_θ(x) ≈ f_{θ⁰}(x) + ∇_θ f_{θ⁰}(x)ᵀ(θ − θ⁰). This is the first-order Taylor expansion, and then you have some higher-order terms, which we are going to ignore. [00:09:56] Once you do this, you can define the right-hand side; let's call it g_θ(x). Of course it also depends on θ⁰, but let's say θ is the variable and θ⁰ is fixed, so this is a function of θ. [00:10:11] So if you define this, then g_θ(x) is a linear function in θ, because wherever θ shows up, it shows up linearly: basically, you linearize your model. You can also define Δθ = θ − θ⁰, the difference between θ and θ⁰. I guess technically you should call this not linear but affine, because there is a constant term.
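As a minimal numerical sketch of this linearization (my own toy example, not from the lecture: a tiny two-layer tanh network with hand-coded gradients, using numpy), we can check that g_θ(x) = f_{θ⁰}(x) + ∇_θ f_{θ⁰}(x)ᵀ(θ − θ⁰) tracks f_θ(x) when θ stays near θ⁰:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 8                      # input dimension, hidden width (made-up sizes)

def f(theta, x):
    """Two-layer net f_theta(x) = a^T tanh(W x); theta packs (W, a)."""
    W = theta[:m * d].reshape(m, d)
    a = theta[m * d:]
    return a @ np.tanh(W @ x)

def grad_f(theta, x):
    """Gradient of f with respect to theta (same dimension p as theta)."""
    W = theta[:m * d].reshape(m, d)
    a = theta[m * d:]
    h = np.tanh(W @ x)
    dW = ((a * (1 - h ** 2))[:, None] * x[None, :]).ravel()  # df/dW entries
    da = h                                                   # df/da entries
    return np.concatenate([dW, da])

theta0 = rng.normal(size=m * d + m)      # the initialization theta^0
x = rng.normal(size=d)

def g(theta, x):
    """First-order Taylor expansion of f around theta0: affine in theta."""
    return f(theta0, x) + grad_f(theta0, x) @ (theta - theta0)

# Stay in a small neighborhood of theta0: the gap is second order in the move.
theta = theta0 + 1e-3 * rng.normal(size=theta0.shape)
print(abs(f(theta, x) - g(theta, x)))
```

At θ = θ⁰ the two agree exactly; for a small perturbation the gap shrinks quadratically with the perturbation size, which is the sense in which the higher-order terms are ignorable.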
[00:10:54] So it's an affine function, in θ or in Δθ; they are not too different. I just wanted to introduce this notation Δθ. [00:11:06] And f_{θ⁰}(x), the value at the reference point, is a constant from this perspective, right? It's a constant for fixed x; it doesn't change as you change θ. So in some sense it's not that important, because it's a constant, and sometimes for convenience you choose θ⁰ such that f_{θ⁰}(x) = 0 for every x. [00:12:03] How do you do that? If you really want this, what you can do, for example, is design the network so that you split it into two parts.
[00:12:17] So suppose before you have a network with all of these connections. Then for some layer you split it into two halves, you do exactly the same thing in the two halves, and then you put a +1 here and a −1 here, so that they cancel. You still have a random initialization, but the initialization has the property that the function computed by the initial model is zero. [00:12:53] I'm not sure whether my drawing makes any sense; I see some confusion on your faces, but this is supposed to be something simple. [00:13:03] For example, say you have a two-layer model: a sum of a_i times σ(w_iᵀx), for i from 1 to m.
[00:13:16] Suppose this is the model. What you can do is add to it −a_i σ(w_iᵀx). So you have 2m neurons, the w_i are shared, and the a_i are paired, so the whole thing becomes zero: if you have 2m neurons, and one half is the same as the other half in terms of w, while the a's are negations of each other, then you make this zero and you still have relatively good randomness; you can still choose the w_i to be random. [00:13:54] Anyway, this is not a super important point, and even if you don't do this, you can still somewhat get away with it, because f_{θ⁰}(x) is a constant. [00:14:05] Okay, so from now on we're going to assume f_{θ⁰}(x) is zero in most cases.
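The pairing trick can be sketched as follows (a toy numpy implementation of my own, with σ = ReLU and made-up sizes): duplicate the hidden units, reuse the same w_i in both halves, and negate the a_i in the second half, so the initial function is identically zero while the weights stay random:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 6                      # input dimension, neurons per half

relu = lambda z: np.maximum(z, 0.0)

# One random half, then its mirror: same W rows, negated output weights a.
W_half = rng.normal(size=(m, d))
a_half = rng.normal(size=m)
W = np.vstack([W_half, W_half])          # 2m neurons with paired w_i
a = np.concatenate([a_half, -a_half])    # a_i in one half, -a_i in the other

def f(x):
    """f(x) = sum_i a_i relu(w_i^T x) over all 2m neurons."""
    return a @ relu(W @ x)

x = rng.normal(size=d)
print(f(x))   # the two halves cancel, so this is (numerically) zero for any x
```

The cancellation holds for every input, not just this one, because each term a_i σ(w_iᵀx) is matched by −a_i σ(w_iᵀx); meanwhile the w_i remain fully random, which is the point of the construction.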
[00:14:18] So basically this is saying: take y′ to be y minus this constant, which we are going to assume is zero, although for this equation we can still think of it as generic. Then you get a linear function in inner-product form: ∇_θ f_{θ⁰}(x)ᵀ Δθ. This is a linear function in Δθ, so you can think of Δθ as the parameter and of the gradient as a feature map. [00:14:59] This is the same as the feature map φ(x) that we discussed, for example, in CS229, when you have a kernel method. And this feature map is something that doesn't depend on the parameter, right? θ⁰ is fixed already, so ∇_θ f_{θ⁰}(x) is really just a fixed function of x; let's say it's fixed given the architecture
[00:15:38] and θ⁰, but it doesn't depend on the θ we optimize over. So in some sense it just becomes a kernel method, and you can define the kernel. [00:15:57] For simplicity, if you assume f_{θ⁰}(x) is zero, then y and y′ are the same, so basically we are fitting a linear function to our target, and this becomes a kernel method. You can define the kernel K(x, x′) to be the inner product of features, φ(x)ᵀφ(x′), which is the inner product of the two gradients ∇_θ f_{θ⁰}(x) and ∇_θ f_{θ⁰}(x′). [00:16:38] And why is this called the neural tangent kernel? The reason is that this feature is the tangent, the gradient, of the network. That's why it's called the neural tangent kernel: the feature is the gradient of the neural network. Anyway, "neural tangent kernel" is
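A minimal sketch of this kernel (again a hypothetical two-layer tanh network with hand-coded gradients, not the lecture's formal setup): the feature is the gradient at θ⁰, the kernel is the inner product of two such features, and the resulting Gram matrix is positive semidefinite, as any kernel Gram matrix must be:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 16                     # input dimension, hidden width (made-up sizes)

def grad_f(theta, x):
    """phi(x) = grad_theta f_theta(x) for f_theta(x) = a^T tanh(W x)."""
    W = theta[:m * d].reshape(m, d)
    a = theta[m * d:]
    h = np.tanh(W @ x)
    dW = ((a * (1 - h ** 2))[:, None] * x[None, :]).ravel()
    return np.concatenate([dW, h])

theta0 = rng.normal(size=m * d + m)      # fixed initialization theta^0

def K(x, x_prime):
    """Neural tangent kernel: inner product of gradient features at theta0."""
    return grad_f(theta0, x) @ grad_f(theta0, x_prime)

xs = rng.normal(size=(5, d))
G = np.array([[K(xi, xj) for xj in xs] for xi in xs])  # 5x5 Gram matrix
print(np.linalg.eigvalsh(G).min())   # nonnegative up to rounding: G is PSD
```

Since G = ΦΦᵀ for the feature matrix Φ, symmetry and positive semidefiniteness come for free, which is exactly what makes this a valid kernel.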
just a name. [00:17:11] Okay. So suppose we just use the model g_θ(x) instead of the original model. Then you basically just get a kernel method, or a linear model on top of the features. [00:17:49] And for the loss function: suppose you believe that θ stays close to θ⁰. Then you can also, intuitively, say: my original loss, which is a function of the model output and y, namely L(f_θ(x), y), is approximately my new loss, L(g_θ(x), y). And g is linear in the parameter, so the whole objective is convex, because L is convex, and a convex function composed with a linear function is still convex. But this is only when θ is very close to θ⁰. [00:18:39] So basically the remaining question is really: how valid is this approximation?
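The convexity claim can be checked directly on a toy instance (random stand-ins of my own for the features and labels): for the quadratic loss, L̂ as a function of Δθ has Hessian (2/n)ΦᵀΦ, which is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20
Phi = rng.normal(size=(n, p))    # stand-in gradient features, one row per example
y = rng.normal(size=n)

def L_hat(dtheta):
    """Empirical squared loss of the linearized model: convex in dtheta."""
    return np.mean((y - Phi @ dtheta) ** 2)

# Midpoint convexity: L((u+v)/2) <= (L(u) + L(v)) / 2 for any u, v.
u, v = rng.normal(size=p), rng.normal(size=p)
print(L_hat((u + v) / 2) <= (L_hat(u) + L_hat(v)) / 2)

# Equivalently, the Hessian (2/n) Phi^T Phi is positive semidefinite.
print(np.linalg.eigvalsh(2.0 / n * Phi.T @ Phi).min() >= -1e-10)
```

This is the "convex composed with linear" fact in miniature: the non-convexity of the original network only enters through f_θ, which the linearization has removed.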
Right, because everything sounds nice: after we do this, everything becomes super easy. But in what cases can this be valid? Go ahead [student question]. [00:19:14] Yeah, the inner product is just the typical inner product; these two are just vectors. So what's the dimensionality here? The gradient of f with respect to θ is p-dimensional if θ is in R^p, so it has the same dimensionality as θ. I guess it also depends on what f is, so let's suppose f maps from R^d to R, where d is the dimension of x. [00:19:57] The point is that the output is one-dimensional, so when you take the gradient with respect to θ you get a p-dimensional vector, where p is the dimension of θ: the gradient with respect to θ has the same dimension as θ. That makes sense, right? So this is a p-dimensional vector, and you just take
[00:21:12] the inner product of two such p-dimensional vectors to define the kernel. Makes sense? Cool. [00:21:26] Okay, so to proceed, let me define two notations, just for simplicity. Let L̂(f_θ) be the empirical loss with the model f_θ, and let L̂(g_θ) be the empirical loss with the model g_θ. This is just for formality, so that we can write things more easily. Okay, so the key idea is that in certain cases this Taylor expansion makes sense, can work, for certain cases. For what cases it makes sense is going to be a big question, which we'll probably discuss at the very end; for now, let's just see how it works. It works in the following sense:
[00:21:54] you say that there exists a neighborhood of θ⁰, let's call this neighborhood B(θ⁰), such that several things happen. [00:22:13] One: you have an accurate approximation in terms of function value, so f_θ(x) is close to g_θ(x), and as a result L̂(f_θ) is close to L̂(g_θ), for every θ in this neighborhood B(θ⁰). That's something you want, and it makes sense: this is the point of a Taylor expansion; you want to approximate the original function. [00:22:46] And two: you want that it suffices to optimize in B(θ⁰), because if in this B(θ⁰) there's no good minimum... [00:23:03] maybe let me draw this again. So basically what we are saying is: there is a neighborhood, call it B(θ⁰), and in this neighborhood, first of all, suppose your
[00:23:21] empirical loss looks like this; maybe something else is happening somewhere over here, we don't know. First of all, if you do the quadratic approximation, the Taylor expansion, around θ⁰ (say this is θ⁰), it looks something like this: it's very close, despite my drawing. So you can think of the red curve as the loss of g_θ and the black curve as the loss of f_θ; the quadratic expansion is very close to the original function. [00:23:56] And second, you want that it suffices to optimize here, because even if the red and the black curves are close, if they are both very high, it doesn't make sense to zoom into this region; you should leave it. But you can say that it suffices to optimize here in
[00:24:21] the following sense: there exists an approximate global minimum θ̂ in B(θ⁰). (Sorry, I'm using a superscript for the zero in θ⁰, which might be a confusing choice, but let me use it consistently; in some other lectures I use superscripts for time, which is why.) [00:24:57] Anyway, you want a θ̂ that is a global minimum, and actually you want L̂(g_θ̂) to be approximately zero. This indicates that you are at a global minimum, because zero is the minimum; there is no way you can go below zero, so if you are close to zero, you have to be close to the global minimum. And this also implies that L̂(f_θ̂) is close to zero. [00:25:29] Okay, but with these two we still don't really understand how we optimize the black curve.
[00:25:43] So, three: you also want to know that optimizing this loss L̂(f_θ) is similar to optimizing L̂(g_θ), and not only this, but also that the optimization does not leave B(θ⁰), because if you leave B(θ⁰), then all bets are off: your Taylor expansion breaks. So you have to say that when you optimize either L̂(f_θ) or L̂(g_θ), you don't leave this region, so that everything is confined to it. [00:26:29] So this is how we make it work. Of course, you can ask whether this really reflects what happens in reality; the answer is no, not always. But for now we are just trying to make this work under certain cases, so that we can appreciate why we have to improve it. [00:26:52] And in some sense, three is a little bit like an extension: three, to some extent,
follows from one and two, because if you have a global minimum in this region, and the black and red curves are close, then the optimization should converge to that global minimum, and you should stay in that region. So to some extent it follows from one and two, but not exactly; technically it still requires a formal proof. [00:27:31] What I'm saying is: if you really just want something somewhat informal to think about the dependencies, then you probably only have to make sure one and two happen; but if you really want everything, you need to prove three as well. [00:27:53] And with one, two, and three, you can make this work in various settings: with either over-parametrization, and/or some particular scaling of the initialization. So you play with the parametrization, or you play with
[00:28:25] the width. And you also need small stochasticity, even small or zero stochasticity. [00:28:42] So you play with the over-parametrization and the scaling of the initialization, and you also insist that there's no stochasticity that makes you go very far, because stochasticity would make you leave the local neighborhood; that's why you want small stochasticity. And how do you get small stochasticity? In a nutshell, you either need a small learning rate or full-batch gradient descent. [00:29:18] So in some sense this is the limitation: it's a limitation because you require this, and it's also a limitation because you have to play with the parametrization, you cannot just take it as given, and what you really eventually get is probably not exactly matching what people do in practice.
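As an informal illustration of this regime (a toy width-2000 ReLU network of my own design, trained with full-batch gradient descent and a modest learning rate; a sketch, not evidence about real networks): the loss drops substantially while the parameters barely move relative to the initialization, i.e., the iterates stay in a small neighborhood of θ⁰.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 2000, 8             # wide hidden layer, only a few examples

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
a = rng.choice([-1.0, 1.0], size=m)   # output weights, kept fixed during training
W0 = rng.normal(size=(m, d))          # initialization theta^0
W = W0.copy()                         # trainable first-layer weights

def forward(W):
    """f_theta(x) = (1/sqrt(m)) sum_i a_i relu(w_i^T x), for all n inputs."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

lr = 0.2
for _ in range(500):                  # full batch: zero stochasticity
    H = X @ W.T                       # n x m pre-activations
    r = forward(W) - y                # residuals
    # gradient of the mean squared loss (1/n) sum_j r_j^2 with respect to W
    G = ((r[:, None] * (H > 0) * a[None, :]).T @ X) * (2.0 / (n * np.sqrt(m)))
    W -= lr * G

rel_move = np.linalg.norm(W - W0) / np.linalg.norm(W0)
print(np.mean((forward(W) - y) ** 2), rel_move)
```

With the width cranked up, the relative Frobenius movement of W stays small even as the loss is driven down, which is the "stay in B(θ⁰)" behavior that conditions one through three are trying to formalize.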
[00:29:39] Okay. So now let's see how we do one and two. But still, regardless of all the limitations, this is an interesting approach. It's kind of surprising that such a region even exists: even if you just think about one and two and don't care about any of the limitations, it's still interesting that there exists a region where you are basically close to a convex function (an actual quadratic function, if the loss is quadratic), [00:30:07] and there's still a global minimum in it. This suggests that there is a lot of flexibility in this landscape when you have a lot of over-parametrization: the landscape is non-convex, but somewhere you have a convex region. That's basically what we're saying: in this landscape, globally, you
know it's very non-convex, very complicated, but in some special places, in some neighborhoods, you really have a convex function, and that convex function has a global minimum reaching roughly zero. So even this is still somewhat of a surprise. [00:30:38] Okay, so now let's try to formalize one and two, and then we'll talk about three. How do we do this? Let's introduce some notation: let φ_i be φ(x_i), the feature for the i-th example, which is the gradient. And let's define the feature matrix Φ to be the matrix with rows φ_1ᵀ, ..., φ_nᵀ; you put all the features in the rows, and this is n × p, where p is the number of parameters. [00:31:37] Okay, so now we can see that the loss function with respect to the linear model is just a linear regression problem, which you are probably familiar with. I'm taking the quadratic loss, or mean squared loss: this is just (1/n) Σ_i (y_i − Δθᵀ φ(x_i))².
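In code, minimizing this objective is ordinary least squares in Δθ. Here is a sketch with a random stand-in for the feature matrix Φ (in the NTK setting its i-th row would be ∇_θ f_{θ⁰}(x_i)ᵀ), in the over-parametrized regime p > n where the features can interpolate the labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 50                     # n examples, p parameters (over-parametrized: p > n)

Phi = rng.normal(size=(n, p))    # feature matrix; row i stands in for phi(x_i)^T
y = rng.normal(size=n)           # concatenated labels, y in R^n

# Minimize (1/n) ||y - Phi @ dtheta||_2^2: plain linear regression in dtheta.
# lstsq returns the minimum-norm solution when the system is underdetermined.
dtheta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

train_mse = np.mean((y - Phi @ dtheta) ** 2)
print(train_mse)   # essentially zero: with p > n the features interpolate y
```

The fact that the training loss can be driven to (approximately) zero here is exactly the L̂(g_θ̂) ≈ 0 condition from step two, in the over-parametrized case.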
[00:32:34] So if you write it in matrix notation, this will be (1/n)‖y − ΦΔθ‖₂², where y ∈ ℝⁿ is the concatenation of all the labels. Right, so this should sound very familiar: it's exactly linear regression, where Δθ is your parameter and Φ is your design matrix, or rather the feature matrix. And let's assume, just for convenience, that yᵢ is on the order of one, so that the two-norm ‖y‖₂ is on the order of √n.
[00:33:25] So here is a lemma that characterizes (this is, in some sense, for point two) in what neighborhood you have a global minimum.
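As a quick sanity check, the per-example and matrix forms of this quadratic loss agree. A minimal NumPy sketch, with made-up random data (the names `Phi`, `y`, `dtheta` are just illustrative stand-ins for Φ, y, Δθ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                       # p > n: more parameters than examples
Phi = rng.normal(size=(n, p))       # feature matrix; row i is phi(x_i)^T
y = rng.normal(size=n)              # labels, each on the order of one
dtheta = rng.normal(size=p)

# Per-example form: (1/n) * sum_i (y_i - <dtheta, phi(x_i)>)^2
loss_sum = np.mean((y - Phi @ dtheta) ** 2)

# Matrix form: (1/n) * ||y - Phi dtheta||_2^2
loss_mat = np.linalg.norm(y - Phi @ dtheta) ** 2 / n

assert np.isclose(loss_sum, loss_mat)
```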
[00:33:48] So suppose p is bigger than n (you have more features than the number of data points), the rank of this feature matrix Φ is equal to n, and the minimum singular value is equal to some σ larger than zero. Then let Δθ̂ be the minimum-norm solution to ΦΔθ = y. Right, so you want to fit ΦΔθ to y, and you want to understand what the nearest global minimum is. The point is that this is the nearest global minimum, in some sense: if you fit the data exactly, you are achieving a global minimum, and you want the norm to be the smallest, so that means you are looking for the nearest one. And if you are looking for the nearest one, you can have a bound on the nearest global minimum.
[00:35:08] The bound is something like ‖Δθ̂‖₂ ≤ O(√n/σ). So the bound itself so far is not that interpretable, but the point here is that if you take the ball B(θ₀) to have this radius, that is, B(θ₀) = {θ = θ₀ + Δθ : ‖Δθ‖₂ ≤ O(√n/σ)}, then this ball B(θ₀) will contain a global minimum.
[00:36:04] Okay, so this is characterizing how large the ball needs to be, how large the region needs to be, so that it can contain a global minimum. And the number here is so far not interpretable; I'm going to compare it with some other things, because by itself, if you only care about containing a global minimum, you could just take the region to be as large as possible. You have to compare it with something else.
[00:36:25] And the proof is also pretty easy; this is really just a simple, almost trivial thing. You can write this Δθ̂ explicitly: the minimum-norm solution is the pseudo-inverse applied to the labels, Δθ̂ = Φ⁺y. And then (I guess this is not extremely obvious, but you can invoke some relatively basic properties of the pseudo-inverse) the operator norm of the pseudo-inverse is at most one over the minimum singular value of Φ; actually, I think these are exactly the same. So ‖Φ⁺‖ = 1/σ, and then you have a bound on ‖Δθ̂‖₂ using the operator norm of the pseudo-inverse of Φ times the two-norm of y: ‖Δθ̂‖₂ ≤ ‖Φ⁺‖ · ‖y‖₂ = O(√n)/σ. That's it. I guess I don't even need a big O here; I don't know
why I have big O's everywhere. [00:37:43] Sorry, it's just that for me it's always safe to have big O's; it's part of my brain that cannot work without them. But wait, oh, I guess I do need a big O, because I'm only assuming that ‖y‖₂ is on the order of √n. Since I only assume ‖y‖₂ ≤ O(√n), that's why I need the big O. But anyway, the constant doesn't matter here; you get the point, I guess. Okay, so, any questions so far?
[00:38:31] So now let's see whether this region is too big or too small. It sounds somewhat big, right? But actually you'll see that the region is not that big, because σ could be made very big, in some sense. These are all relative kinds of things: you have to compare it with something else.
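The lemma is easy to check numerically: the minimum-norm interpolant Φ⁺y fits the labels exactly, and its norm is at most ‖y‖₂/σ ≤ O(√n/σ). A small NumPy sketch, with random stand-in data (not the lecture's actual setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50
Phi = rng.normal(size=(n, p))   # p > n, so rank(Phi) = n almost surely
y = rng.normal(size=n)

# Smallest (nonzero) singular value sigma of the feature matrix
sigma = np.linalg.svd(Phi, compute_uv=False).min()

# Minimum-norm solution to Phi @ dtheta = y via the pseudo-inverse
dtheta_hat = np.linalg.pinv(Phi) @ y

# It interpolates the labels exactly ...
assert np.allclose(Phi @ dtheta_hat, y)

# ... and its norm obeys ||dtheta_hat||_2 <= ||y||_2 / sigma = O(sqrt(n)/sigma)
assert np.linalg.norm(dtheta_hat) <= np.linalg.norm(y) / sigma + 1e-9
```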
[00:38:52] You have to compare this with how good an approximation you have in the region. So the next lemma is for point one, in some sense.
[00:39:09] So suppose the gradient ∇_θ f_θ(x) of the network is β-Lipschitz in θ, in the sense that for every x and for every θ and θ′ you have ‖∇_θ f_θ(x) − ∇_θ f_{θ′}(x)‖₂ ≤ β‖θ − θ′‖₂. (Wait, sorry, my bad: what I'm writing here is a function of θ, because I evaluate the gradient at some arbitrary θ, not only at θ₀.) So I want this, as a function of θ, to be Lipschitz. That means that if you choose two different places θ and θ′, the difference between the gradients in L2 norm (I have to use the L2 norm here because these are vectors) is bounded by β times the difference in parameter space.
[00:40:23] So if you have this, then we know that the approximation error |f_θ(x) − g_θ(x)| is less than O(β‖Δθ‖₂²). Right, because the difference between these two basically depends, in some sense, on how far you are away from the reference point: at the reference point they should be exactly the same, and if you are a little bit more away from the reference point, then you are going to incur some loss, and the loss is something of second order. That's also intuitive.
[00:41:04] And the important thing is that for every θ in the ball B(θ₀) that we just defined, we have |f_θ(x) − g_θ(x)| ≤ O(βn/σ²),
and that's just by plugging in the definition of B(θ₀): the ball B(θ₀) has this radius O(√n/σ), and you plug it in here. So you get that in this ball you have some bound on how good your approximation is. (Oh sorry, this slide; copy-pasting error. Okay.)
[00:42:09] So I saw a question. By the way, you can feel free to unmute, but I can read the question right now: how do we define Φ with a superscript plus? Oh, this Φ⁺ is the pseudo-inverse of Φ. Actually, this is the most common definition of the pseudo-inverse, the Moore-Penrose pseudo-inverse (thanks for the comment in the chat). I guess you can roughly think of it as the inverse of Φ, with some small caveats.
[00:43:00] I think this is supposed to be taught in a linear algebra course, maybe; I'm not sure what I can say about it. At least in this case, for the sake of simplicity, just think of the pseudo-inverse as the inverse if you are not super familiar with it. And then you can verify that this is a solution to the equation: if you plug Δθ = Φ⁺y into ΦΔθ = y, then Φ cancels with Φ⁺, just as with an inverse, and you get y back. That's how you verify this is a solution to the equation.
[00:44:01] Okay, and also another useful thing to know is the following.
The pseudo-inverse has exactly the inverse of the spectrum of the original matrix. So suppose Φ has singular values σ₁ up to σ_k; then Φ⁺ has singular values 1/σ₁ up to 1/σ_k. And if all these σ's are positive (you ignore the zero singular values), then this is exactly true: the singular values just get inverted. Okay, cool. I hope that answers the question.
[00:44:51] Okay, going back to the second lemma, for point number one. This is saying how good your approximation is in this neighborhood. So we got this number, and I'm going to explain this number; that's the important thing. How small is it? If it's small, that's great; if it's big, that's a problem.
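Both facts about the pseudo-inverse mentioned above, that it inverts the nonzero singular values and that Φ⁺y solves ΦΔθ = y when rank(Φ) = n, can be verified directly in NumPy (random data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 12
Phi = rng.normal(size=(n, p))        # rank n almost surely, since p > n
Phi_pinv = np.linalg.pinv(Phi)       # Moore-Penrose pseudo-inverse, p x n

s = np.linalg.svd(Phi, compute_uv=False)            # sigma_1 >= ... >= sigma_n > 0
s_pinv = np.linalg.svd(Phi_pinv, compute_uv=False)  # singular values of Phi^+

# The nonzero singular values of the pseudo-inverse are exactly 1/sigma_k
assert np.allclose(np.sort(s_pinv), np.sort(1.0 / s))

# When rank(Phi) = n, Phi @ Phi^+ = I, so Phi^+ y really solves Phi dtheta = y
assert np.allclose(Phi @ Phi_pinv, np.eye(n))
```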
[00:45:10] But maybe let me first just say the proof of this lemma. The proof of the lemma basically follows from the fact that if h(θ) satisfies that ∇h is β-Lipschitz (and this gradient-Lipschitz condition is basically equivalent to the Hessian operator norm ‖∇²h‖ being bounded by β, if everything is twice differentiable), then you can bound the quality of the Taylor expansion. So you can say that |h(θ) − h(θ₀) − ⟨∇h(θ₀), θ − θ₀⟩| is bounded by O(β‖θ − θ₀‖₂²). And this h(θ) will just be f_θ(x) in our case: if you take h(θ) = f_θ(x), then you get the lemma above. So the point is that the approximation error is second order in the difference between your point and the reference point.
[00:46:39] Okay, and there's a small remark, another small remark: if f_θ involves ReLU,
then the gradient ∇_θ f_θ is not even continuous, so it cannot be Lipschitz everywhere, and this requires some special fixes. The fixes are not that surprising: even though the gradient is not continuous everywhere, it is still continuous almost everywhere, so basically it's kind of close to Lipschitz, and in some sense, if you look at the average over the data points, you still have some Lipschitzness. But let's not discuss that now; it's a little bit of low-level detail which is not important. We can just assume we are dealing with something like sigmoid; then there is no such issue.
[00:47:48] Okay, cool. So now let's go back to the main thing, which is whether this is a good bound. So you say that you have found this θ₀,
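The Taylor-remainder fact behind the lemma can be checked numerically on a sigmoid-based function, where the Hessian operator norm is at most max|sigmoid″| = 1/(6√3). This is an illustrative sketch only; the function `h` here is a made-up smooth stand-in, not the lecture's network:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def h(theta):
    # A smooth scalar function of theta: sum of coordinate-wise sigmoids.
    return sigmoid(theta).sum()

def grad_h(theta):
    s = sigmoid(theta)
    return s * (1.0 - s)           # sigmoid'(theta), coordinate-wise

# Hessian is diag(sigmoid''(theta_i)), so its operator norm is <= 1/(6*sqrt(3))
beta = 1.0 / (6.0 * np.sqrt(3.0))

rng = np.random.default_rng(3)
theta0 = rng.normal(size=20)
for _ in range(100):
    dtheta = rng.normal(size=20)
    taylor = h(theta0) + grad_h(theta0) @ dtheta   # first-order approximation
    remainder = abs(h(theta0 + dtheta) - taylor)
    # beta-Lipschitz gradient => remainder is second order in ||dtheta||_2
    assert remainder <= 0.5 * beta * np.dot(dtheta, dtheta) + 1e-12
```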
and you have shown that in this ball you have such an approximation error. So the important question is: what is this β over σ²? Is it small or big? And the interesting thing is that this quantity is not scale-invariant. The n is something you cannot change, but β/σ² is not scale-invariant.
[00:48:27] So what does that mean? I think the easiest way to think about it is that you have a square in the denominator and the β on top, so somehow you can play with the scaling to make this go to zero. So there are two cases; actually there are more than two cases, but I'm going to discuss two. These are in different papers, but I'm going to
unify them in the following way. So there are two cases where β/σ² can go to zero. The first way is that you can reparametrize your network with a scalar. This is Chizat et al., I think 2019, and the paper is called "On Lazy Training" (in differentiable programming), something like that. So I guess the paper title suggests they're saying that this is a lazy way of training networks, not really the final way you should believe in, but nevertheless the paper is very nice. And what they do is the following. They say: let your parametrization f_θ(x) be α times f̄_θ(x), and let's make this f̄ fixed. This f̄ is actually a standard network,
fixed in the sense that you don't change the architecture: you just take whatever standard network, with a fixed width and depth and so on and so forth, something that you don't change, at least from this perspective, and you only change α. So for every α you define f_θ(x) = αf̄_θ(x); it's the very same network, you just have a different scaling in front of it. And then let's see how everything changes as you change α. And you also fix the initialization: a fixed initialization scheme θ₀.
[00:50:55] And then let's say σ̄ is the σ, the minimum singular value, of the base network (let's say the base network is this f̄), and likewise β̄ is the gradient-Lipschitz constant of the base one.
[00:51:38] So you can think of σ̄ and β̄ as not changing as you change α. And now let's see how α changes the final σ and β of your network. So: σ is equal to ασ̄, because once you have constructed f_θ you multiply by α, so all the features, the gradients, become α times bigger, and then everything becomes α times bigger. Right, this is just because of the chain rule: when you take the gradient with respect to θ, ∇_θ(αf̄_θ(x)) = α∇_θ f̄_θ(x). So everything got scaled, and β also got scaled by α, just because the gradient got scaled, for the same reason. And then you can see that you
get a free factor of α in this equation: β/σ² becomes (β̄/σ̄²) times 1/α, and this can go to zero as α goes to infinity.
[00:52:55] So basically they're saying that whatever network you take, whatever initialization, as long as your σ̄ and β̄ are reasonable, and σ̄ is not zero or something like that, you may have some β̄/σ̄² that is bad, but you can always rescale, reparametrize with a constant in front of it, so that this key quantity β/σ² goes to zero. And if this goes to zero, what does it mean? It means that your approximation becomes better and better; at some point, if you make your α large enough, you make this approximation super good.
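The scaling argument above is easy to verify numerically: multiplying the model by α multiplies every singular value of the feature matrix (hence σ) and the smoothness β by α, so β/σ² shrinks like 1/α. A sketch with a random stand-in for the base network's feature matrix, and β̄ set to an arbitrary placeholder value:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 30
Phi_bar = rng.normal(size=(n, p))   # stand-in features of the base network f_bar
sigma_bar = np.linalg.svd(Phi_bar, compute_uv=False).min()
beta_bar = 1.0                      # placeholder smoothness constant of f_bar

for alpha in [1.0, 10.0, 100.0]:
    # f = alpha * f_bar: by the chain rule every gradient picks up a factor alpha
    sigma = np.linalg.svd(alpha * Phi_bar, compute_uv=False).min()
    beta = alpha * beta_bar
    assert np.isclose(sigma, alpha * sigma_bar)
    # the key quantity beta/sigma^2 shrinks like 1/alpha
    assert np.isclose(beta / sigma**2, (beta_bar / sigma_bar**2) / alpha)
```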
[00:53:40] So in the neighborhood — basically you have found a neighborhood such that, if you take α to be big, your approximation is very good inside that neighborhood. [00:53:55] [Student question about the loss.] No, the loss wouldn't change — that's a good question. What is the loss? The loss is something composed on top of the network: for example ℓ(α f̄_θ(x), y). So, first of all, at initialization we always arrange the output to be zero, so that term wouldn't change. And second, even though seemingly this whole scaled function can be big — sure, that's true — we show that you have a global minimum in this neighborhood. [00:54:46] I'm not sure whether that makes sense, so [00:54:58] let me try to draw a figure to answer this question.
[00:55:03] So the question is: what happens when α is big? It sounds like the function value becomes big, right? That's true, but [00:55:13] I think what happens is — [00:55:27] how do I visualize this — [00:55:34] if you stretch by α, your loss becomes sharper. If you look at how everything depends on α: making α bigger makes the neighborhood smaller, [00:55:51] so inside the neighborhood you get something very sharp. [00:56:03] However, if α is bigger, you can actually find a minimizer that is very close by — you have to move even less from the initialization. That's because if you do a little bit of work, you have already essentially fit the data.
[00:56:28] Did that make sense? Okay, so there is one thing which is always useful, which is that f_{θ₀}(x) = 0 — the output at initialization is zero. [00:56:40] So you always start from zero, where there is no scale at all: this is literally zero, and if α is big, α · 0 is still zero. [00:56:50] But when α is big you are more sensitive to θ, so if you change θ a little bit, you can already fit your data. [00:57:01] You only have to move very, very little from θ₀ to fit the data, and when you move very little, your approximation is very good in that neighborhood. [00:57:19] I'm not sure whether that makes sense — it's a little bit counterintuitive, I agree.
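As a numeric illustration of this rescaling argument (my own sketch, not from the lecture): take a one-parameter model h(θ) = α·f(θ) with f smooth and f(θ₀) = 0, and check that the parameter move needed to fit a target shrinks like 1/α while the linearization error over that move vanishes. The choice f = tanh is an arbitrary smooth stand-in.

```python
import numpy as np

# Hypothetical one-parameter model h(theta) = alpha * f(theta), f = tanh,
# with f(theta0) = 0 at theta0 = 0, mimicking the zero output at initialization.
f = np.tanh
f_prime = lambda t: 1.0 - np.tanh(t) ** 2

theta0, y = 0.0, 0.5          # initialization and target label
rel_errs = []
for alpha in (1.0, 10.0, 100.0):
    dtheta = y / (alpha * f_prime(theta0))    # move predicted by the linearization
    lin = alpha * f_prime(theta0) * dtheta    # linearized output after the move (= y)
    true = alpha * f(theta0 + dtheta)         # actual output after the move
    rel_errs.append(abs(true - lin) / abs(lin))
    print(f"alpha={alpha:6.1f}  |dtheta|={abs(dtheta):.4f}  rel_err={rel_errs[-1]:.2e}")
```

Larger α means a smaller required move and a smaller relative linearization error, matching the discussion above: the neighborhood shrinks, but the approximation inside it improves.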
[00:57:30] But really, the only thing that happens here is the question of how β and σ — the relative size of β and σ — depend on α. [00:57:40] In some sense, if you have larger α you need a smaller neighborhood, but the approximation error also scales faster, because your function becomes much more non-smooth — sharper. [00:58:04] But the neighborhood actually shrinks faster than the sharpness grows, and that's why it works. [00:58:15] I hope that somewhat answers the question. This is, generally, somewhat confusing. [00:58:25] And there is another case where we can also see this effect: the other case is overparametrization.
[00:58:45] This is actually the setting of the original first few papers which invented the NTK approach. [00:58:55] Basically, you say you have a model ŷ = (1/√m) Σᵢ aᵢ σ(wᵢᵀx). This is a two-layer network with m neurons, and I'm scaling it by 1/√m [00:59:12] — mostly for convenience, because whatever scale you use here, you can change the other scales to compensate. [00:59:21] The convenience comes from the fact that if I choose everything on the order of one, then this will output something on the order of one, as we'll see — but let's discuss that in a moment, after I introduce the notation. [00:59:36] So I'm going to have this matrix W, which contains all the rows wᵢᵀ, with W ∈ ℝ^{m×d}. [00:59:45] And what is σ — is it ReLU here? [00:59:50] Maybe let's not take σ to be ReLU.
[00:59:58] Let's say σ is something 1-Lipschitz, [01:00:03] and it has a bounded second derivative. [01:00:13] You will see how those conditions come into play — they're not super important. [01:00:23] And what is the initialization? The aᵢ are initialized to be +1 or −1, [01:00:34] and they are not optimized at all — technically speaking they are not even parameters. [01:00:41] The wᵢ's are the parameters, and wᵢ(0) is initialized from a d-dimensional Gaussian with identity covariance. [01:00:52] And let's say x has norm on the order of one, ‖x‖ = 1 [01:01:03] — this is just for convenience, so that we have a fixed scaling. [01:01:12] And let's say θ, the parameter, is really just the vectorized version of W — a vector in ℝ^{dm}. [01:01:33] Okay. [01:01:40] So we'll assume m goes to infinity.
[01:01:43] So m is eventually polynomially large, while n and d are considered fixed; m is the thing that becomes bigger and bigger, and that's the power — everything comes from the scaling of m. [01:02:00] So, just to explain why we want this 1/√m scaling and this initialization scale: [01:02:11] at least one reason is that if you look at σ(wᵢ(0)ᵀx), this is on the order of one, [01:02:23] because wᵢ is a spherical Gaussian and x has norm one — a spherical Gaussian dotted with a unit-norm vector is roughly on the order of one, and applying σ keeps it on the order of one. [01:02:47] And then consider the sum Σᵢ aᵢ σ(wᵢ(0)ᵀx).
[01:02:50] This sum will be on the order of √m, because you have m of these terms, each roughly ±1, and since the aᵢ are ±1 signs they cancel in some sense, [01:03:03] so you get √m. And that means f_{θ(0)}(x) is on the order of one, [01:03:11] because you have another 1/√m in front. [01:03:19] That's one of the reasons why you choose this scaling. [01:03:21] Okay — so initially our output is on the order of one, and now let's see how σ and β depend on all of these quantities. [01:03:35] We hope that the key quantity β/σ² goes to zero as m goes to infinity. [01:03:43] So let's first look at σ: σ is the minimum singular value of this feature matrix Φ.
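The two-layer parametrization just defined can be sketched in a few lines (a hedged sketch — variable names are mine, and tanh stands in for the generic 1-Lipschitz σ); it checks numerically that the output at initialization stays on the order of one as m grows:

```python
import numpy as np

# f_theta(x) = (1/sqrt(m)) * sum_i a_i * sigma(w_i^T x); tanh stands in for sigma.
rng = np.random.default_rng(0)
d = 20
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                     # ||x|| = 1, as assumed in the lecture

def f_init(m):
    a = rng.choice([-1.0, 1.0], size=m)    # fixed +-1 output signs, never trained
    W = rng.standard_normal((m, d))        # w_i(0) ~ N(0, I_d), spherical Gaussian
    return float(a @ np.tanh(W @ x)) / np.sqrt(m)

outputs = [f_init(m) for m in (100, 10_000, 1_000_000)]
print(outputs)                             # all on the order of one, independent of m
```

The inner sum grows like √m by cancellation of the ±1 signs, and the 1/√m in front brings it back to O(1), which is exactly the reason given for this scaling.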
[01:04:15] And σ² is the same as the minimum eigenvalue of ΦΦᵀ — this is just an identity, because the spectrum of ΦΦᵀ is the square of the spectrum of Φ. And what is ΦΦᵀ? ΦΦᵀ is basically the empirical kernel matrix: [01:04:18] its (i, j) entry is just the inner product between the features of two examples, ⟨∇f_θ(x⁽ⁱ⁾), ∇f_θ(x⁽ʲ⁾)⟩. [01:04:34] So let's look at the scaling of ΦΦᵀ; to do that, you have to look at the gradient. [01:04:44] If you take the derivative of the output with respect to each wᵢ, then by the chain rule you get ∂f_θ(x)/∂wᵢ = (1/√m) aᵢ σ′(wᵢᵀx) · x. [01:05:03] This is the gradient with respect to every neuron, every vector wᵢ. [01:05:11] That means if you look at the entire gradient — all these vectors stacked — its squared norm is ‖∇_θ f_θ(x)‖² = (1/m) Σᵢ σ′(wᵢᵀx)² · ‖x‖².
[01:05:45] It's kind of hard to know exactly what this is, but I think you mostly care about the dependency on m. [01:05:52] So what is the dependency on m? As m goes to infinity, by concentration, [01:06:00] this just converges to an expectation, because it's an empirical average — there is a 1/m in front: it converges to E_{w ~ N(0, I)}[σ′(wᵀx)²] · ‖x‖², [01:06:19] and ‖x‖² is a constant — basically one. [01:06:25] And this whole thing does not depend on m; it will be O(1). To see that it is also Ω(1) is maybe somewhat tricky, but at least you know it does not depend on m. [01:06:43] So m is not in this equation, and basically this is saying that the norm of each of these feature vectors is on the order of one.
[01:06:54] It doesn't change as m goes to infinity. [01:06:57] And you can do the same thing for the inner product of the features of two examples x, x′: [01:07:10] the same thing happens. If you look at the inner product, it's ⟨∇f_{θ(0)}(x), ∇f_{θ(0)}(x′)⟩ = (1/m) Σᵢ σ′(wᵢ(0)ᵀx) σ′(wᵢ(0)ᵀx′) · ⟨x, x′⟩ — technically there should be a zero here, this is at initialization. [01:07:49] And as m goes to infinity, by concentration this is concentrated around its expectation, [01:08:09] which I can write as E_{w ~ N(0, I)}[σ′(wᵀx) σ′(wᵀx′)] · ⟨x, x′⟩, where w is a spherical Gaussian. [01:08:28] Okay — so again, this does not depend on m.
[01:08:45] So this is saying that this entire matrix ΦΦᵀ goes to some constant matrix as m goes to infinity. [01:08:54] And I think this matrix is sometimes called K^∞ — this is the neural tangent kernel with m equal to infinity. [01:09:07] So this is a fixed matrix, and you can show that it is full rank. [01:09:26] I'm going to skip this part, but it can be shown that K^∞ is full rank. [01:09:37] So let's take λ_min to be the minimum eigenvalue of K^∞, which is larger than zero. [01:09:48] Then basically you can show that λ_min(ΦΦᵀ) [01:10:09] is larger than, for example, one half times λ_min(K^∞), [01:10:15] if m is sufficiently big — just because ΦΦᵀ is converging to the constant matrix K^∞.
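This concentration is easy to check numerically. Below is a small sketch (mine, with σ = tanh so σ′ = 1 − tanh²) estimating one entry of the empirical kernel (1/m) Σᵢ σ′(wᵢᵀx) σ′(wᵢᵀx′) ⟨x, x′⟩ for growing m; the fluctuations die out, consistent with ΦΦᵀ → K^∞:

```python
import numpy as np

# One entry of the empirical NTK at initialization, sigma = tanh:
#   K_m(x, x') = (1/m) sum_i sigma'(w_i^T x) sigma'(w_i^T x') <x, x'>.
rng = np.random.default_rng(1)
d = 10
x1 = rng.standard_normal(d); x1 /= np.linalg.norm(x1)
x2 = rng.standard_normal(d); x2 /= np.linalg.norm(x2)
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

def ntk_entry(m):
    W = rng.standard_normal((m, d))        # fresh w_i(0) ~ N(0, I_d)
    return float(np.mean(sigma_prime(W @ x1) * sigma_prime(W @ x2)) * (x1 @ x2))

entries = [ntk_entry(m) for m in (100, 10_000, 1_000_000)]
print(entries)                             # values settle down as m grows
```

Each entry is an average of m bounded i.i.d. terms, so the law of large numbers gives the m-independent limit used in the lecture.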
[01:10:26] If m is sufficiently big, then your eigenvalues should also converge. This part, again, I didn't do rigorously, but you can expect that when you converge to some matrix, your spectrum should also converge to that of the matrix. [01:10:48] Okay — so with all of this, basically this is saying that our σ is not changing, in some sense, as m goes to infinity. [01:11:04] But now let's see how β changes. [01:11:09] As m goes to infinity, we'll show that β goes to zero, so that β/σ², the key quantity, will go to zero. [01:11:39] Okay, so now what we do is look at the Lipschitzness of the gradient — the constant β — which means you care about the difference between the two gradients ∇f_θ(x) and ∇f_{θ′}(x).
[01:11:56] And we have computed what the gradient is. Both of these gradients are matrices, because θ is a matrix: [01:12:10] the gradient with respect to each row wᵢ is (1/√m) aᵢ σ′(wᵢᵀx) · x, so the full gradient is really a matrix with these rows. [01:12:36] So if you look at the Euclidean norm of the difference, [01:12:45] it's the sum of the squared norms of the components: you get a 1/m, which comes from squaring the 1/√m, and then for each component — a scalar times a vector — you get the norm of x times the scalar difference: ‖∇f_θ(x) − ∇f_{θ′}(x)‖² = (1/m) Σᵢ ‖x‖² (σ′(wᵢᵀx) − σ′(wᵢ′ᵀx))². [01:13:13] So now suppose you want to get rid of this σ′.
[01:13:19] Let's say this is at most (1/m) Σᵢ ‖x‖² (wᵢᵀx − wᵢ′ᵀx)², [01:13:26] just without the σ′ — that's valid because σ′ itself is 1-Lipschitz [01:13:40] (up to a constant; let's put a big-O here). [01:13:48] And of course this doesn't work for ReLU — as I said before, for ReLU we have to fix it in some way. [01:13:56] Okay, and then you can get rid of the x as well: [01:14:13] the norm of x is one, as we claimed, and for this term we just use Cauchy–Schwarz, (wᵢᵀx − wᵢ′ᵀx)² ≤ ‖wᵢ − wᵢ′‖² · ‖x‖², [01:14:28] and ‖x‖² is also one, so we can drop it. [01:14:42] Then this is (1/m) ‖θ − θ′‖², the squared Euclidean distance between θ and θ′. [01:14:51] So this is saying the smoothness is 1/m — oh, I guess the Lipschitz constant needs a square root, because we didn't take the square root yet.
[01:15:07] So the norm itself is at most (1/√m) ‖θ − θ′‖, [01:15:17] so β is O(1/√m). And now if we look at the key quantity β/σ², [01:15:27] this is equal to (1/√m) over σ², where σ² is something like λ_min(K^∞) — something that doesn't depend on m. [01:15:42] So this will go to zero as m goes to infinity. [01:15:47] So here, the radius you need is always the same, because σ is always the same, but your function becomes more and more smooth — your gradient becomes more Lipschitz as you have more and more neurons. [01:16:03] That's why eventually, as you have more neurons, you can get into this regime. [01:16:28] Let me see. [01:17:02] So let me take the next ten minutes to discuss the outline of the next steps. Any questions so far? [01:17:15] Now let's try to establish claim (3) — recall that (3) is about: optimizing over g and optimizing over f are similar.
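The β = O(1/√m) smoothness bound derived above can be sanity-checked numerically. This is my own sketch (again with σ = tanh): it measures ‖∇f_θ(x) − ∇f_{θ′}(x)‖ for a unit-norm parameter move and growing m:

```python
import numpy as np

# Gradient of each neuron block: d f / d w_i = (1/sqrt(m)) a_i sigma'(w_i^T x) x.
rng = np.random.default_rng(2)
d = 10
x = rng.standard_normal(d); x /= np.linalg.norm(x)
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

def grad(W, a):
    # stack the per-neuron gradients and flatten, matching theta = vec(W)
    return (((a * sigma_prime(W @ x))[:, None] * x) / np.sqrt(W.shape[0])).ravel()

betas = []
for m in (100, 10_000, 1_000_000):
    a = rng.choice([-1.0, 1.0], size=m)
    W = rng.standard_normal((m, d))
    delta = rng.standard_normal((m, d))
    delta /= np.linalg.norm(delta)         # unit-norm move, ||theta - theta'|| = 1
    betas.append(float(np.linalg.norm(grad(W + delta, a) - grad(W, a))))
    print(m, betas[-1])                    # shrinks roughly like 1/sqrt(m)
```

The measured gradient change decays with m as the argument predicts: more neurons make the gradient more Lipschitz while σ stays put.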
over f are similar. [01:17:33] Right. So you can basically do two things. There are a lot of different ways to analyze this, and all the analyses, you can think of, take two steps, at least implicitly — the first step probably doesn't have to be written in the paper, but I'm pretty sure many people do it when they derive the analysis. [01:17:57] So the first step, (a): it sounds reasonable to first analyze the optimization of L̂(g_θ). [01:18:08] And the second step, (b), is that you somehow analyze the optimization of L̂(f_θ) by [01:18:17] reusing the proofs in some way — of course you cannot reuse them exactly, but you can probably reuse most of the ideas. [01:18:32] The intuition is that these two things are similar, so somehow you can reuse the proof to do the actual optimization.
for the new actual F data [01:18:44] optimization for the new actual F data right so and there are two ways to do [01:18:51] to for for a [01:18:54] to for for a I think essentially you can say two ways [01:18:55] I think essentially you can say two ways maybe there's possibility that I missed [01:18:58] maybe there's possibility that I missed some of the existing papers but roughly [01:19:00] some of the existing papers but roughly speaking there are two ways for a and [01:19:03] speaking there are two ways for a and therefore there are two ways for B in [01:19:05] therefore there are two ways for B in some sense [01:19:10] so [01:19:11] so so the first way let's say I is that you [01:19:15] so the first way let's say I is that you Leverage [01:19:18] uh the strong convexity [01:19:24] of this L height G Theta [01:19:28] of this L height G Theta um and then [01:19:31] show [01:19:33] show exponential convergence [01:19:39] I have to say that this you know the [01:19:42] I have to say that this you know the definition of strong complexity I'm not [01:19:43] definition of strong complexity I'm not sure whether I have really given it in [01:19:45] sure whether I have really given it in this course [01:19:46] this course um this is a stronger notion of [01:19:48] um this is a stronger notion of convexity if you haven't heard of it [01:19:50] convexity if you haven't heard of it um You probably don't you know it's not [01:19:52] um You probably don't you know it's not super essential for this course but if [01:19:54] super essential for this course but if you have heard of it you know what kind [01:19:56] you have heard of it you know what kind of what kind of things I'm talking about [01:19:58] of what kind of things I'm talking about because this the analyzing a this is [01:20:00] because this the analyzing a this is analyzing how to optimize a convex [01:20:02] analyzing how to optimize a convex function it does require a little bit of [01:20:04] function it does 
require a little bit of optimization background uh at least it [01:20:07] optimization background uh at least it took like on a conceptual level you you [01:20:09] took like on a conceptual level you you can imagine there are many different [01:20:11] can imagine there are many different ways to analyze all organizations for [01:20:14] ways to analyze all organizations for um regression so so strong complexity is [01:20:17] um regression so so strong complexity is the stronger version of convexity and [01:20:18] the stronger version of convexity and you can somewhat use that to get the [01:20:21] you can somewhat use that to get the very fast convergence rate exponential [01:20:23] very fast convergence rate exponential means every time you Decay the error by [01:20:26] means every time you Decay the error by a constant Factor so that you get [01:20:28] a constant Factor so that you get exponential decay of the errors and [01:20:30] exponential decay of the errors and another way to do this is that you don't [01:20:33] another way to do this is that you don't use the strong convex teeth [01:20:35] use the strong convex teeth because sometimes you actually don't [01:20:36] because sometimes you actually don't have the strong complexity in certain [01:20:38] have the strong complexity in certain cases so you don't use the strong [01:20:40] cases so you don't use the strong complexity but only use the smoothness [01:20:49] the smoothness means that you have a [01:20:50] the smoothness means that you have a bonding technology derivative [01:20:52] bonding technology derivative um and again if you have taken some [01:20:55] um and again if you have taken some courses about optimization then this [01:20:57] courses about optimization then this would make a lot of sense probably [01:20:58] would make a lot of sense probably because there are different ways to [01:21:00] because there are different ways to analyze optimization sometimes you only [01:21:02] analyze 
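Returning to approach (i): the claim "strong convexity gives exponential convergence" can be illustrated with a minimal numeric sketch. This is my own toy example, not the proof from the lecture; the matrix sizes, the seed, and the step-size rule 1/L are arbitrary demo choices. For a strongly convex quadratic, the simplest stand-in for L̂(g_θ), gradient descent contracts the distance to the minimizer by a constant factor at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly convex quadratic stand-in for the linearized loss:
# L(theta) = 0.5 * ||A theta - b||^2, whose Hessian A^T A has smallest
# eigenvalue mu > 0 (strong convexity) and largest eigenvalue L_s (smoothness).
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]   # the unique minimizer

H = A.T @ A
eigs = np.linalg.eigvalsh(H)                        # ascending eigenvalues
mu, L_s = eigs[0], eigs[-1]
eta = 1.0 / L_s                                     # step size set from the smoothness constant

theta = np.zeros(5)
errs = []
for _ in range(50):
    theta = theta - eta * A.T @ (A @ theta - b)     # one gradient step
    errs.append(np.linalg.norm(theta - theta_star))

# Exponential convergence: the error contracts by at least (1 - mu/L_s) per step.
ratios = [errs[t + 1] / errs[t] for t in range(10)]
print(max(ratios), 1 - mu / L_s)
```

The contraction follows because each step maps θ_t − θ* to (I − ηH)(θ_t − θ*), and with η = 1/L_s every eigenvalue of I − ηH lies in [0, 1 − μ/L_s].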
[01:21:02] When you only have smoothness, you have a different kind of analysis. And based on these two approaches, you can get two different proofs for (b) as well. We are only going to talk about (a), and more specifically about (i), the first approach. For this approach, no prior knowledge is required: you probably won't follow exactly what I'm saying at this conceptual level, but the actual proof doesn't require prior knowledge, and it's actually pretty intuitive by itself as well. So I think we're going to go through the concrete analysis next lecture, but before ending this lecture, let me make another remark which I think is useful; in some sense it's more useful for the second approach, but it's also useful for the first. So here is
an interesting observation, or maybe an intuition, you can say, and it is particularly useful for (ii). [01:22:25] At any θ_t, suppose you take the Taylor expansion with reference point θ_t; so now we are not taking the expansion at θ_0, we take the Taylor expansion at θ_t. You can define

    g_{t,θ}(x) = f_{θ_t}(x) + ⟨∇_θ f_{θ_t}(x), θ − θ_t⟩,

which is a linear function of θ. And then you can consider the gradient of L̂(f_θ) evaluated at θ_t. This is the gradient you are actually taking, because what you really care about is optimizing f, but it is actually the same as the gradient of the loss of this Taylor expansion at the same point:

    ∇_θ L̂(f_θ) |_{θ = θ_t} = ∇_θ L̂(g_{t,θ}) |_{θ = θ_t}.

(Note there are two t's here: θ_t is the point, and the subscript t on g indicates that the Taylor expansion is also taken at the reference point θ_t.) [01:24:00] So why is this the case? If you want, you can take the derivative and verify it, but fundamentally it is just saying that f and g_t agree up to first order at θ_t; that is what a Taylor expansion is. And if they agree up to first order at θ_t, then this implies that L̂(f_θ) and L̂(g_{t,θ}) also agree up to first order at θ_t. So that's why. So what does this really mean? It means that gradient descent on this fixed function L̂(f_θ), where you are taking gradients with respect to the actual f, is the same as taking online gradient descent, which I haven't defined yet.
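The equality of the two gradients at θ_t is easy to check numerically. Below is a small sketch with a toy model f_θ(x) = tanh(θᵀx) of my own choosing (not the network from the lecture); both objectives are differentiated by central finite differences at θ_t, so neither gradient is assumed in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (my own stand-in, not the lecture's network): f_theta(x) = tanh(theta . x).
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
theta_t = rng.normal(size=3)

def loss_f(theta):
    # L-hat(f_theta): squared loss of the nonlinear model.
    return np.mean((np.tanh(X @ theta) - y) ** 2)

# Linearization at the reference point theta_t (Taylor expansion around theta_t, not theta_0):
# g_t(theta, x_i) = f(theta_t, x_i) + <grad_theta f(theta_t, x_i), theta - theta_t>.
f_t = np.tanh(X @ theta_t)
Phi_t = (1 - f_t ** 2)[:, None] * X          # row i: gradient of f_theta(x_i) at theta_t

def loss_g(theta):
    # L-hat(g_{t,theta}): squared loss of the linearized model.
    return np.mean((f_t + Phi_t @ (theta - theta_t) - y) ** 2)

def num_grad(L, theta, eps=1e-6):
    # Central finite differences, so no analytic gradient is taken on faith.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return g

gf = num_grad(loss_f, theta_t)
gg = num_grad(loss_g, theta_t)
print(np.max(np.abs(gf - gg)))  # agreement up to finite-difference error
```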
[01:25:33] So let me define it: write down a sequence of changing objectives

    L̂(g_{0,θ}), L̂(g_{1,θ}), …, L̂(g_{t,θ}).

What does online gradient descent really mean? It just means that you have a sequence of functions, and every time you get a new function, you take the gradient of that new function and take one step. That's online gradient descent. So basically you are saying: taking gradient descent on this fixed function L̂(f_θ) is the same as taking gradient updates with respect to a sequence of changing functions. And this is actually how the second approach really works, so it means that you can use online learning techniques. In this course I'm not planning to talk about online learning, but online learning is trying to deal with exactly the case where you have a changing sequence of functions: you are not optimizing a single function, you have, for example, a changing distribution, or a changing environment, or a changing loss function, whatever. So there is a rich literature on how you analyze optimization when you have a sequence of changing loss functions, and that is exactly what this is about: you have a sequence of changing loss functions, and if you can analyze that, you can analyze the original case. Now, there are also special structures in these loss functions, because they are all somewhat similar to each other: they are all Taylor expansions with respect to reference points that lie in a small region. So you can also leverage that additional information.
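To see the online-gradient-descent picture in code, the sketch below (using the same toy tanh model assumption as my earlier snippet, mine rather than the lecture's) runs plain gradient descent on the fixed objective L̂(f_θ) and, in parallel, re-linearizes at each iterate θ_t and takes one step on the changing objective L̂(g_{t,θ}). The two trajectories coincide up to finite-difference error.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
eta = 0.1

def f(theta):
    return np.tanh(X @ theta)

def jac(theta):
    # Row i is grad_theta f_theta(x_i) for the toy model f_theta(x) = tanh(theta . x).
    return (1 - np.tanh(X @ theta) ** 2)[:, None] * X

def grad_loss_f(theta):
    # Analytic gradient of the fixed objective L-hat(f_theta) = mean_i (f_theta(x_i) - y_i)^2.
    return (2 / len(y)) * jac(theta).T @ (f(theta) - y)

def grad_loss_gt(theta, theta_t, eps=1e-6):
    # Finite-difference gradient of the changing objective L-hat(g_{t,theta}), where
    # g_{t,theta}(x_i) = f_{theta_t}(x_i) + <grad f_{theta_t}(x_i), theta - theta_t>.
    f_t, Phi_t = f(theta_t), jac(theta_t)

    def loss(th):
        return np.mean((f_t + Phi_t @ (th - theta_t) - y) ** 2)

    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta_gd = rng.normal(size=3)
theta_online = theta_gd.copy()
for _ in range(20):
    theta_gd = theta_gd - eta * grad_loss_f(theta_gd)                            # GD on L-hat(f)
    theta_online = theta_online - eta * grad_loss_gt(theta_online, theta_online)  # online GD step
print(np.max(np.abs(theta_gd - theta_online)))  # tiny: the trajectories coincide
```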
[01:27:39] So yeah, this is Chapter 10 in the lecture notes, but I just don't think we have time to go there this quarter. Okay, I think I'm already five minutes late. Next lecture we are going to talk about approach one, which is more self-contained, and also kind of cleaner to some extent. Okay, maybe just one last comment: there are many different neural tangent kernel papers, and I'm probably not being super comprehensive, but I think most of them are basically a combination of these several ingredients. One thing is that you have to establish this third step, about optimization, and there are two large ways to do it, maybe with some even subtler underlying differences. And you also have to establish the first two properties, and those are properties not about optimization; they are about your parametrization of the function plus your initialization. There you can also have a bunch of different flexibilities: you can change the scaling, you can change the width, you can even change the architecture in certain cases, to make it more efficient or less efficient. So I don't want to have a very comprehensive discussion of this NTK, just because there are so many limitations, but I think it's a useful thing to know, given that there is so much work on it, and there are indeed some nice ideas there. Okay, cool, so I guess I'll continue next Wednesday.

================================================================================
LECTURE 013
================================================================================
Stanford CS229M - Lecture 14: Neural Tangent Kernel, Implicit regularization of gradient descent
Source:
https://www.youtube.com/watch?v=xpT1ymwCk9w
---
Transcript

[00:00:05] Uh, okay, hello everyone, let's get started. So last time, what we did was NTK, the neural tangent kernel approach, and today we are going to continue with that, to finish the last part of the neural tangent kernel approach, and then we'll talk about the so-called implicit regularization effect. So briefly, recall what we did last time. We claimed that there are three steps in this analysis using the NTK approach. One step is that you say f_θ(x) is close to g_θ(x) in some neighborhood. (Oh wait, oh sorry; there are too many steps in this setup, so I always forget some step. The worst step to forget is to record: anything else, you can remind me, but if I forget to record, nobody would remind me, so that's the thing I check every time. Okay, cool, I think this is recording, and you can hear me on Zoom, right?)

[00:02:07] So last time we saw that in some neighborhood of θ around θ_0, you have an accurate approximation. Recall that the neighborhood B was something whose size depends on σ: it was defined as the set of θ close to θ_0, within distance something like √n/σ. And in this neighborhood, the approximation error is something like β(√n/σ)², that is, βn/σ². That's what we had. And also, step two: in this neighborhood, there exists a global minimum with zero error. (Nobody seems to respond to my request to confirm that the audio is working, and there are only four people here... oh, okay, you can hear me, thank you so much; it's just that sometimes I get paranoid about this.) All right, so this is what we proved last time, and we discussed that this quantity, β/σ², is the key thing: if it goes to zero, then that's great, because your error becomes smaller and smaller.
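The shape of that step-1 bound is just a second-order Taylor remainder: if the Hessian of the model is bounded in operator norm by β, then the linearization error at distance r from the reference point is at most (1/2)βr², and plugging in r = √n/σ gives the βn/σ² scaling. Here is a minimal numeric check of the generic fact, using a toy function f(θ) = cos(wᵀθ) of my own (not the lecture's network); its Hessian norm is at most ‖w‖².

```python
import numpy as np

rng = np.random.default_rng(3)

# Generic Taylor fact: if ||Hessian of f|| <= beta everywhere, then the
# linearization error at distance r from theta0 is at most 0.5 * beta * r^2.
w = rng.normal(size=4)
beta = float(w @ w)            # Hessian of cos(w . theta) is -cos(w . theta) w w^T

def f(theta):
    return np.cos(w @ theta)

theta0 = rng.normal(size=4)
grad0 = -np.sin(w @ theta0) * w

worst = -np.inf
for _ in range(1000):
    d = rng.normal(size=4)
    d *= rng.uniform(0.0, 1.0) / np.linalg.norm(d)     # random point within radius 1
    remainder = abs(f(theta0 + d) - (f(theta0) + grad0 @ d))
    worst = max(worst, remainder - 0.5 * beta * (d @ d))
print(worst)  # <= 0 up to rounding: the quadratic bound holds on every sample
```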
[00:04:22] (Oh, I see, I cannot hear you, that's the problem; I think my speaker volume is very low. Okay, cool, thanks so much.) [00:04:41] And now, the third step, as we discussed last time. We have discussed various cases, two cases, where this β/σ² goes to zero, and today we are going to talk about the third step, where we show that optimizing L̂(f_θ), with the actual neural network, is similar to optimizing L̂(g_θ). In some sense the only thing you care about is an analysis of the optimization of f_θ, but you want to establish this kind of relationship so that you can make the optimization analysis easier. And we also briefly discussed what we do with this optimization: I think there are two ways to deal with the L̂(g_θ), one is something like using strong convexity, and the other is using only the smoothness.
and today we're going to focus on [00:05:54] and and today we're going to focus on this case which doesn't require too much [00:05:56] this case which doesn't require too much of the background about optimization [00:06:00] all right so now let's go into the [00:06:02] all right so now let's go into the detail [00:06:14] so by the way I think a small remark [00:06:15] so by the way I think a small remark before I go into the detail so why you [00:06:17] before I go into the detail so why you care about the steps three in some sense [00:06:20] care about the steps three in some sense right so a priority there is no another [00:06:23] right so a priority there is no another reason you know okay so there's one [00:06:25] reason you know okay so there's one reason which is that you want to [00:06:26] reason which is that you want to understand what happens when you [00:06:28] understand what happens when you optimize over in your Networks right but [00:06:31] optimize over in your Networks right but suppose the solution the understanding [00:06:33] suppose the solution the understanding suppose at some point like we are at the [00:06:36] suppose at some point like we are at the moment where we want to prove this [00:06:38] moment where we want to prove this number three but we haven't succeeded [00:06:40] number three but we haven't succeeded but you probably would question yourself [00:06:42] but you probably would question yourself you know why I care about such a answer [00:06:44] you know why I care about such a answer right even I prove this answer why it's [00:06:46] right even I prove this answer why it's interesting and the answer is yes it's [00:06:49] interesting and the answer is yes it's not that interesting because you know if [00:06:51] not that interesting because you know if you prove that optimizing overnight work [00:06:53] you prove that optimizing overnight work is the same as optimizing some linear [00:06:56] is the same as optimizing some linear 
model like a kernel method then why not [00:07:00] model like a kernel method then why not just using kernel method right so so and [00:07:03] just using kernel method right so so and and it turns out that that's indeed true [00:07:05] and it turns out that that's indeed true like you know if you use kernel method [00:07:07] like you know if you use kernel method um you are not it's not going to work if [00:07:09] um you are not it's not going to work if you're optimize the right work in this [00:07:10] you're optimize the right work in this way in this particular initialization [00:07:12] way in this particular initialization with this particular learning universe [00:07:14] with this particular learning universe and so forth it wouldn't work as well [00:07:16] and so forth it wouldn't work as well either so so in some sense this is um [00:07:20] either so so in some sense this is um um so this whole theorem the value of [00:07:22] um so this whole theorem the value of this theorem is more is only for showing [00:07:24] this theorem is more is only for showing that under certain regime optimizing [00:07:26] that under certain regime optimizing overnight work is the same as optimizing [00:07:28] overnight work is the same as optimizing over linear model but but there's no any [00:07:31] over linear model but but there's no any like bigger impact in some sense like [00:07:35] like bigger impact in some sense like just because you are optimizing the [00:07:36] just because you are optimizing the invite working a weird regime which is [00:07:38] invite working a weird regime which is the same as which is the same as [00:07:40] the same as which is the same as optimizing kernels and in this regime [00:07:43] optimizing kernels and in this regime nothing works very well [00:07:45] nothing works very well but still you know for the technical [00:07:47] but still you know for the technical reason let's do try to go through this [00:07:48] reason let's do try to go through 
this you know it's not super complicated and [00:07:50] you know it's not super complicated and I think in some sense the techniques is [00:07:52] I think in some sense the techniques is kind of also it's kind of useful because [00:07:55] kind of also it's kind of useful because um partly because I think it's somewhat [00:07:57] um partly because I think it's somewhat surprising because at the first place [00:07:59] surprising because at the first place you wouldn't probably believe it right [00:08:00] you wouldn't probably believe it right why you believe that optimizing the [00:08:02] why you believe that optimizing the right work in any case would be similar [00:08:04] right work in any case would be similar to optimizing kernels and this shows [00:08:06] to optimizing kernels and this shows that that's possible in some cases even [00:08:08] that that's possible in some cases even though that case is not that [00:08:11] though that case is not that um informative or that that kind of like [00:08:12] um informative or that that kind of like useful [00:08:13] useful okay so [00:08:15] okay so um let's analyze um um so we start with [00:08:18] um let's analyze um um so we start with you know the first step we start with [00:08:19] you know the first step we start with like analyzing the the optimization with [00:08:23] like analyzing the the optimization with G Theta [00:08:31] so [00:08:32] so and and this is really just a you know [00:08:35] and and this is really just a you know um linear regression right it's really [00:08:38] um linear regression right it's really understanding optimization of linear [00:08:40] understanding optimization of linear regression [00:08:41] regression so you take so the problem is really [00:08:44] so you take so the problem is really you're taking mean of this y back minus [00:08:47] you're taking mean of this y back minus V Delta Theta [00:08:50] V Delta Theta 2 num Square [00:08:51] 2 num Square uh with GD [00:08:56] and just to 
briefly recall the notation [00:08:58] and just to briefly recall the notation so Phi is this feature Matrix which is [00:09:01] so Phi is this feature Matrix which is of Dimension n by P [00:09:04] of Dimension n by P so this is n by P Matrix and each row is [00:09:07] so this is n by P Matrix and each row is something like write F Theta [00:09:10] something like write F Theta 0 x [00:09:12] 0 x I [00:09:14] I trans transpose I see you put all of [00:09:17] trans transpose I see you put all of these as rows you know what exactly this [00:09:20] these as rows you know what exactly this three Matrix is you know it doesn't [00:09:21] three Matrix is you know it doesn't matter so much anymore for the for the [00:09:24] matter so much anymore for the for the rest of the discussion because V is just [00:09:26] rest of the discussion because V is just a matrix and Delta Theta is the [00:09:28] a matrix and Delta Theta is the difference between Theta and Theta zero [00:09:30] difference between Theta and Theta zero and we're just optimizing over Delta [00:09:32] and we're just optimizing over Delta Theta [00:09:33] Theta and what you do is we just take winning [00:09:35] and what you do is we just take winning descent and you'd say [00:09:38] descent and you'd say um then the grading descent [00:09:41] um then the grading descent is you take the [00:09:44] is you take the ingredient and the gradient is V [00:09:46] ingredient and the gradient is V transpose [00:09:48] transpose transpose times y back [00:09:51] transpose times y back minus V Delta Theta [00:09:55] minus V Delta Theta right this is the green update and [00:09:59] right this is the green update and one of the kind of the features of this [00:10:01] one of the kind of the features of this analysis is that you are looking at the [00:10:03] analysis is that you are looking at the convergence [00:10:06] the changes [00:10:09] the changes in the output space [00:10:13] and in some sense this is kind of like 
[00:10:15] In some sense this is, to some extent, the spirit of the kernel method: you don't look at the parameter space, you look at the output space, and when you look at the output space it's kind of like looking at the function space. What does that mean? It means you look at the output at time t + 1, ŷ_{t+1}, defined to be Φ Δθ_{t+1} (and let's define the version with t the same way, ŷ_t = Φ Δθ_t), and you look at how the residual changes over time: you compare your output at time t with the target output y, and ask how this changes over time.

[00:11:08] This is just the definition, and you plug in the definition of Δθ_{t+1}, which is Δθ_t − η Φᵀ(Φ Δθ_t − y):

    ŷ_{t+1} − y = Φ(Δθ_t − η Φᵀ(Φ Δθ_t − y)) − y.

[00:11:28] Now this requires some rearrangement to make it look cleaner. How do I rearrange it? First group everything involving Δθ_t: if you look at what is multiplied in front of Φ Δθ_t, you get (I − η Φ Φᵀ), and if you look at what is multiplied in front of y, you get −(I − η Φ Φᵀ).

[00:12:49] Interestingly, you can then write this as

    ŷ_{t+1} − y = (I − η Φ Φᵀ) Φ Δθ_t − (I − η Φ Φᵀ) y = (I − η Φ Φᵀ)(ŷ_t − y).

[00:13:30] All of these are basically standard calculations; if you have taken some version of a linear regression course, you have probably seen this. So that's the update, the recursion for the residual of the output.

[00:13:54] And you can see what happens: your residual from the previous round gets multiplied by this matrix, and this matrix is smaller than the identity, because it is I minus something, and this something, Φ Φᵀ, is a PSD matrix. So you are shrinking your residual in some way every time, and you can quantify how fast you are shrinking.
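The residual recursion derived above is exact algebra, not an approximation, so it can be verified with a single gradient step. The sizes and the random Φ, y, and iterate below are arbitrary illustrative choices.

```python
import numpy as np

# One-step check of the residual recursion:
#   yhat_{t+1} - y = (I - eta * Phi @ Phi.T) @ (yhat_t - y)
rng = np.random.default_rng(1)
n, p = 8, 30
Phi = rng.standard_normal((n, p))
y = rng.standard_normal(n)
eta = 1e-3
dtheta = rng.standard_normal(p)                           # current iterate Delta theta_t

dtheta_next = dtheta + eta * Phi.T @ (y - Phi @ dtheta)   # one GD step
lhs = Phi @ dtheta_next - y                               # yhat_{t+1} - y
rhs = (np.eye(n) - eta * Phi @ Phi.T) @ (Phi @ dtheta - y)
print(np.allclose(lhs, rhs))  # True: the recursion holds exactly
```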
[00:14:19] When η is at most 1/(2 σ_max(Φ Φᵀ)), and let's call this quantity 1/(2τ²), where τ² denotes σ_max(Φ Φᵀ), then you get convergence: I − η Φ Φᵀ can be shown to have operator norm at most 1 − η σ², where σ is the smallest singular value of Φ.

[00:15:11] I'll show this in a moment, but suppose you have it. Then

    ‖ŷ_{t+1} − y‖₂ ≤ ‖I − η Φ Φᵀ‖_op · ‖ŷ_t − y‖₂ ≤ (1 − η σ²) ‖ŷ_t − y‖₂,

and if you unroll the recursion, you get

    ‖ŷ_{t+1} − y‖₂ ≤ (1 − η σ²)^{t+1} ‖ŷ_0 − y‖₂.

So you have an exponential decay of the error.

[00:16:04] (In response to a question:) Yes, that's a good point: σ here is the smallest singular value of Φ.

[00:16:13] Let's prove this right now. Say this claim is number (1); given (1) you have all of this exponential decay, so now let's prove (1). Intuitively, it's just that Φ Φᵀ is PSD, so you subtract something PSD from the identity and the operator norm is at most one. But we need to know exactly how small it is: it has to be strictly less than one, and that's why we needed inequality (1). There are multiple ways to see it, but the way I tend to think about it is this. First of all, σ = σ_min(Φ), which is also σ_min((Φ Φᵀ)^{1/2}); this is just a standard property of singular values.

[00:17:27] One way to think about it: look at the eigenvalues of Φ Φᵀ (eigenvalues or singular values, it doesn't matter, because this is a PSD matrix). Say the eigenvalues are τ₁² ≥ … ≥ τ_n². Then τ₁² = τ², that's the definition via σ_max, and τ_n² = σ², that's my definition via σ_min. The matrix we care about, I − η Φ Φᵀ, then has eigenvalues (equivalently, singular values)

    1 − η τ₁², …, 1 − η τ_n².

[00:18:30] I'm not sure whether this requires a proof; there are many ways to see it. The way I find easiest is to take the SVD: Φ = U Σ Vᵀ, where Σ is the diagonal matrix with τ₁ up to τ_n. Then

    I − η Φ Φᵀ = I − η U Σ² Uᵀ = U Uᵀ − η U Σ² Uᵀ = U (I − η Σ²) Uᵀ,

using I = U Uᵀ. This is the eigendecomposition (or SVD) of I − η Φ Φᵀ, and that's why what's inside, the diagonal of I − η Σ², gives the eigenvalues or singular values of the resulting matrix.

[00:19:38] Now you bound this: if you care about the operator norm, it's the largest absolute value of the eigenvalues, so

    ‖I − η Φ Φᵀ‖_op ≤ max_j |1 − η τ_j²|.

[00:20:03] The choice of η is designed exactly to guarantee that you never get to the negative side: you make sure η ≤ 1/(2τ₁²), and τ₁ is the largest one, so even the largest τ will not make 1 − η τ_j² negative. Everything stays positive, so the max is just 1 − η τ_n², which equals 1 − η σ². Okay, sounds good.

[00:20:43] In some sense these are the basics of optimization, but this course doesn't require an optimization background, which is why I'm providing some basic tools here.

[00:20:59] Okay, so basically we are done with the analysis of this linear regression: once you have this, you know that your error is decaying exponentially, very fast.
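Claim (1) and the eigenvalue argument behind it can be checked numerically. The sizes below are illustrative assumptions; p > n keeps the smallest eigenvalue of Φ Φᵀ strictly positive, which is what makes the operator norm strictly less than one.

```python
import numpy as np

# Check of claim (1): with eta = 1/(2 * tau_1^2), the eigenvalues of
# I - eta * Phi @ Phi.T are 1 - eta * tau_j^2, all strictly positive, and the
# operator norm equals 1 - eta * sigma^2 with sigma = sigma_min(Phi).
rng = np.random.default_rng(2)
n, p = 6, 40
Phi = rng.standard_normal((n, p))
K = Phi @ Phi.T
tau_sq = np.linalg.eigvalsh(K)        # ascending: tau_n^2 <= ... <= tau_1^2
eta = 1.0 / (2 * tau_sq[-1])          # eta = 1/(2 * tau_1^2)

M = np.eye(n) - eta * K
op_norm = np.linalg.norm(M, 2)        # spectral norm
print(op_norm)                        # equals 1 - eta * tau_sq[0], i.e. 1 - eta * sigma^2
```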
[00:21:12] After a sufficient number of iterations, your error is small. So, maybe let's call the exponential-decay bound (2). From (2), after T at most something like log(1/ε) / (η σ²) iterations, your error ‖ŷ_T − y‖ is less than ε times the initial error. And you can take ε to be very small, because the number of iterations depends only logarithmically on 1/ε.

[00:21:59] Okay, so this is the analysis for g, and now let's talk about the analysis for f. You'll see that the analysis for f is very similar to this, but with some tweaks. Maybe let me state a theorem so that we have a formal statement somewhere. There exists a constant, say c₀ in (0, 1), such that when this key quantity (involving β over σ²) is less than c₀, then for sufficiently small η, where "sufficiently small" could depend on β, σ, maybe the dimension p, and so forth (you could give a concrete bound for how small η has to be, but I want to hide these details so that it's not too complicated for the exposition), in T = O(log(1/ε) / (η σ²)) steps, the empirical loss of f_{θ_T} is also less than ε. So the empirical loss for the neural network also achieves this error ε.

[00:23:43] How do we do this? We have kind of discussed the intuition already.
[00:23:50] The intuition is that you try to relate this to the g case, and by relating to g, what I really mean is that you try to follow the proof you had before, imitating it as much as possible; of course there will be some differences, and then you deal with the differences.

[00:24:10] By the way, this is a proof sketch, because I'm going to omit some small technical things which are not super important. The important thing is that you have a changing Φ, in some sense. This is the difference with neural networks compared to the linear regression case, and you'll see why. Suppose you define Φ_t (Φ with superscript t) to be the feature matrix, the kernel, at time t:

[00:25:02] this is the NTK kernel when you Taylor expand at time t. And if you Taylor expand at time t, then (I think we have discussed this before, but now you can see it explicitly) the gradient with respect to the Taylor approximation is the same as the gradient with respect to the original neural network. This is what I wrote in the remark earlier: if you Taylor expand at time t, then the gradient with respect to the network is the same as the gradient with respect to the linear function, just because these two things agree at this point up to first order, so even when you compose with a loss function, they still agree up to first order.

[00:25:58] And here you can even see it explicitly. Suppose you write down the gradient of the loss function at θ_t. By the chain rule, you get

    ∇L̂(θ_t) = −(1/n) Σᵢ (yᵢ − f_{θ_t}(xᵢ)) ∇f_{θ_t}(xᵢ).

You can verify this without even using the remark; this is just the chain rule. Let me write it more explicitly: the i-th term involves yᵢ − ŷᵢᵗ, where ŷᵢᵗ = f_{θ_t}(xᵢ), times the gradient ∇f_{θ_t}(xᵢ). If you write this in vector-matrix multiplication form, the gradients ∇f_{θ_t}(xᵢ) are exactly the rows of Φ_t, so you get

    ∇L̂(θ_t) = −(1/n) Φ_tᵀ (y − ŷ_t),

and there's a 1/n in front of it.
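The chain-rule computation just written can be checked numerically. The tiny tanh model below is an illustrative stand-in for the lecture's network (an assumption, not the lecture's architecture); both the feature matrix Φ_t and the gradient of the loss are approximated by finite differences and compared against the matrix formula.

```python
import numpy as np

# Check: for L(theta) = 1/(2n) * sum_i (y_i - f_theta(x_i))^2,
#   grad L(theta_t) = -(1/n) * Phi_t.T @ (y - yhat_t),
# where row i of Phi_t is grad f_theta(x_i).
rng = np.random.default_rng(3)
n, d, p = 5, 3, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
a = rng.standard_normal(p)                   # fixed output weights (not trained)

def f(theta, x):                             # f_theta(x) = a^T tanh(W x), theta = vec(W)
    return a @ np.tanh(theta.reshape(p, d) @ x)

def loss(theta):
    return 0.5 / n * sum((y[i] - f(theta, X[i])) ** 2 for i in range(n))

theta = 0.5 * rng.standard_normal(p * d)
h = 1e-6
E = np.eye(p * d)
# Phi_t by forward differences: row i approximates grad f_theta(x_i)
Phi_t = np.array([[(f(theta + h * E[j], X[i]) - f(theta, X[i])) / h
                   for j in range(p * d)] for i in range(n)])
yhat = np.array([f(theta, X[i]) for i in range(n)])
grad_formula = -(1.0 / n) * Phi_t.T @ (y - yhat)
grad_fd = np.array([(loss(theta + h * E[j]) - loss(theta)) / h for j in range(p * d)])
print(np.max(np.abs(grad_formula - grad_fd)))  # small (finite-difference error only)
```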
front of it [00:27:27] so that's the gradient [00:27:31] and that means that the update rule [00:27:34] and that means that the update rule facility [00:27:36] facility I think I somehow say that here [00:27:40] I think I somehow say that here uh yeah [00:27:41] uh yeah I guess I'm going to use CLT instead of [00:27:44] I guess I'm going to use CLT instead of data 30 they are the same right they are [00:27:46] data 30 they are the same right they are just only different Optical translation [00:27:48] just only different Optical translation so 30 plus 1 is equals to set at T minus [00:27:52] so 30 plus 1 is equals to set at T minus ETA [00:27:53] ETA times this gradient [00:28:00] and this is equals to Theta T minus one [00:28:03] and this is equals to Theta T minus one over n times VT [00:28:05] over n times VT while back wire had tea [00:28:10] and [00:28:17] okay so now there is a little bit kind [00:28:19] okay so now there is a little bit kind of like small [00:28:22] um thing here so suppose you say [00:28:25] um thing here so suppose you say you give a name called this BT so then [00:28:28] you give a name called this BT so then this is [00:28:30] let's say [00:28:35] etern over and [00:28:37] etern over and so this is Theta t [00:28:40] so this is Theta t ability [00:28:41] ability so I'm going to try to [00:28:44] so I'm going to try to um okay so what's the our goal our goal [00:28:46] um okay so what's the our goal our goal is try to do kind of recursion for the [00:28:48] is try to do kind of recursion for the Y's that's what we did before right a [00:28:50] Y's that's what we did before right a good question for the wise [00:28:51] good question for the wise and how do I get a recursion for the Y's [00:28:53] and how do I get a recursion for the Y's I have to look at how that why changes [00:28:56] I have to look at how that why changes right what's what is the why this is one [00:28:59] right what's what is the why this is one entry of the y y hat uh I 
the output at [00:29:02] entry of the y y hat uh I the output at time t plus one [00:29:05] time t plus one um but I also write this as a something [00:29:07] um but I also write this as a something like related to the time my time the the [00:29:10] like related to the time my time the the function output item T so how do I do [00:29:14] function output item T so how do I do that this is a non-linear function [00:29:15] that this is a non-linear function before we just you know do a linear [00:29:17] before we just you know do a linear multiplication because before we just [00:29:19] multiplication because before we just know that [00:29:20] know that uh if this is G If This Were G then this [00:29:23] uh if this is G If This Were G then this is just equals to Phi times Theta Theta [00:29:25] is just equals to Phi times Theta Theta t plus one right if if this F was was G [00:29:28] t plus one right if if this F was was G but if it but because this is non-linear [00:29:31] but if it but because this is non-linear we have to do something so what we do is [00:29:33] we have to do something so what we do is we try to Taylor expand [00:29:35] we try to Taylor expand at time t [00:29:37] at time t so that you can have a relationship [00:29:39] so that you can have a relationship between [00:29:40] between uh Theta T and Theta t plus one [00:29:42] uh Theta T and Theta t plus one so if a Taylor expanded you have to [00:29:44] so if a Taylor expanded you have to write the gradient of F7 [00:29:47] write the gradient of F7 t x i [00:29:50] t x i and times the differences between [00:29:54] and times the differences between the two iterate and plus some something [00:29:58] the two iterate and plus some something high order right and and if you look at [00:30:02] high order right and and if you look at what's the difference the difference is [00:30:03] what's the difference the difference is a function of ETA right the difference [00:30:05] a function of ETA right the 
[00:30:07] ...the difference θ_{t+1} − θ_t is exactly −η·b_t. That's why we can write

f_{θ_{t+1}}(x_i) = f_{θ_t}(x_i) + ⟨∇_θ f_{θ_t}(x_i), −η·b_t⟩ + (second-order term),

where the second-order term is quadratic in the difference. If I write it somewhat informally — and more formally I can write this as well — that term is η² times something, because the difference has an η in it, so when you square it you get η². This is a term I want to ignore. I'm trying very hard here just to ignore this term, and the reason to ignore it is precisely that it's η²: the constant M_t multiplying it is not a function of η, so if you fix everything else and just take η to be very small — take η → 0 — then the η²·M_t term is negligible. We could do this more formally, but I don't want to go into so much detail: there's a way to bound M_t by something, whatever you want, and then you just say that if η is small enough, η² times that bound becomes negligible. That's basically how you formally do it.

[00:31:43] So if you ignore this second-order term, everything becomes simple. For now, let's ignore it. Then, putting this equation — call it equation (3) — in vector form, what you have is

ŷ_{t+1} = ŷ_t − η·Φ_t b_t + η²·(constant),

where Φ_t is the matrix whose rows are the gradients ∇_θ f_{θ_t}(x_i)^⊤. I'm going to keep the η² term around just for a little bit, but essentially I want to ignore it. And what is b_t? b_t is the gradient of the loss,

b_t = (1/n)·Φ_t^⊤(ŷ_t − y).

(Wait — I'm not sure why I'm missing a constant here... oh, I see: there's some mismatch with my notes, but it doesn't matter. For the linear regression case I didn't have the 1/n in the loss function, and now I do, and that's why it's a little mismatched — it's not a fundamental difference. In fact, let me ignore the 1/n, because you can redefine the loss function however you want; say we just don't have the 1/n in the loss.)

[00:34:43] So then you have this, and if you subtract y from both sides and reorganize, you get

ŷ_{t+1} − y = (I − η·Φ_t Φ_t^⊤)(ŷ_t − y),

plus, technically, you still have the η² term,
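To see the "ignore the η² term" step concretely, here is a minimal numerical sketch (my own illustration, not from the lecture) with a tiny two-layer tanh network: the gap between the exact next-step predictions and the linearized update ŷ_t − η·Φ_t b_t shrinks like η².

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 8, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=m * d + m)            # flattened (W, a)

def unpack(t):
    return t[:m * d].reshape(m, d), t[m * d:]

def f(t):                                     # model outputs on the n training points
    W, a = unpack(t)
    return np.tanh(X @ W.T) @ a

def jacobian(t):                              # Phi_t: the n x p Jacobian of f at theta
    W, a = unpack(t)
    H = np.tanh(X @ W.T)                      # n x m hidden activations
    dW = ((1 - H ** 2) * a)[:, :, None] * X[:, None, :]   # d f_i / d W, shape n x m x d
    return np.concatenate([dW.reshape(n, -1), H], axis=1)

errs = []
for eta in (1e-2, 1e-3, 1e-4):
    Phi = jacobian(theta)
    g = Phi.T @ (f(theta) - y) / n            # gradient b_t of the loss 1/(2n)||f - y||^2
    exact = f(theta - eta * g)                # true next-step predictions
    linear = f(theta) - eta * Phi @ g         # first-order (linearized) update
    errs.append(np.linalg.norm(exact - linear))

# each 10x decrease in eta shrinks the gap ~100x, i.e. the ignored term is O(eta^2)
print(errs)
```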
[00:36:01] ...which we don't care about. If you compare this equation with the recursion from before, the only difference is that the matrix is different: before, you were multiplying by the fixed matrix ΦΦ^⊤, and now you are using Φ_t Φ_t^⊤. Everything would be exactly the same if Φ_t equaled Φ — but you don't actually need them to be the same. In our original proof, you only needed this matrix to be smaller: you only needed I − η·ΦΦ^⊤ to be smaller than the identity in operator norm.

[00:36:42] So suppose we ignore the η²·M_t term, because it's second order, and suppose at time t that ‖θ_t − θ_0‖₂ ≤ σ/(4β) — suppose you are not far away from θ_0. Then you know that

‖Φ_t − Φ‖₂ ≤ β·‖θ_t − θ_0‖₂ ≤ σ/4,

by the β-Lipschitzness of Φ(θ) — this is our assumption. And that means σ_min(Φ_t) is not very different from σ_min(Φ):

σ_min(Φ_t) ≥ σ_min(Φ) − σ/4 ≥ (3/4)·σ.

So σ_min(Φ_t) is also good — you still have a lower bound on the smallest singular value, it's just a little bit weaker, up to a constant factor. And that means

‖I − η·Φ_t Φ_t^⊤‖_op ≤ 1 − η·((3/4)·σ)²,

very similar to before. So this sounds great — but there is an assumption here, which is that θ_t is not very far away from θ_0. This is something you cannot take for granted; you have to prove it is the case. That's why we have to argue inductively.
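The singular-value step is Weyl's perturbation inequality, σ_min(Φ_t) ≥ σ_min(Φ) − ‖Φ_t − Φ‖₂; a quick numerical check (matrix sizes here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 10
Phi = rng.normal(size=(n, p))
sigma = np.linalg.svd(Phi, compute_uv=False)[-1]   # sigma_min(Phi)

E = rng.normal(size=(n, p))
E *= (sigma / 4) / np.linalg.norm(E, 2)            # scale the perturbation so ||E||_2 = sigma/4

# Weyl: the smallest singular value drops by at most ||E||_2
sigma_min_pert = np.linalg.svd(Phi + E, compute_uv=False)[-1]
assert sigma_min_pert >= sigma - sigma / 4 - 1e-12  # i.e. >= (3/4) * sigma
print(sigma, sigma_min_pert)
```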
[00:38:32] So basically the only thing left is that we need to inductively prove that ‖θ_t − θ_0‖ is never too big. That's basically the remaining step. And in some sense this is expected, because recall that θ̂ — the global minimizer we constructed in the last lecture — satisfies ‖θ̂ − θ_0‖₂ ≤ O(√n/σ). So there is a global minimum of size √n/σ, and if this is much, much less than σ/(4β) — which happens when β√n/σ² is sufficiently small — then there exists a global minimum within the region of radius σ/(4β). And if there exists a global minimum within this region, why would the iterate ever leave the region? That's why you should somewhat expect that it always stays within this region.

[00:40:12] How do we formally do this? You just do an induction. We know that

(1/√n)·‖ŷ_t − y‖₂ = O(1),

because every entry is on the order of O(1) and you have n entries. (Inductively you can actually show that you have exponential decay of the error, but even if you don't care about that, for every time t you still have this bound.) And if for every time t you have this, then it means

‖Φθ_t − Φθ̂‖₂ ≤ O(√n),

because Φθ̂ = y — θ̂ is the "ground truth" here, the one we constructed last lecture. And then this means

‖θ_t − θ̂‖₂ ≤ O(√n/σ),

which says exactly that your iterate is not very far away from your target θ̂. And you also know that the target θ̂ is not very far away from θ_0:

‖θ̂ − θ_0‖₂ ≤ O(√n/σ),

because this is what we did last time. Then by the triangle inequality,

‖θ_t − θ_0‖₂ ≤ ‖θ_t − θ̂‖₂ + ‖θ̂ − θ_0‖₂ ≤ O(√n/σ),

which is less than σ/(4β) if β/σ² is much, much less than 1/√n. So this is how you inductively maintain the distance — how you inductively show that θ_t is never very far away from θ_0.

[00:43:10] Yeah, the steps sound a little bit complicated, but the intuition is very simple — and there are probably many ways to prove this; I just presented one way. There is already a global minimum there, so there shouldn't be any way for the iterate to leave. What you do is basically say: you have θ̂ here, θ_0 there, and you know the distance between these two is O(√n/σ); and you are optimizing, and in some sense θ̂ is your target, because θ̂ has the best fit — so you are, if anything, moving even closer to θ̂; why would the distance get bigger over time?
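Written out, the inductive chain is (a schematic summary of the steps just described, using the O(·) bounds quoted from last lecture):

```latex
\begin{align*}
\frac{1}{\sqrt n}\,\|\hat y_t - y\|_2 &= O(1)
  \;\Longrightarrow\;
  \|\Phi\theta_t - \Phi\hat\theta\|_2 = O(\sqrt n)
  \qquad (\text{since } \Phi\hat\theta = y),\\
\|\theta_t - \hat\theta\|_2 &\le \frac{O(\sqrt n)}{\sigma}, \qquad
  \|\hat\theta - \theta_0\|_2 \le O\!\Big(\frac{\sqrt n}{\sigma}\Big)
  \quad (\text{last lecture}),\\
\|\theta_t - \theta_0\|_2 &\le \|\theta_t - \hat\theta\|_2 + \|\hat\theta - \theta_0\|_2
  = O\!\Big(\frac{\sqrt n}{\sigma}\Big) \le \frac{\sigma}{4\beta}
  \quad \text{whenever } \frac{\beta}{\sigma^2} \ll \frac{1}{\sqrt n}.
\end{align*}
```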
[00:43:47] So if you look at the iterates, you are somewhat moving toward θ̂. OK — enough of all this. So then we got this inequality, and with this inequality we get

‖ŷ_{t+1} − y‖₂ ≤ (1 − η·((3/4)·σ)²)·‖ŷ_t − y‖₂,

and then you can unroll the recursion to get exponential decay of the error.

[00:44:43] Any questions? ... I think I made a small typo somewhere in the assumptions of the theorem; I need to fix that. I think my assumption should be that this is less than c/√n. And it doesn't really matter much, because you can make things better, as we saw last time.
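The contraction-plus-recursion step can be sanity-checked in the fixed-Φ case (last lecture's linear recursion), which is what the nonlinear argument reduces to. A minimal sketch with illustrative sizes, showing the residual shrinks by at least (1 − η·σ_min(Φ)²) per step:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 40
Phi = rng.normal(size=(n, p))                 # fixed feature / Jacobian matrix
y = rng.normal(size=n)
theta = np.zeros(p)

sig2 = np.linalg.svd(Phi, compute_uv=False)[-1] ** 2   # sigma_min(Phi)^2
eta = 0.5 / np.linalg.norm(Phi, 2) ** 2                # step size below 1/sigma_max^2

res = [np.linalg.norm(Phi @ theta - y)]
for _ in range(200):
    theta -= eta * Phi.T @ (Phi @ theta - y)  # gradient step on 0.5 * ||Phi theta - y||^2
    res.append(np.linalg.norm(Phi @ theta - y))

rate = 1 - eta * sig2                         # per-step contraction factor
assert all(r1 <= rate * r0 + 1e-12 for r0, r1 in zip(res, res[1:]))
print(res[0], res[-1])                        # geometric (exponential) decay of the residual
```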
[00:45:06] As we saw last time, you can make either m bigger or α bigger, and that makes β/σ² arbitrarily small — as small as you want. So it doesn't really matter that much. Any questions?

[00:45:23] [inaudible student question]

[00:45:54] Thank you.
[00:45:57] Let me rephrase your question, and let me know if it's not what you asked. One question you could ask is: do you really rely on the exponential decay, in the kernel case, to get this relationship between the network and the kernel? I think the answer to that is no. The second type of approach — the one I alluded to last time but didn't really go into detail on — doesn't require exponential decay of the error. In that case, for both the kernel and the neural network you can only show some polynomial speed of decay, the error decaying polynomially in t, and you can still establish this relationship. So exponential decay is not that important. I think this is actually something people realized after the first few papers: at the very beginning, the very first papers used this exponential argument, and people thought that it's because you converge so fast that you don't leave the neighborhood. But you can arrange things so that even without exponential decay you still don't leave the neighborhood, because whether you leave the neighborhood probably depends mostly on whether there is a global minimum in the neighborhood. If there is a global minimum in the neighborhood but somehow you cannot converge to it exponentially fast, that's still probably fine — as long as you converge to it eventually. I'm not sure whether that's what you asked.

[00:47:21] [inaudible follow-up]
[00:47:33] But I think here you do want to capture this kind of "staying close" property, right? Right — you do want to say that, and you also want to characterize the dynamics of the networks. If they didn't have the same property, and you could somehow still analyze the optimization of neural networks directly, that would be fine. But the relationship is something that helps us bridge the gap between what we knew and what we didn't know: neural networks are something we didn't know how to analyze, kernels are something we knew, and if they are similar then you can hope to analyze the neural network. So I think that's why we show they are doing something similar.

[00:48:23] OK, all right. I think I have a little bit more to add about the neural tangent kernel.
[00:48:36] I guess we've discussed this many times: the limitation of the neural tangent kernel is that you can at most do as well as the kernel method. So basically the question is: how well can a kernel method work? Are we really characterizing the power of deep learning if deep learning is only doing as well as kernels — is that good or not? I think at least most people believe the answer is that neural networks can do much better things than kernels, and that this characterization of the neural network as a kernel does not capture the true power of the network.

[00:49:27] You can try to demonstrate this in various ways, and there are a lot of papers that try to do this — beyond the NTK approach. If you search for "beyond NTK" or "beyond lazy training" you'll see a bunch of papers, including some of my papers, where we try to analyze deep learning in a different regime. But there is a simple separation if you don't care about optimization performance — if you just care about the power of the regularization, only the statistical aspect: you can fairly easily show that neural networks can do better things than kernels. Here is an example — an example where the kernel is statistically limited.

[00:50:54] In some sense the intuition is that the limitation comes from the fact that the kernel's features are fixed. In the NTK approach you don't have any adaptivity to the data: your data probably wants you to use certain features, but you are using fixed features for your data.
[00:51:19] And this is a simple case — let me set up a concrete example. Suppose x ∈ ℝ^d and y ∈ {+1, −1}, and say the coordinates of each x are drawn i.i.d. uniformly from {+1, −1} (the superscript indexes the examples and the subscript indexes the dimensions). And let's say y = x₁·x₂ — a very simple function, just the product of the first two coordinates of the data. If you draw this — suppose this axis is x₁ and this is x₂ — you have four different combinations: this is a positive example, this is a positive example, and these two are negative. So this is not linearly separable, because of how these four points are positioned: you have to use a nonlinear model.
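The non-separability claim can be checked directly: the label-weighted sums of the inputs (and of the constant feature) cancel, so the four margins y·(w^⊤x + b) always sum to zero and cannot all be positive. A short sketch of this check:

```python
import numpy as np
from itertools import product

# the four XOR-labeled corners: y = x1 * x2
pts = [((x1, x2), x1 * x2) for x1, x2 in product([-1, 1], repeat=2)]

# the y-weighted sums of the inputs and of the constant 1 both cancel...
assert sum(y * np.array(x) for x, y in pts).tolist() == [0, 0]
assert sum(y for _, y in pts) == 0

# ...so the margins y*(w @ x + b) sum to zero and can never all be positive;
# spot-check against random linear classifiers:
rng = np.random.default_rng(4)
for _ in range(1000):
    w, b = rng.normal(size=2), rng.normal()
    assert not all(y * (w @ np.array(x) + b) > 0 for x, y in pts)
print("no linear classifier separates the four points")
```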
[00:52:25] Or a linear model on some feature space. So suppose you use a two-layer network, and suppose you regularize the L2 norm of the parameters. This is equivalent to regularizing the norm we discussed before — this norm C(θ), which is something like C(θ) = Σᵢ |aᵢ|·‖wᵢ‖₂. I'm not sure whether you still remember this: when we have a two-layer network, y = Σᵢ aᵢ·σ(wᵢ^⊤ x), you can define this complexity measure, which is kind of like the path norm, and we have shown that regularizing the L2 norm of the parameters is the same as regularizing this complexity measure — which is what gave us the guarantees. We have discussed this. And suppose you use ReLU networks: then what you'll find is the following about the best solution.
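As an aside on the equivalence just mentioned: because ReLU is positively homogeneous, rescaling (aᵢ, wᵢ) → (c·aᵢ, wᵢ/c) leaves the function unchanged, and minimizing the L2 penalty ½(aᵢ² + ‖wᵢ‖²) over c gives exactly |aᵢ|·‖wᵢ‖₂ by AM–GM — which is where C(θ) comes from. A small numeric sketch (the specific weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
a = 1.7                                   # illustrative output weight of one ReLU neuron
w = rng.normal(size=5)                    # illustrative input weight vector

def l2_penalty(c):
    # L2 penalty of the rescaled neuron (c*a, w/c); the function it computes is unchanged
    return 0.5 * ((c * a) ** 2 + np.linalg.norm(w / c) ** 2)

best = min(l2_penalty(c) for c in np.linspace(0.1, 5.0, 200001))
path_norm = abs(a) * np.linalg.norm(w)    # this neuron's contribution to C(theta)

# the optimally balanced L2 penalty equals the path-norm term
assert abs(best - path_norm) < 1e-4 * path_norm
print(best, path_norm)
```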
solution I mean the [00:53:39] but but by the best solution I mean the minimum Norm solution the minimum [00:53:43] Norm Solutions [00:53:47] is a is a sparse one right is um uses [00:53:53] is a is a sparse one right is um uses a sparse combination [00:53:58] of neurons [00:54:00] of neurons so basically the best solution actually [00:54:02] so basically the best solution actually you can in this case you can exactly [00:54:04] you can in this case you can exactly compute what's the best solution I'm not [00:54:05] compute what's the best solution I'm not going to prove it but I think it's [00:54:08] going to prove it but I think it's something relatively believable [00:54:11] something relatively believable um so so the best solution first of all [00:54:13] um so so the best solution first of all doesn't really use any other dimensions [00:54:14] doesn't really use any other dimensions that seems to be believable right why [00:54:16] that seems to be believable right why you want to use any other dimensions if [00:54:18] you want to use any other dimensions if your function is only the first function [00:54:20] your function is only the first function of the first two dimensions and you only [00:54:22] of the first two dimensions and you only have to use something about the first [00:54:23] have to use something about the first two Dimensions you only need the [00:54:25] two Dimensions you only need the following [00:54:27] following uh four neurons so one neuron computes [00:54:32] uh four neurons so one neuron computes X1 plus X2 [00:54:34] X1 plus X2 and another neuron compute [00:54:37] and another neuron compute minus X1 minus X2 [00:54:39] minus X1 minus X2 and leaving another neuron compute that [00:54:42] and leaving another neuron compute that computes the value of X1 minus X2 [00:54:44] computes the value of X1 minus X2 another one that computes the value of [00:54:47] another one that computes the value of X2 minus X1 [00:54:50] so I claim that this is 
[00:54:52] So I claim that this is actually equal to the target function, and if you want to verify that, I can briefly do it. Note that relu(t) + relu(-t) = |t|. So the network computes (1/2)(|x1 + x2| - |x1 - x2|), and I claim that this is actually equal to x1 * x2 when x1 and x2 are both binary, in {-1, +1}. How do you see this? I guess the only way I can see it is to just try all four combinations, right? [00:55:28] If x1 and x2 have the same sign, then the second term becomes zero and the first term becomes two; that's the case when x1 and x2 are either both one or both minus one, which is exactly when the product x1 * x2 is one, and you multiply by the half and you get one. And if x1 and x2 have different signs, the first term is zero and the second term becomes two, and you multiply by the half and you get minus one.
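The four-neuron identity above is easy to check numerically; here is a quick sketch in NumPy (the 1/2 coefficients and the relu are exactly the ones from the board, nothing else is assumed):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def f(x1, x2):
    # (1/2)[relu(x1+x2) + relu(-x1-x2)] - (1/2)[relu(x1-x2) + relu(x2-x1)]
    # = (1/2)(|x1+x2| - |x1-x2|), using relu(t) + relu(-t) = |t|
    return 0.5 * (relu(x1 + x2) + relu(-x1 - x2)) \
         - 0.5 * (relu(x1 - x2) + relu(x2 - x1))

# On binary inputs, the four-neuron network computes the product x1 * x2.
for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        assert f(x1, x2) == x1 * x2
```

This is just the case analysis from the lecture automated over the four sign patterns.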
[00:56:00] Good. So basically, if you use a neural network with regularization, we can show that this is the solution it finds, which is a very sparse combination of a small number of features in some sense. When you use regularization, you find these four features and you do a linear combination of them. So these four features are the right features for this task. [00:56:21] However, suppose you use NTK instead. Then what you do is, you don't learn any features; you just do a dense L2 combination of your existing features. So in some sense, what you do is, I guess, how do I say this in the best way? I could state the theorem, but I think roughly the intuition is
[00:57:15] that your prediction will be something like a sum of a_i * sigma(w_i^T x), or maybe something like a sum of a_i * phi_i(x). Right, so there are a bunch of features, each feature is phi_i, and these features use all the dimensions. Exactly what the features are depends, of course, on what kernel you are using: if you use the NTK kernel you get some feature vector, and if you use a random-features kernel you get some other features. [00:57:57] But whatever features you use, each one is always a function of all the coordinates of the data, so you cannot specialize to a particular subset of features. And also, because you are doing regularization, finding the minimum L2 norm solution for the coefficients in front of the features,
you don't prefer any sparse solution, right? So recall that... sorry, I'm using the wrong version of the notes, so I have to improvise a little bit. [00:58:37] So if you look at NTK, what you do is, you try to minimize the L2 norm of this vector a such that sum_i a_i phi_i(x_j) equals y_j, for all j. And for the neural network, I think we have claimed that the neural network is the same as an L1 SVM in the kernel feature space, so the corresponding thing would be: you minimize the L1 norm of a such that sum_i a_i phi_i(x_j) equals y_j. [00:59:16] So in some sense, with the neural network you have a lot of features and you are choosing a sparse subset of the features, and with NTK you are minimizing the L2 norm, which never gives you sparse combinations. It actually prefers dense combinations.
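The contrast between the two norms shows up even on a tiny underdetermined system. Below is a sketch with a made-up 1x4 feature matrix (not the actual NTK features): the minimum-L2-norm solution from the pseudoinverse spreads mass over every coefficient, while the sparse solution, which here is also the minimum-L1-norm one, concentrates on a single feature.

```python
import numpy as np

# One constraint, four unknowns: 2*a1 + a2 + a3 + a4 = 2.
Phi = np.array([[2.0, 1.0, 1.0, 1.0]])
y = np.array([2.0])

# Minimum-L2-norm interpolant: a = Phi^+ y. Dense: every entry nonzero.
a_l2 = np.linalg.pinv(Phi) @ y          # [4/7, 2/7, 2/7, 2/7]

# Minimum-L1-norm interpolant: put all the mass on the largest coefficient.
# (Any feasible a has ||a||_1 >= 1 since |2a1+a2+a3+a4| <= 2||a||_1.)
a_l1 = np.array([1.0, 0.0, 0.0, 0.0])

# Both interpolate the data...
assert np.allclose(Phi @ a_l2, y) and np.allclose(Phi @ a_l1, y)
# ...but the L2 minimizer has the larger L1 norm (10/7 > 1),
# and the sparse solution has the larger L2 norm (1 > sqrt(4/7)).
```

Each solution is optimal for its own norm over the same feasible set, which is the tradeoff the lecture is describing.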
[00:59:34] It's the reverse direction: you want as smooth a combination of the existing features as possible. So that's why you have to pay more samples if you use NTK, because you are using kind of suboptimal features. And this can be proved in this case. There is a theorem where you can prove that the kernel method with the NTK kernel requires n to be Omega(d^2) samples to learn this problem with error less than one, and in contrast, the regularized neural network only needs n = O(d) samples. [01:00:49] Any questions about this? I think this part is a little bit hand-wavy, because I didn't want to go into all the details, and it also depends a little bit on what we discussed in the past, namely the connection between the L1 SVM and neural networks.
[01:01:08] Any questions? So maybe, just to wrap this up once again: basically, if you train a neural network with regularization, then we have shown that this is equivalent to doing an L1 SVM in a feature space. You are trying to find a sparse combination of features that fits your data, right? And this particular example confirms the intuition that finding a sparse combination is useful, because not all the features are equally useful: the features we designed are much better features than a random feature. [01:01:58] So that's why the neural network with regularization can have good sample complexity. And on the other hand, when you do the NTK kernel, or really most other kernels, you are not trying to find a sparse combination of the features; you are trying to find a dense combination of the features, because you are doing L2,
you're finding a minimum L2 norm solution. [01:02:18] And each of the features is a function of all the coordinates of the data point, so the features are, you know, not that useful in some sense. There's a lot of noise in your features, and you have to rely on averaging out the noise over multiple features to learn something. You can still learn something, but it's going to be less efficient. Right, I think that's the summary. [01:03:06] If there are no other questions, I'm going to move on to the next topic, which is about the implicit regularization effect. I'm not sure whether you still remember what we discussed in the mysteries-of-deep-learning section, so I'm going to briefly repeat the high-level goal here. The observation we had about empirical deep learning is that there are multiple
[01:03:45] global minima of the training loss, and the optimizers have some implicit preferences. And we have claimed that, you know, almost every aspect of the optimizer has some preference: for example, if you use the particular initialization that enables NTK, then you have the NTK preference, you are learning the NTK solution; and if you use some other initialization, you have some other preference. [01:04:24] And we have kind of concluded that the NTK solution is the wrong preference, right? You don't do much beyond a kernel method; you actually do exactly the same as a kernel method. So basically that means you are finding the wrong global minimum, one that doesn't necessarily generalize as well as the other global minima. So from now on, we're going to look at the other global minima of this objective and see which ones the other optimizers, you know,
prefer. [01:04:52] So if you use a different optimizer, you may prefer a solution that is different from the NTK solution. Oh yeah, what would that mean? Right, right, so why do I call it the NTK initialization? By the NTK initialization I basically mean the initialization under which you can prove the NTK result. [01:05:18] So maybe specifically, I think last time we had two examples, right? So one example is, for example, when we have this overparametrized model, and we initialize a_i to be plus or minus one and w_i to be a spherical Gaussian. And you can, for example, initialize with something much
smaller. [01:06:05] And actually you should, right? If you really do the experiments for exactly this parametrization, you should initialize either the a_i's or the w_i's maybe some square-root factor smaller, and then you're going to see very different empirical results. And actually we have done this, you know, in the paper; many people have done this, it's a relatively simple experiment. [01:06:27] So here you can say the initialization is the culprit. For the other case, I think when you change the parametrization to see the NTK regime, you can say the parametrization is the culprit. And also, even in the case where you initialize the same as NTK, suppose you do stochastic gradient descent with sufficiently large stochasticity; it doesn't have to be super large, but a little bit larger than zero. Then you will leave that initialization
area; you're going to converge to some other place. [01:06:57] So that's another way to leave the NTK regime. So what we're going to discuss next is leaving NTK either by using the initialization or by using stochasticity. And what else? You can also use the learning rate. The learning rate is kind of almost the same as stochasticity, because if you have a larger learning rate with SGD, then your stochasticity is bigger. [01:07:37] Right, so the first thing I want to do is the implicit regularization effect from initialization; the first is this effect of initialization. And you will see that in certain cases you can leave... you know, we don't necessarily really care about leaving NTK per se; we really care about having better
generalization, right? [01:08:01] So that means you have to leave NTK, but you probably have to do more than that to get better generalization. So this is what we're going to do in the next 15 minutes of this lecture and in the next lecture: the effect of initialization. [01:08:20] And I'm going to start with a simple case, where you have overparametrized linear regression. You need the overparametrization because, especially if you consider linear models, one of the important things is that you have to have multiple global minima; otherwise there's no so-called implicit regularization effect, right? Because the optimizers have to converge to a global minimum, you have to have multiple global minima so that the optimizers have a choice between them. So that's why we need overparametrized
regression, so that you have multiple global minima. Actually, there is an infinite number of global minima when you have overparametrization. [01:09:03] So we'll see that in this case, small initialization prefers a low norm solution. And this is also the case when, in the next lecture, we go beyond linear models: the high-level conclusion is the same, that if you use small initialization, then you prefer a low norm solution. [01:09:30] So today, in the next 15 minutes, we're only going to do the linear models, and this is actually not that hard. So let's set up first. This is a standard linear regression setup. I guess for this lecture I'm using subscripts for the number of examples, because if you look at any linear regression book, they use subscripts for examples, so here we do the same.
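The claim for the linear case can be previewed numerically. The sketch below (dimensions and step size are my own arbitrary choices, not from the lecture) runs gradient descent on the overparametrized least-squares loss from a zero initialization; the iterates stay in the row span of X, so GD converges to the minimum-norm interpolant X^+ y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # n << d: overparametrized regression
X = rng.standard_normal((n, d))   # full rank n almost surely
y = rng.standard_normal(n)

beta = np.zeros(d)                # "small" (here: exactly zero) initialization
lr = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(5000):
    beta -= lr * X.T @ (X @ beta - y)   # gradient of (1/2)||y - X beta||^2

# GD converged to a zero-loss solution...
assert np.allclose(X @ beta, y, atol=1e-6)
# ...and it is exactly the minimum-norm one, pinv(X) @ y.
assert np.allclose(beta, np.linalg.pinv(X) @ y, atol=1e-6)
```

Starting from zero makes every gradient step a combination of the rows of X, which is why this particular global minimum is the one selected.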
[01:10:07] So each of these x_i's is an example, example i, and you put them into a matrix X. And let's assume X is full rank, which means rank n, and let's also assume n is much smaller than d. [01:10:32] Okay, so we have a parameter beta, and you get a loss function L_hat(beta) = (1/2) ||y_vec - X beta||^2; I have a half here just for convenience. This is the empirical loss. Okay, so this is standard linear regression, and I claim that L_hat(beta) has an infinite number of global minima, and you can actually characterize exactly what the global minima are, all the global minima with loss zero. [01:11:17] So what are the global minima? Suppose you take beta to be X^+ y_vec + zeta, where X^+ is the pseudoinverse and zeta is any vector that is orthogonal to x_1 up to x_n. Then beta is a global minimum.
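This characterization is easy to verify numerically. A sketch with arbitrary random data: build zeta by projecting any vector onto the orthogonal complement of the row span of X, add it to X^+ y, and the loss is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 12
X = rng.standard_normal((n, d))   # rank n (almost surely), n < d
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)
# Project an arbitrary vector onto the orthogonal complement of the
# row span of X: zeta = (I - X^+ X) z, which guarantees X @ zeta = 0.
z = rng.standard_normal(d)
zeta = z - X_pinv @ (X @ z)

beta = X_pinv @ y + zeta          # candidate global minimum
loss = 0.5 * np.linalg.norm(y - X @ beta) ** 2
assert np.isclose(loss, 0.0)      # zero training loss: a global minimum
```

Since z was arbitrary, this traces out the whole (d - n)-dimensional family of global minima the lecture describes.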
[01:11:47] So as long as your beta has this form, it's a global minimum, and these are actually all the global minima. And actually, here, I think last time someone asked about the pseudoinverse, so maybe let me quickly go over some basic properties of pseudoinverses. [01:12:07] I guess my way of thinking about it is probably slightly different from Wikipedia's. The way I always think about the pseudoinverse is the following: I always think about it via the SVD, because with the SVD I can verify everything, so I don't have to remember the properties. [01:12:24] So suppose you have a matrix X of dimension n by d, and suppose X is of rank r; of course r has to be at most both n and d. So the way I remember every property of the pseudoinverse is the following. I consider the SVD of X, which is X = U Sigma V^T, where Sigma is of dimension r by r, let's say you ignore all the zero
singular values, and U is of dimension n by r and V is of dimension d by r. [01:13:05] So then you know that the column span of U is the same as the column span of X, and the column span of V, because there's a transpose here, is the row span of X. [01:13:30] And also, the pseudoinverse in this notation, you can think of it as defined to be X^+ = V Sigma^{-1} U^T. Here Sigma is a diagonal matrix with entries sigma_1 up to sigma_r, and the sigma_i's are all positive, so this inverse is well defined, and X^+ is just V Sigma^{-1} U^T. [01:14:00] And now, if you want to understand the properties of the pseudoinverse, you can verify them yourself. So, X X^+: what is this going to be? This is going to be U Sigma V^T times V Sigma^{-1} U^T.
[01:14:17] V^T V is the identity, and Sigma times Sigma^{-1} is the identity, so this is U U^T. So what is this? This is the projection onto the column span of X, right? It's the projection onto the column span of U, and the column span of U is the same as the column span of X. [01:14:43] And X^+ times X, if you do the same calculation, is going to be V Sigma^{-1} U^T times U Sigma V^T, which is V V^T, which is the projection onto the row span of X. And you can also see the dimensions match, because this is a matrix of dimension d by d: the rows of X are in dimension d, and the columns of V are in dimension d. [01:15:25] Right, and now consider the case where
[01:15:27] In the case where X is in ℝⁿˣᵈ and the rank is n, then you know that X X† is the projection onto the column span of X, and the column span of X is now the full space, the span of all the vectors, so that's why X X† is just the identity. [01:16:03] And X† X, this is the projection onto the row span of X. How many rows are there? There are n rows of X, and they don't span everything, because the dimension d is bigger, so you cannot span everything. That's why this is not the identity; this is really just the projection onto the span of the rows, and you cannot simplify it more. [01:16:31] Okay, so I guess this was a little bit long as a building block, but I hope this helps. This is how I understand the pseudo-inverse.
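As a sanity check on these facts (X X† = U Uᵀ projects onto the column span, X† X = V Vᵀ onto the row span, and X X† = I when X is a full-row-rank n × d matrix with n < d), here is a small NumPy sketch; the matrix sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide matrix: n < d, full row rank n almost surely (the overparameterized case).
n, d = 3, 5
X = rng.standard_normal((n, d))
X_pinv = np.linalg.pinv(X)  # V Sigma^{-1} U^T from the SVD

# X X^+ projects onto the column span of X; with rank n that is all of R^n.
assert np.allclose(X @ X_pinv, np.eye(n), atol=1e-8)

# X^+ X projects onto the row span of X: a rank-n projection in R^d,
# so it is idempotent and symmetric, but not the identity (since n < d).
P = X_pinv @ X
assert np.allclose(P @ P, P, atol=1e-8)
assert np.allclose(P, P.T, atol=1e-8)
assert not np.allclose(P, np.eye(d))
```

The trace of the row-span projector equals its rank n, matching the dimension count in the lecture.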
[01:16:40] I never remember what X X† equals, so this is how I remember it. [01:16:46] So now, the question: we have so many global minima, right? And with this, it's easy to verify that these are global minima. You can verify that this β is a global minimum, because you can take X β, which equals X X† y plus X ζ. The ζ is orthogonal to the rows of X, so X ζ is zero, and you get X X† y. And X X† in this case is the identity, so you get y back; I just claimed that X X† is the identity. [01:17:21] Okay, so that's why X β equals y, and that's why it's a global minimum. And the question is which global minimum you're going to converge to. [01:17:30] So the theorem is that if you run gradient descent on L with
initialization β₀ = 0, and the learning rate smaller than some threshold (actually you can know exactly how small it has to be; I just don't want to give you too many details), so if you have the learning rate small enough and the initialization is zero, then this converges to the minimum-norm solution. [01:18:21] The minimum-norm solution β̂ is defined to be the one with the smallest ℓ₂ norm among all global minima of the loss function. So basically you get this ℓ₂ norm for free, right? You don't need to say "I want the minimum-norm solution"; you just say "I want to run gradient descent," and you get the minimum-norm solution for free. [01:18:48] And the reason you get it is that you express your implicit preferences through the initialization. [01:18:58] Okay, cool. So yeah, I think I have five minutes, which is perfect for the proof sketch.
[01:19:21] I guess this is actually really a proof, but I ignore some small details; that's why I call it a sketch. [01:19:26] So the first step is that, by standard convex optimization, you know that L(β_t) goes to zero as t goes to infinity, right? If you run for a long time, your loss will become zero. I'm not going to show how you do this, but you can invoke any off-the-shelf optimization result. [01:19:53] And the second thing is that you know that β̂ is actually equal to X† y. So we know that all of these are global minima, but if you take ζ to be zero, then that's the minimum-norm solution. [01:20:32] And this can also be simply verified, because
for any ζ orthogonal to x₁ up to xₙ, you look at ‖X† y + ζ‖₂². This is equal to ‖X† y‖₂² plus ‖ζ‖₂² plus two times ζᵀ X† y, and this is at least ‖X† y‖₂², because the norm of ζ is at least zero and the cross term is just equal to zero. [01:21:23] Let's see why the cross term is equal to zero. The claim is that ζᵀ X† y is zero, and this is actually a good way to practice what we just did: the pseudo-inverse X† is V Σ⁻¹ Uᵀ.
pseudo inverse is V Sigma inverse U [01:22:09] wait pseudo inverse is V Sigma inverse U transpose [01:22:10] transpose so the [01:22:12] so the the columns [01:22:14] the columns span of X inverse is the same as the [01:22:18] span of X inverse is the same as the Rose one of X right so and [01:22:23] Rose one of X right so and Zeta okay sorry this is transpose not [01:22:26] Zeta okay sorry this is transpose not this [01:22:28] this Zeta is orthogonal [01:22:30] Zeta is orthogonal to the rows of X so which means that [01:22:33] to the rows of X so which means that Zeta is orthogonal to the column [01:22:36] Zeta is orthogonal to the column of x to the universe right so that's why [01:22:39] of x to the universe right so that's why Zeta times this is the Zeta times The [01:22:42] Zeta times this is the Zeta times The Columns of [01:22:43] Columns of of the of the pseudo inverse so so [01:22:46] of the of the pseudo inverse so so that's why everything is zero right so [01:22:47] that's why everything is zero right so this is zero [01:22:49] this is zero so Zeta is our sock node too [01:22:53] so Zeta is our sock node too column span [01:22:55] column span of X2 inverse which is equals to the row [01:22:58] of X2 inverse which is equals to the row span [01:22:59] span of x [01:23:04] okay all right so so so basically you [01:23:07] okay all right so so so basically you see that in the norm is only decreasing [01:23:10] see that in the norm is only decreasing if you set as Delta Zeta to be zero [01:23:12] if you set as Delta Zeta to be zero that's why [01:23:13] that's why when zero is zero that's the minimum [01:23:15] when zero is zero that's the minimum solution [01:23:16] solution right so three [01:23:18] right so three I guess this is this one to our base a [01:23:21] I guess this is this one to our base a basic facts about this linear regression [01:23:22] basic facts about this linear regression thing [01:23:23] thing um the three is what's really about the 
[01:23:26] Step three is what's really about the optimization: you can prove that β_t is in the span of x₁ up to xₙ, and you can prove this inductively. [01:23:40] Why is this the case? It's a super simple induction, because β_{t+1} equals β_t minus η times the gradient, and the gradient is Xᵀ(y − X β), so it is in the column span of Xᵀ, which is the row span of X. [01:24:16] So basically your update is always in the span of the data; that's why you never leave the span. And maybe I should start with the base case: β₀ = 0 is in the span of x₁ up to xₙ, and each time you update, the update is in the span of x₁ up to xₙ, so by induction you're always in the span.
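Steps one and three can be illustrated together in a few lines of NumPy; the sizes, learning rate, and iteration count below are made up, chosen only so that gradient descent converges in the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 8                        # overparameterized: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)
P_row = X_pinv @ X                 # projection onto the row span of X

beta = np.zeros(d)                 # zero initialization
eta = 0.01                         # small constant learning rate
for _ in range(20000):
    beta -= eta * X.T @ (X @ beta - y)   # gradient of (1/2)||X beta - y||^2

assert np.allclose(P_row @ beta, beta, atol=1e-8)   # never left the row span
assert np.allclose(X @ beta, y, atol=1e-6)          # reached a global minimum
assert np.allclose(beta, X_pinv @ y, atol=1e-6)     # and it is X^+ y, the min-norm one
```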
[01:24:44] So basically, because you're always in the span, the only solution to L(β) = 0 that is in the span is this one, X† y. Because what are the solutions? The solutions with loss zero are the ones of the form X† y + ζ, and among these, which ones are in the row span of X? Only this one, X† y, is in the row span of X, because all the others are not; their ζ parts are orthogonal to the row span. So the only zero-loss solution in the row span of X is just the first term. [01:25:36] And that happens to be the minimum-norm solution, and that's why you get the minimum-norm solution. [01:25:43] Basically, all the magic comes from this: this is a regularization in some sense; this is a
constraint imposed by the algorithm. The algorithm says you cannot go everywhere; you can only go to those places that are in the span of the data. So you have to stay in the span of the data, and it happens that in the span of the data there's only one solution, and that solution is the minimum-norm solution. [01:26:17] So in some sense, if I draw a picture (okay, I guess I'm running late, but real quick): this is a very difficult picture to draw, but you can still try it. Say this blue direction is the span of the data; suppose we only have one data point, so the span of the data is one-dimensional. [01:26:50] And then you have a subspace of solutions, which is orthogonal to the span.
[01:27:04] This purple plane is the set of solutions, where you have loss zero; it's orthogonal to the span of the data, and the intersection point is the target solution. The intersection point is really X† y. [01:27:27] So you start at zero, and you try to reach this purple plane, because that's what the optimization wants to do; the optimization wants to reach the purple plane. But the algorithm also says you can only go in the blue direction, and so you meet at the intersection, and the intersection is the closest point to the origin. [01:27:51] Okay, I guess that's it. Oh, yeah? [01:28:01] Question: do you need the condition that you start in the span? Yes, you do, because suppose, for example, you start somewhere off the span. What happens is that you can only move in this blue direction; that's what the algorithm
says, right? The algorithm says the update is in the span, so all your changes are in the span; you can only move that way. So basically you go in this direction until you hit the purple plane, and then that place is not the minimum-norm solution anymore; that place is going to have a higher norm than the ideal point. [01:28:42] And yes, so you can say the implicit regularization effect always happens, but the effect is the minimum-norm solution only if your initialization is zero. You always have a preference, right? Whatever you do with the initialization, you have some preference about which global minimum you want to converge to. But if you want the preference to be the minimum-norm solution, then you really have to choose zero as the initialization. [01:29:17] Any other questions?
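The answer above can also be checked numerically: with a generic non-zero initialization, the component of β₀ orthogonal to the row span never moves, and gradient descent lands on a higher-norm global minimum. A small NumPy sketch (sizes and step size made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 3, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
X_pinv = np.linalg.pinv(X)

beta0 = rng.standard_normal(d)    # generic, non-zero initialization
beta = beta0.copy()
eta = 0.01
for _ in range(20000):
    beta -= eta * X.T @ (X @ beta - y)

# The part of beta0 orthogonal to the row span is frozen by the dynamics,
# so GD lands on X^+ y plus that frozen component: still a global minimum,
# but with a strictly larger norm than the minimum-norm solution.
frozen = beta0 - X_pinv @ (X @ beta0)
expected = X_pinv @ y + frozen
assert np.allclose(beta, expected, atol=1e-6)
assert np.linalg.norm(beta) > np.linalg.norm(X_pinv @ y)
```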
[01:29:43] Right, so the question is whether there's any hope that this can transfer to nonlinear cases. [01:29:47] I think here we are using a lot of linear algebra, right? We know what the minimum-norm solution is, we have the orthogonality, everything. With nonlinearity, you don't have most of this, so those parts of the discussion that rely very heavily on the linear algebra, the algebraic parts, probably don't transfer at all. [01:30:12] But at least we can find one other situation, where we have nonlinear models, in which you still prefer the minimum-norm solution, and that's next lecture. But the mechanism is somehow not exactly the same. The only connection between next lecture and this lecture is that the final message is similar, or the same, but the techniques are quite different; we still don't know how to unify them in a
right way [01:30:48] um [01:30:52] [Music] [01:31:08] right yeah [01:31:11] right yeah yeah I'm wondering if [01:31:19] you might be learning a little bit [01:31:20] you might be learning a little bit smaller [01:31:24] yeah yeah so so you're absolutely right [01:31:26] yeah yeah so so you're absolutely right so like the the difficult case come from [01:31:28] so like the the difficult case come from the very very small generate case I [01:31:30] the very very small generate case I think even a testable small in red case [01:31:32] think even a testable small in red case so even for infant test most modeling [01:31:34] so even for infant test most modeling rate [01:31:36] rate um like so it's basically have a [01:31:38] um like so it's basically have a differential equation right you just [01:31:39] differential equation right you just have a trajectory and [01:31:42] have a trajectory and um and you want to know where the trade [01:31:43] um and you want to know where the trade actually goes right so I as far as I [01:31:47] actually goes right so I as far as I know like uh you know I'm not a I don't [01:31:50] know like uh you know I'm not a I don't know too much about differential [01:31:51] know too much about differential equations but I think the problem is how [01:31:53] equations but I think the problem is how to solve that equation like like you [01:31:55] to solve that equation like like you know they create the solution exists [01:31:57] know they create the solution exists that you know there's a trajectory but [01:31:58] that you know there's a trajectory but what's the where the structure really [01:32:00] what's the where the structure really goes that's the hard part I don't think [01:32:03] goes that's the hard part I don't think we [01:32:04] we at least I'm not aware of any [01:32:06] at least I'm not aware of any papers that use the tools from [01:32:08] papers that use the tools from differential equations [01:32:10] differential equations uh 
heavily. [01:32:12] Right, so it is a useful language: from the formulation perspective, the differential-equations language is very useful, but typically the hard part is how you solve the equation. [01:32:30] In some cases you can; I think I know one paper where you can solve it, but you have to literally solve it using some new math and the structure of the problem. It's not like you can invoke a theorem from the differential-equations literature saying this kind of question can always be solved; I don't think so. [01:32:51] Okay, sounds great. Okay, cool, see you next week.

================================================================================
LECTURE 014
================================================================================
Stanford CS229M - Lecture 15: Implicit regularization effect of initialization
Source: https://www.youtube.com/watch?v=l-CR_TLihdg

---

Transcript

[00:00:05] Okay, let's get started. I guess everything is working now. Okay, cool. [00:00:12] So last time we
started to talk about the so-called implicit regularization effect of the optimizers. [00:00:24] Last time we discussed the very basic case, which is that if you use initialization zero, and you run gradient descent, and you have a linear regression problem, then what you get is the minimum-norm solution. That was last time. [00:00:55] And today we're going to talk about a case where we have nonlinear models; we'll see a similar phenomenon, but we're going to have a somewhat different proof. [00:01:33] Okay, so let's delve into the details. This is the nonlinear model we are going to consider. You will see that this model is not linear, but it's actually not that much different from a linear model, as you will see. There is a paper that can do a little more than this, but generally we don't know how to deal with very
complex models like deep networks. [00:02:00] So this is the nonlinear model we're going to consider. Suppose β is the parameter and x is the input, and the model is f(x) = ⟨β ⊙ β, x⟩, where ⊙ is the Hadamard product, meaning the entry-wise product. So basically you entry-wise square the parameter, and then you take the inner product with x. This is still linear in x, but it's not linear in β. [00:02:50] So in terms of the loss function, the loss function will be non-convex, because the model is not linear in β, and when you take the squared loss it becomes non-convex. It's not that interesting in terms of the model itself, because anyway you are fitting a linear model in x, but from the algorithm's side, from the implicit-regularization-effect perspective, it's still interesting.
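A minimal NumPy sketch of this model, illustrating that it is linear in x, quadratic in β, and that the resulting squared loss is non-convex in β (the example values are made up):

```python
import numpy as np

def f(beta, x):
    """The lecture's model: f_beta(x) = <beta ⊙ beta, x>, ⊙ = entry-wise product."""
    return (beta * beta) @ x

rng = np.random.default_rng(4)
beta = rng.standard_normal(4)
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)

# Still linear in x ...
assert np.isclose(f(beta, x1 + x2), f(beta, x1) + f(beta, x2))
# ... but quadratic, not linear, in beta: doubling beta multiplies f by 4.
assert np.isclose(f(2 * beta, x1), 4 * f(beta, x1))

# Squared loss is non-convex in beta: in 1-D with x = 1, y = 1,
# beta = +1 and beta = -1 are both global minima, but their midpoint 0 is not.
loss = lambda b: (1.0 - f(np.array([b]), np.array([1.0]))) ** 2
assert loss(1.0) == 0.0 and loss(-1.0) == 0.0 and loss(0.0) == 1.0
```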
[00:03:17] It's still interesting because you have a non-convex objective function. [00:03:19] And we're going to make this even more interesting by considering a special case where the ground truth is y = ⟨β* ⊙ β*, x⟩, and β* is r-sparse. [00:03:43] r-sparse means that the ℓ₀ norm of β* is at most r: you only have r non-zero entries. [00:03:59] And the reason we want this restriction on β* is that we want to consider overparameterized models, meaning the case where n is smaller than d. If n is smaller than d and β* is fully general, then there is no way you can hope to learn anything from fewer than dimension-many data points. So basically I will
[00:04:31] And so we're going to assume that n is much smaller than d, but larger than some poly(r); that's the setting we're going to work with. [00:04:42] More specifically, and for simplicity, or without loss of generality, let's also assume that β* is larger than zero entrywise, because the signs of the entries of β* don't really matter for the functionality of the ground-truth model (only β* ⊙ β* enters it). [00:04:59] And actually, for simplicity of this lecture, let's also assume that β* is just the indicator vector of some subset of coordinates, β* = 1_S, where S is a subset of the coordinates and the size of S is equal to r. This is only for simplicity of this lecture. [00:05:29] Okay, and now let's define the data. I guess we have already said that we're going to have an over-parameterized model: we have n data points and n is less than d.
[00:05:39] These n data points are denoted x₁, ..., xₙ, drawn i.i.d. from a d-dimensional Gaussian with spherical covariance, N(0, I_d), and yᵢ is generated from this model without any noise: yᵢ = ⟨β* ⊙ β*, xᵢ⟩, the inner product of the entrywise square of β* with xᵢ. [00:06:09] So n is much, much less than d, but we assume that n is bigger than Ω̃(r²); n is roughly bigger than r². [00:06:23] This amount of data in principle allows us to recover β*. Actually you only need Ω(r) examples to recover β* if you count dimensionality, since there are roughly r degrees of freedom, so we would only need n to be larger than Ω(r); but for the theory to work we have to require n to be larger than r².
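A sketch of this data-generating process (numpy; the helper name `make_data` and the concrete choice S = {first r coordinates} are illustrative assumptions):

```python
import numpy as np

def make_data(n, d, r, rng):
    """Noiseless sparse instance from the lecture: x_i ~ N(0, I_d) i.i.d.,
    beta* = 1_S for an r-subset S of coordinates, and
    y_i = <beta* ⊙ beta*, x_i> with no observation noise."""
    beta_star = np.zeros(d)
    beta_star[:r] = 1.0                 # take S = {1, ..., r} for concreteness
    X = rng.standard_normal((n, d))     # rows are the x_i
    y = X @ (beta_star * beta_star)     # exact labels, no noise
    return X, y, beta_star
```

Note the over-parameterized regime is n < d, e.g. `make_data(30, 100, 3, rng)`.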
[00:06:45] Still, if r is very small, you can have n much, much smaller than d and still bigger than r². Let's say r is a constant; that's probably the right way to think about it. Any polynomial dependency on r is fine, so that n is just something like a big constant, but n can still be much less than d. [00:07:03] Okay. So after we define all this, you may wonder why we have to use this nonlinear model. The answer is: you don't have to use it if you really want to solve the problem. The nonlinear model is only introduced to study this implicit regularization effect. If you really care about solving the estimation question, you can use the classical solution, which is called the Lasso, or, in the terms we use in this lecture, L1 regularization, [00:07:43] basically, to leverage sparsity.
[00:07:49] I'm not sure whether you all have this background, but typically people use the L1 norm to, in some sense, encourage sparse vectors. I'm not going to get into detail there, but you can show that if you minimize the L1 norm of θ, the parameter of the model, then you can reconstruct sparse vectors. [00:08:14] In particular, suppose you have the linear model f_θ(x) = ⟨θ, x⟩. Then the so-called Lasso, the L1-regularized objective, is something like (1/(2n)) Σᵢ (yᵢ − ⟨θ, xᵢ⟩)² + λ‖θ‖₁. [00:08:35] And the classical machine learning theory, which I'm not going to go into in detail here (if you don't know the background, probably just somewhat memorize it or take it as a fact), says that if n is larger than r, where I think you need to pay a logarithmic factor here, so n ≳ r log d, then this objective function recovers the ground truth.
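A sketch of that L1-regularized objective (numpy; the 1/(2n) normalization is a common convention assumed here, not taken from the board):

```python
import numpy as np

def lasso_objective(theta, X, y, lam):
    """Lasso objective for the linear model f_theta(x) = <theta, x>:
    (1/(2n)) * sum_i (y_i - <theta, x_i>)^2 + lam * ||theta||_1."""
    n = len(y)
    residual = y - X @ theta
    return float(residual @ residual / (2 * n) + lam * np.abs(theta).sum())
```

Minimizing this over θ (e.g. with coordinate descent or a proximal method) is the classical route to sparse recovery; only the objective is sketched here, not the solver.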
[00:09:08] So the objective above recovers the ground truth θ*, which, as you can probably already see, corresponds to β* up to the entrywise square: θ* = β* ⊙ β*, approximately speaking. [00:09:34] So basically, if you just really care about solving this question, you view this as a linear model, you don't have to care about the quadratic reparameterization, and then you use L1 regularization to recover the sparse structure. There is a rich body of existing theory about this; I'm not going into the details, but it's somewhat believable, because you are using the sparsity of the vector. [00:09:59] And another thing to note is that the relationship between β and θ is that θ corresponds to β ⊙ β, the entrywise square of β.
[00:10:12] So then the L1 norm of θ is equal to the squared L2 norm of β: the L1 norm of θ is the sum of the entries of θ (they are squares, hence non-negative), which equals the sum of the βᵢ², which is the squared L2 norm of β. [00:10:30] So basically, if you use the quadratic parameterization, you should regularize the L2 norm: this L1-regularized objective corresponds to an L2-regularized objective. If you really want to use the quadratic parameterization, you should minimize (1/(2n)) Σᵢ (yᵢ − f_β(xᵢ))² + λ‖β‖₂², the objective with respect to β. So in the β space you should regularize the squared L2 norm, and in the θ space you should regularize the L1 norm. [00:11:19] So this is the classical solution, and now we're talking about implicit regularization.
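The identity ‖θ‖₁ = ‖β‖₂² under θ = β ⊙ β can be checked directly (a small numpy sketch; the function name is illustrative):

```python
import numpy as np

def l1_of_theta(beta):
    """||theta||_1 for theta = beta ⊙ beta. Since every entry of theta is a
    square, hence non-negative, this equals sum_i beta_i^2 = ||beta||_2^2."""
    theta = beta * beta
    return float(np.abs(theta).sum())
```

For example, β = (1, −2, 3) gives θ = (1, 4, 9), so ‖θ‖₁ = 14 = ‖β‖₂².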
[00:11:28] I think our goal is essentially to say that if you use small initialization, without explicit regularization, you are basically doing the same thing as that L2-regularized objective; let's call it objective (2). As long as you use small initialization with the β parameterization, you automatically get this L2-norm regularization for free, to some extent. This is not exactly the precise way to state the theorem, but it's roughly the main idea. [00:12:13] So, more concretely, what we are interested in is the objective L̂(β); let's formally define it: L̂(β) = (1/(4n)) Σᵢ (yᵢ − ⟨β ⊙ β, xᵢ⟩)². I normalized by 4 here just because it makes the gradient look cleaner; it's just a constant factor. So it's one over 4n times the squared loss, the mean squared error, with no regularization. This is our objective.
[00:12:42] And the optimizer we are going to study is gradient descent on L̂(β) with small initialization. [00:12:59] More concretely, the algorithm is: for some very small α > 0, we initialize β⁰ = α · 1, α times the all-ones vector (you don't know the support of β*, of course, so you initialize all the entries to α), and then you take a gradient descent update at every step: βᵗ⁺¹ = βᵗ − η ∇L̂(βᵗ). [00:13:37] Okay, sorry, so this is the optimizer we're going to study, and we're going to claim that this optimizer actually finds β*, even though there is no explicit regularization. [00:13:56] Any questions so far?
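A sketch of this optimizer in numpy. The gradient of L̂(β) = (1/(4n)) Σᵢ (yᵢ − ⟨β ⊙ β, xᵢ⟩)² works out to (1/n) β ⊙ Xᵀ(X(β ⊙ β) − y), which is why the 1/4 makes it clean. The defaults for α, η, and the step count are illustrative choices, not the theorem's constants:

```python
import numpy as np

def gd_small_init(X, y, alpha=1e-3, eta=0.1, steps=1000):
    """Gradient descent on hat-L(beta) = (1/(4n)) sum_i (y_i - <beta⊙beta, x_i>)^2,
    starting from the small all-ones initialization beta_0 = alpha * 1,
    with no explicit regularization."""
    n, d = X.shape
    beta = alpha * np.ones(d)                 # small initialization
    for _ in range(steps):
        residual = X @ (beta * beta) - y      # shape (n,)
        grad = beta * (X.T @ residual) / n    # (1/n) * beta ⊙ X^T residual
        beta = beta - eta * grad
    return beta
```

On a small synthetic instance in the over-parameterized regime (say d = 100, r = 2, n = 60, noiseless labels), β ⊙ β typically ends up close to β* ⊙ β*, even though nothing in the objective mentions sparsity.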
[00:14:09] So here is the theorem. The shorter version of the theorem is that if n is Ω̃(r²), this algorithm converges to β*, with small α; but there's a lot of detail, so let me state the main theorem. [00:14:57] Suppose n is bigger than C · r² log² d, where C is a sufficiently large constant (I think the dependency on the logarithmic factor is sub-optimal, and the dependency on r is probably also sub-optimal), and take α to be less than some inverse polynomial, 1/poly(d). [00:15:58] Then, whenever the time t, the total number of steps, is less than 1/(η√d · α) and bigger than
log(d/α)/η, so for this range of time steps, we have that you recover β* ⊙ β* with error O(α√d): ‖βᵗ ⊙ βᵗ − β* ⊙ β*‖ ≤ O(α√d). [00:16:49] Okay, so how do we interpret this? I guess there are a few remarks for interpretation. [00:17:02] The first thing, which is something I probably should have mentioned earlier, is that L̂(β) has many global minima, and this is because of over-parameterization: if you count degrees of freedom, you have n data points and d parameters, so you have more degrees of freedom than the number of constraints, and therefore many, many global minima. [00:17:39] That's one of the reasons why you can have implicit bias; if you only have one global minimum, there's no way you can have implicit bias. [00:17:50] The second thing: how do we interpret all of these quantities in the bound?
[00:17:55] The runtime lower bound depends only on the logarithm of α. This means you can choose α to be any inverse polynomial; α can be 1/poly(d), basically with the constant C chosen to be a constant, and then the runtime isn't affected too much. [00:18:25] And the error depends on α, so basically if you want a very, very small error, an inverse-polynomial error, you can just take α to be inverse-polynomial, and your runtime doesn't change too much. [00:18:42] And there's the upper bound on the runtime, which means that you need to do early stopping according to this bound. [00:18:52] So if you really believe this theorem, you have to do early stopping, but the early stopping is pretty mild, because you can see that the upper bound actually depends on the inverse of α: if you take something like α = 1/d¹⁰, then
your upper bound is pretty relaxed; you can run for a long time. [00:19:11] And actually, in practice we never observe that you have to early stop: if you really run experiments on this synthetic example, you never have to early stop, and I don't believe that you really have to; this is more or less an artifact of the proof. But this artifact is not too restrictive anyway, because the bound depends on the inverse of α, so you can take α small to make the bound very relaxed. So we didn't pay the attention needed to remove it completely, even though we believe that's possible. [00:19:46] Anyway, basically the right way to use this is that you take α to be something super small, and then your error is very small and your runtime lower bound is only logarithmic in 1/α. [00:20:00] Okay.
[00:20:03] So there's one small thing: α cannot be zero. Why don't we take α to be zero? The only reason α cannot be zero is that β = 0 is a saddle point: the gradient ∇L̂(β) at β = 0 is zero. [00:20:30] This comes from the quadratic parameterization: if you compute the gradient, you will see that because of the quadratic parameterization the gradient is always multiplied by β itself, so if β is zero, you just have zero gradient. And here we are analyzing the noiseless case, no noise, no stochasticity, so if you start at zero, you will just stay there forever. That's why you cannot use zero initialization, but anything close to zero is fine; in some sense, the log(1/α) in the bound is how much time you have to pay to leave the saddle point.
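This can be seen directly from the gradient formula (a numpy sketch; `grad_hat_L` is an illustrative name):

```python
import numpy as np

def grad_hat_L(beta, X, y):
    """Gradient of hat-L(beta) = (1/(4n)) sum_i (y_i - <beta⊙beta, x_i>)^2,
    which works out to (1/n) * beta ⊙ (X^T (X (beta⊙beta) - y)).
    Every entry carries a factor of beta_j, so beta = 0 gives a zero
    gradient: the origin is a saddle point, and gradient descent started
    exactly there never moves."""
    n = len(y)
    return beta * (X.T @ (X @ (beta * beta) - y)) / n
```

At β = 0 this returns the zero vector regardless of the data, while any nearby nonzero β gets a nonzero pull.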
[00:21:15] And leaving the saddle point is actually very fast. In some sense you can believe that: if you have a saddle point (how do I draw it, something like this), leaving it is like optimizing a concave function, going downhill; you basically accelerate so fast that eventually you leave it very quickly. [00:21:38] Okay, cool. [00:21:50] And in some sense you can interpret this as: gradient descent prefers the minimum-norm solution in L2, or, sorry, actually, prefers the global minimum closest to the initialization, [00:22:22] because we have kind of claimed that, so, actually, here we have somewhat alluded to this already, but to be formal, in this case you can prove the following.
[00:22:37] You can prove that β* is actually the argmin of the L2 norm under the constraint that you fit the data: β* = argmin ‖β‖₂ subject to ⟨β ⊙ β, xᵢ⟩ = yᵢ for all i. [00:22:50] So suppose you try to find a global minimum with the minimum L2 norm. The constraint means that β is a global minimum: it satisfies all the equations; and if among those you minimize the L2 norm, then the result is actually equal to β*. [00:23:08] And the reason why this is true is similar to why the L1 norm works: just because the minimum of the squared L2 norm of β is the same as the minimum of the L1 norm of θ, and if you replace β ⊙ β by θ, then this is true by the standard theory, which I didn't show; but you know that if you look at all the linear models that fit the data and take the one with the smallest L1 norm, it's going to be θ*. [00:23:47] Actually, technically,
I think this should be argument to the o.2 [00:23:53] this should be argument to the o.2 because sorry [00:23:56] because sorry to the square root because there's a [00:23:58] to the square root because there's a there's a translation [00:23:59] there's a translation like the objective is the same but the [00:24:01] like the objective is the same but the but the arc means has the translation [00:24:04] but the arc means has the translation I'm not sure whether that makes [00:24:05] I'm not sure whether that makes that's right so like so if you so maybe [00:24:07] that's right so like so if you so maybe I slice it [00:24:09] I slice it maybe the easiest way to write this is [00:24:11] maybe the easiest way to write this is the following so [00:24:14] the following so so these two are exactly the same [00:24:16] so these two are exactly the same just because you have a translation if [00:24:19] just because you have a translation if you look at me [00:24:21] right and for the first object if the [00:24:23] right and for the first object if the argument is beta star and then you can [00:24:26] argument is beta star and then you can somehow see that the arc mean also [00:24:29] somehow see that the arc mean also transfers you know just by taking a [00:24:31] transfers you know just by taking a square root [00:24:33] so and this is also the case for linear [00:24:35] so and this is also the case for linear regression right recall that we also [00:24:37] regression right recall that we also proved that if you start with green [00:24:38] proved that if you start with green designs with zero and you do linear [00:24:41] designs with zero and you do linear regression you get the minimum solution [00:24:42] regression you get the minimum solution that face the data so it's very similar [00:24:45] that face the data so it's very similar at least from the on the surface from [00:24:47] at least from the on the surface from the formulas [00:24:48] the formulas like uh like you 
you have almost the same guarantee. [00:24:54] But I don't necessarily believe that this is always the case; I don't feel that you always find the minimum-norm solution, or the solution that is closest to the initialization, among those that fit the data. I don't think this is always true; I think there is still something special about these examples, so we cannot just extrapolate this generically. [00:25:19] Okay, so now we are going to try to prove this. Any questions so far? [00:25:41] I'll try to finish it in one lecture, but if we cannot, I'm going to refer you to the notes; the notes have a pretty detailed derivation. [00:25:52] So, to get some preparation, let's try to understand some basic facts about this loss function. [00:26:07] First of all, let's look at the population risk; this is the population version, no hat.
[00:26:13] This is the population risk, no hat: the population risk is L(beta) = (1/4) E[(y − ⟨beta ⊙ beta, x⟩)²]. And you can try to get rid of the expectation, because this is the population: what you do is plug in the definition of y, so you get (1/4) E[(⟨beta* ⊙ beta* − beta ⊙ beta, x⟩)²]. I think I have a one-fourth here in my population risk, right, because I know I have the additional one-fourth everywhere. And then this becomes (1/4) ||beta* ⊙ beta* − beta ⊙ beta||², one-fourth times the norm of the difference. This is just because the expectation of ⟨v, x⟩², when x is Gaussian, equals ||v||². And I'm going to claim the following: you're going to have uniform convergence for sparse beta. But we don't have uniform convergence over the entire space, right, because we have over-parameterization.
[00:28:02] If you had uniform convergence for everything, then there wouldn't be any implicit regularization effect; that would be the classical theory that we discussed in the first part of the course. But we claim that if you look at sparse beta, then you do have uniform convergence. So here is how I'm going to build towards this. First, there's a claim, which is: with high probability over the choice of data, if n is bigger than something like Õ(r / δ²), then for every r-sparse v you have the following, call it (3): (1 − δ) ||v||² ≤ (1/n) Σᵢ ⟨v, xᵢ⟩² ≤ (1 + δ) ||v||². So this is the empirical average of this kind of quantity. Why do we care about this? You can probably already see it here: the population risk has this form, something like ⟨v, x⟩² with an expectation over x, and this is the empirical one.
[00:29:38] I'm going to be more explicit in a moment, but this claim is kind of a small tool. It's saying that if you take this empirical version of ⟨v, x⟩², it's going to be very close to the population version, and the population version is just the two-norm of v squared, right? So the empirical average is going to be very close to the population, but only for v that is sparse; this concentration only works for sparse v. If you had enough samples, if n were infinite or close to infinite, then you should expect this to hold for every v, just by the law of large numbers, or standard concentration inequalities. But here the concentration claim is more subtle; there's something funny going on, because you only care about v's that are sparse, and you only have this many examples. You don't have a lot of examples.
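As a quick numerical sanity check (my own illustrative sketch, not from the lecture; the sizes, seed, and the single test direction are made up): with Gaussian x_i, the empirical average (1/n) Σᵢ ⟨v, xᵢ⟩² for one fixed r-sparse v already concentrates around ||v||² with n far below d. The lecture's claim is stronger, since it is uniform over all sparse v, which needs a covering argument on top of single-direction concentration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 2000, 5, 300            # n is far below d, but well above r*log(d)

# rows x_i ~ N(0, I_d), so E[(v . x)^2] = ||v||_2^2 for any fixed v
X = rng.normal(size=(n, d))

# one fixed r-sparse direction v
v = np.zeros(d)
v[rng.choice(d, size=r, replace=False)] = rng.normal(size=r)

emp = np.mean((X @ v) ** 2)       # (1/n) sum_i <v, x_i>^2
pop = np.sum(v ** 2)              # population value ||v||_2^2
print(emp / pop)                  # should land close to 1 despite n << d
```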
[00:30:39] n is not even bigger than d, right; it's only bigger than the sparsity of v. And this condition actually has a name. Just for the vocabulary, I guess this is something useful to know, even though we don't really depend on it for our purposes: we say that if the xᵢ's satisfy condition (3), then they satisfy the RIP condition. Basically, (3) is called the (r, δ)-RIP condition. The acronym is a little bit weird, but it stands for the restricted isometry property. The reason it's called restricted is that you are only restricting to sparse vectors v; if you were not restricting v, then this would be an isometry condition, because you would basically be saying that the xᵢ's are isotropic.
[00:32:03] They are kind of spread across all directions equally, right: the xᵢ's have covariance close to the identity. That's pretty much what it would say. So if you required this for every v, you'd be saying something about the covariance. You know what equation (3) is really saying? It's just equivalent to: vᵀ ((1/n) Σᵢ xᵢ xᵢᵀ) v is bounded above by (1 + δ) vᵀ I v, and bounded below by (1 − δ) vᵀ I v, right? So suppose you required this for every v. Then what this is saying is that (1/n) Σᵢ xᵢ xᵢᵀ, in the PSD sense (how do I write this? okay, right, the standard notation for the PSD ordering, like this), is less than (1 + δ) times the identity and larger than (1 − δ) times the identity.
[00:33:31] So if you required it for every v, you would basically be saying that the covariance of the xᵢ's is as close to the identity as... I think it's not called isometric in that case, it's called isotropic; the covariance is close to the identity, that's the word I'm blanking on. So basically you're saying the covariance is close to the identity, but you are not requiring it for every v, right? And also, this cannot be true if you don't have enough data: we only have n less than d data points, so in our case this matrix is not even full rank. How could you expect it to be close to the identity? It only has rank n, because we only have n data points and n is less than d, so it's not even a full-rank matrix. There's no way, right? So instead, look at the quadratic form.
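The rank point is easy to see numerically; here is a small sketch (sizes arbitrary, not from the lecture): the matrix (1/n) Σᵢ xᵢ xᵢᵀ has rank at most n < d, so globally it is nowhere near the identity, yet restricted to a small support its eigenvalues are all close to 1, which is exactly the restricted-isometry picture.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n = 800, 4, 200

X = rng.normal(size=(n, d))
Sigma = X.T @ X / n                    # (1/n) sum_i x_i x_i^T, a d x d matrix

# Globally: rank at most n < d, so Sigma cannot be close to I_d.
rank = np.linalg.matrix_rank(Sigma)
print(rank)                            # equals n

# Restricted to a support S of size 2r, it is near-isometric:
S = rng.choice(d, size=2 * r, replace=False)
eigs = np.linalg.eigvalsh(Sigma[np.ix_(S, S)])
print(eigs.min(), eigs.max())          # both close to 1, i.e. within 1 +/- delta
```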
[00:34:25] If you only look at the quadratic form evaluated on sparse vectors v, then this matrix effectively looks like the identity; that's basically what this condition is saying. Okay, so once you have this lemma, this claim, then we know that you have uniform convergence for sparse beta. And this is just because L-hat(beta) is (1/4) · (1/n) Σᵢ (⟨beta ⊙ beta − beta* ⊙ beta*, xᵢ⟩)², and this is of the right form: you can treat beta ⊙ beta − beta* ⊙ beta* as v, and then you are in the form (1/n) Σᵢ ⟨v, xᵢ⟩². And this v is sparse if beta is sparse: beta* is always sparse, that's our assumption, and if beta is sparse too, then the difference is also sparse. You pay 2r, r from beta and r from beta*, so the sparsity of the whole thing is 2r.
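Putting this together numerically (again my own sketch with made-up sizes, using the lecture's noiseless model y_i = ⟨beta* ⊙ beta*, x_i⟩ and the 1/4 convention): for a random sparse beta, the empirical risk L-hat(beta) and the closed-form population risk L(beta) nearly agree, even though n is far below d.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n = 1000, 3, 250

beta_star = np.zeros(d)
beta_star[:r] = 1.0
X = rng.normal(size=(n, d))
y = X @ (beta_star * beta_star)        # y_i = <beta* ⊙ beta*, x_i>, noiseless

def L_hat(beta):                       # empirical risk, with the 1/4 factor
    return np.mean((y - X @ (beta * beta)) ** 2) / 4

def L(beta):                           # population risk in closed form
    return np.sum((beta * beta - beta_star * beta_star) ** 2) / 4

# a random r-sparse beta: the two risks are close, as the claim predicts
beta = np.zeros(d)
beta[rng.choice(d, size=r, replace=False)] = rng.normal(size=r)
print(L_hat(beta), L(beta))            # nearly equal
```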
[00:35:54] At most 2r. So then this means that this is close to (1/4) times the norm, (1/4) ||beta ⊙ beta − beta* ⊙ beta*||², and this equals L(beta). So for sparse beta you have uniform convergence, but you don't have uniform convergence over the entire space. And also, you can have uniform convergence for the gradient, if you really care about it; I think I will show this later. For sparse beta you can even show the gradient concentrates: the empirical gradient concentrates around the population gradient, for sparse beta. However, on the other hand, there exist dense beta such that, for example, L-hat(beta) is zero but L(beta) is bounded away from zero. So there are overfitting solutions; there are places where you don't have the property that the training and test losses are similar.
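These overfitting solutions are easy to exhibit at the level of the effective parameter theta = beta ⊙ beta (an illustrative sketch, not the lecture's construction; note that recovering an actual beta would additionally require theta ≥ 0 entrywise, which a generic null-space perturbation violates, so this only illustrates the linear-algebra point): since n < d, adding any null-space direction of the data matrix keeps the training loss at zero while making the population loss large.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, n = 60, 3, 25

theta_star = np.zeros(d)               # effective parameter theta = beta ⊙ beta
theta_star[:r] = 1.0
X = rng.normal(size=(n, d))
y = X @ theta_star

# Any theta = theta* + z with X z = 0 fits the training data exactly.
_, _, Vt = np.linalg.svd(X)            # rows n..d-1 of Vt span the null space
z = Vt[n:].T @ rng.normal(size=d - n)  # a dense null-space direction
theta_dense = theta_star + 3.0 * z

train = np.mean((y - X @ theta_dense) ** 2) / 4    # zero: perfect interpolation
pop = np.sum((theta_dense - theta_star) ** 2) / 4  # large population risk
print(train, pop)
```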
[00:37:32] But those are dense beta. Okay, so the question is why you find the sparse one but not a dense one, right, given that the dense ones don't have this nice property. So, we have done quite some preparation; the main intuition, or what we believe to be happening, is the following. You can think of this set S_r as the set of vectors that are sparse: the set of beta such that beta is r-sparse. Let's see how it's used. So suppose you look at the space, right: you have the entire space, which is something very large, and zero is somewhere here, this is the origin. And you have some family, call it S_r, which is the family of sparse vectors. And inside this S_r, everything behaves very nicely: training and test are just basically the same.
[00:38:41] Up to some small error, right? So the training and test losses are similar, and also in terms of gradients: the gradient of L-hat and the gradient of L are similar. And I think basically what happens is that you start from somewhere close to zero (the reason you cannot start exactly at zero is a subtlety that's not very important), and you can think of it like gradient descent: you are doing gradient descent on the empirical loss L-hat(beta). By uniform convergence, you're basically doing the same thing as gradient descent on L(beta), as long as you don't leave this set S_r. If you leave it, all bets are off, but if you don't leave it, it's fine. So it turns out that what happens is: when you do gradient descent, you can consider the alternative world where you do gradient descent on the population.
[00:39:39] So let's say this purple curve is gradient descent on the population loss L(beta). And it turns out that if you run gradient descent on the population, you are going to reach a point, which is beta star, which is kind of on the boundary of this set. And in this trajectory, you never leave the set S_r. So now, because you believe that the black trajectory is similar to the purple trajectory as long as they are both in the set S_r, and the purple trajectory never leaves the set S_r, that's why the black trajectory also converges to beta star. I'm not sure whether that makes sense. So basically, the purple one is the population trajectory and the black one is the empirical trajectory. You know that the empirical trajectory and the population trajectory are similar inside the set S_r; you don't know anything about the outside world, right? And you also know the population trajectory never leaves the set S_r.
[00:40:37] Then the black one probably shouldn't leave either, and the black one should be similar to the purple one, right? So, for example, suppose the purple trajectory looked like this, leaving the set. If that were what's happening, then you'd lose control: at the beginning you're following the purple trajectory, and then you leave the set, and then all bets are off, you don't have any control anymore. But this turns out to be not what's happening. What happens is that the purple trajectory stays in the set S_r for a long time, until it reaches beta star, and then it stays at beta star. So that's why this alternative situation doesn't happen; this is not what's happening. And inside S_r everything behaves nicely: there's only one global minimum, which is beta star, and nothing else.
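This picture can be simulated directly (a sketch under the lecture's model; the sizes, step size eta, initialization scale alpha, and iteration count are all my own choices): run the black trajectory, GD on L-hat, and the purple trajectory, GD on L, from the same small initialization alpha times the all-ones vector, and check that they end up at essentially the same sparse point.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, n = 200, 3, 150
alpha, eta, T = 1e-3, 0.1, 800

theta_star = np.zeros(d)
theta_star[:r] = 1.0                   # beta* ⊙ beta*, the r-sparse target
X = rng.normal(size=(n, d))
y = X @ theta_star

b_emp = alpha * np.ones(d)             # black trajectory: GD on L-hat
b_pop = alpha * np.ones(d)             # purple trajectory: GD on L
for _ in range(T):
    # grad L-hat(b) = (1/n) X^T (X (b⊙b) - y) ⊙ b   (with the 1/4 convention)
    b_emp -= eta * (X.T @ (X @ (b_emp**2) - y) / n) * b_emp
    # grad L(b) = (b⊙b - theta*) ⊙ b
    b_pop -= eta * (b_pop**2 - theta_star) * b_pop

print(np.linalg.norm(b_emp - b_pop))           # endpoints nearly coincide
print(np.linalg.norm(b_emp**2 - theta_star))   # both recover beta* ⊙ beta*
```

The two trajectories are not identical step by step, but as long as both stay in the sparse region, the empirical one tracks the population one to the same endpoint.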
[00:41:32] And outside S_r, you can imagine (let me use a different color) that there are a whole bunch of overfitting solutions. These are all solutions that make the empirical loss zero, and there are so many of them. But you never actually get to go to those places, just because your black trajectory is imitating the purple one, and the purple one didn't go to those places, so the black one doesn't go there either. So that's the intuition for why this works. Any questions? [A student asks why the purple trajectory doesn't leave the set.] Yeah, so I didn't give a justification for that either, right; that's something we're going to prove. And I don't think it's generic; it's something about the properties of this particular problem.
[00:42:41] If you see the proof, it's not that surprising, because gradient descent is, in some sense, a local search algorithm, right? You're trying to search your neighborhood first. You start from zero, or somewhere close to zero, and you gradually search the neighborhood until you find the global minimum. That's why you probably wouldn't go some circuitous way; you're going to go more or less straight to the closest point. But the real proof has to go through the math. [A student asks about the initialization.] Yeah, that's a great question. So the initialization is alpha times the all-ones vector, and literally speaking, that is not in S_r, right? I get asked this question many times; I think I had some remarks on it somewhere later, but since you asked early, I should just answer it here.
[00:43:43] So the question is why the initialization is not in S_r, and I think the right way to think about it is the following. Of course it's not exactly in the sparse set, but it's close, and close enough to the set: alpha times the all-ones vector is very close to zero, and zero is in the set. That's the property the proof is going to use. So yes, you're right that we can never say you're exactly in S_r; you're going to say you're in a neighborhood of S_r, where the distance is very small and depends on alpha. That's why we have to choose alpha to be very small. In some sense you really want to choose zero: from all of this discussion, the only thing you want to do is choose zero. But zero just happens to be a saddle point; that's unfortunate.
[00:44:36] So you have to perturb it a little bit. [A student asks whether this depends on positivity.] So the question is whether this particular property has anything to do with the positivity of beta star. I don't think so. Are you talking about beta star being positive, or the variable beta? Okay, beta star. So we assumed beta star is positive, right. But no matter what the sign of beta star is, beta star ⊙ beta star is always nonnegative, right? So if you initialize positively, the iterate just always stays positive, and basically you learn the absolute value of beta star. And learning the absolute value of beta star is not that different from learning beta star. So basically, if you don't assume beta star is positive, then you cannot claim that you recover beta star.
[00:45:56] You can only say that you recover the absolute value of beta star. But the picture, the intuition, is still the same after that change. [A student asks about the initialization vector.] Yes, so I guess the question is whether the initialization really has to be exactly alpha times the all-ones vector, and the answer is no; this is a great question. You don't have to do that. The only thing you have to do is initialize with some vector beta_0 where every entry is very small: you only need to make sure its infinity norm is very small, less than something like alpha. And you can even initialize entries negatively, I think; if you initialize an entry negatively, then that entry will just become negative eventually, but the sign doesn't matter that much, so that's fine.
just for convenience because it makes the proof cleaner right [00:47:24] Okay so [00:47:25] Okay so so given this uh plan this intuition so [00:47:28] so given this uh plan this intuition so it's natural that we should start [00:47:30] it's natural that we should start analyzing the population traction right [00:47:33] analyzing the population traction right like the purple one right so and then we [00:47:36] like the purple one right so and then we try to say that the black one is close [00:47:37] try to say that the black one is close to purple so so so let's start with the [00:47:40] to purple so so so let's start with the population trajectory [00:47:52] right so you can sometimes think of this [00:47:55] right so you can sometimes think of this as a warm up or in some sense this is a [00:47:58] as a warm up or in some sense this is a also a kind of similarly check for for [00:48:01] also a kind of similarly check for for this approach right so [00:48:04] this approach right so so this is I'm let me let me say the [00:48:07] so this is I'm let me let me say the theorem formally but I think you are [00:48:08] theorem formally but I think you are expected with a Serum is saying right GD [00:48:10] expected with a Serum is saying right GD on the population loss Will converge to [00:48:15] on the population loss Will converge to beta star and and [00:48:18] beta star and and in [00:48:20] I think all log [00:48:23] I think all log one over Epsilon Alpha over ETO [00:48:27] one over Epsilon Alpha over ETO tuition [00:48:29] tuition with [00:48:32] Epsilon error [00:48:34] Epsilon error in L2 distance [00:48:40] okay [00:48:41] okay so but I guess the formal theorem [00:48:43] so but I guess the formal theorem matters less than the the proof [00:48:45] matters less than the the proof um let's see how the proof goes the [00:48:48] um let's see how the proof goes the proof is kind of Brute Force [00:48:50] proof is kind of Brute Force um because you just really 
literally [00:48:51] um because you just really literally control [00:48:54] control each what each of the coordinates is [00:48:57] each what each of the coordinates is doing [00:48:58] doing so so it's pretty expensive and you see [00:49:00] so so it's pretty expensive and you see how the coordinates are changing [00:49:03] how the coordinates are changing um but the expression is actually a [00:49:05] um but the expression is actually a weakness in some sense because because [00:49:07] weakness in some sense because because you are doing so explicit derivation [00:49:09] you are doing so explicit derivation it's great for this problem [00:49:11] it's great for this problem but it's very it's harder to be [00:49:12] but it's very it's harder to be extendable [00:49:14] extendable I think that's the that's a general [00:49:15] I think that's the that's a general thing right so if you have a very [00:49:17] thing right so if you have a very various kind of strong analysis for toy [00:49:20] various kind of strong analysis for toy case that no that's not necessarily [00:49:22] case that no that's not necessarily always the good case because if it's too [00:49:23] always the good case because if it's too strong too explicit then the [00:49:26] strong too explicit then the expandability the the applicability to [00:49:29] expandability the the applicability to broader case becomes a problem [00:49:33] broader case becomes a problem um and and this is in my opinion the [00:49:35] um and and this is in my opinion the probably the the main reason why we [00:49:37] probably the the main reason why we cannot extend to more General cases [00:49:39] cannot extend to more General cases other than this simple quadratic one [00:49:41] other than this simple quadratic one there's there's a small cut there's a [00:49:43] there's there's a small cut there's a there's extension to the matrices case [00:49:45] there's extension to the matrices case but not fundamental extension so 
You can change all of these to matrices instead of vectors and it still works, but not beyond that. Okay, but anyway, let me do the analysis.

[00:49:58] The proof sketch is that you first compute the gradient. The loss is

L(beta) = (1/4) * || beta ⊙ beta − beta* ⊙ beta* ||_2^2,

and if you compute the gradient with respect to beta, it becomes

∇L(beta) = (beta ⊙ beta − beta* ⊙ beta*) ⊙ beta.

I guess you can verify this with scalars — the vector version is pretty much the sum of the scalar case over all the dimensions. Here all the dimensions are separated: the loss is a sum of terms, each about one coordinate, and then it's just a simple chain rule.
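As a quick sanity check (not from the lecture; the helper names here are made up for illustration), the gradient formula can be compared against a central finite difference of the loss:

```python
# Hedged sketch: numerically verify that for
#   L(beta) = (1/4) * || beta*beta - beta_star*beta_star ||^2  (elementwise squares)
# the gradient is grad_i = (beta_i^2 - beta_star_i^2) * beta_i.

def loss(beta, beta_star):
    return 0.25 * sum((b * b - s * s) ** 2 for b, s in zip(beta, beta_star))

def grad(beta, beta_star):
    return [(b * b - s * s) * b for b, s in zip(beta, beta_star)]

beta = [0.3, -0.7, 1.2]
beta_star = [1.0, 0.0, 1.0]
eps = 1e-6
for i in range(len(beta)):
    bp, bm = beta[:], beta[:]
    bp[i] += eps
    bm[i] -= eps
    fd = (loss(bp, beta_star) - loss(bm, beta_star)) / (2 * eps)  # central difference
    assert abs(fd - grad(beta, beta_star)[i]) < 1e-6
print("gradient formula matches finite differences")
```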
always [00:50:51] everything is multiplied by Beta always right the Boolean is always multiplied [00:50:53] right the Boolean is always multiplied with beta and this is why when L 0 is 0. [00:50:58] with beta and this is why when L 0 is 0. so and now let's look at uh the update [00:51:02] so and now let's look at uh the update the update will be beta t plus 1 is [00:51:04] the update will be beta t plus 1 is equals to Beta T minus ETA times it's [00:51:08] equals to Beta T minus ETA times it's beta t o dot [00:51:10] beta t o dot let's play the Star o dot beta star [00:51:15] let's play the Star o dot beta star timeso dot beta t [00:51:17] timeso dot beta t okay so so and this is this is really [00:51:20] okay so so and this is this is really this is everything is in B Dimension but [00:51:23] this is everything is in B Dimension but really you can view this as D separate [00:51:25] really you can view this as D separate update [00:51:32] in in the coordinates [00:51:36] in in the coordinates because each coordinates are not doing [00:51:38] because each coordinates are not doing having any kind of correlation with [00:51:40] having any kind of correlation with anything else [00:51:42] anything else so so this is really just the saying [00:51:44] so so this is really just the saying that bti is equals to bti minus ETA [00:51:50] that bti is equals to bti minus ETA bti Square [00:51:52] bti Square minus B [00:51:54] minus B star I a square [00:51:56] star I a square times [00:51:57] times bti [00:52:00] bti okay so so every corner are just doing [00:52:03] okay so so every corner are just doing separate things [00:52:04] separate things and [00:52:07] and um and maybe maybe it's used and but [00:52:09] um and maybe maybe it's used and but this different course has a lot of [00:52:11] this different course has a lot of differences [00:52:12] differences right because this one is different [00:52:14] right because this one is different otherwise all the chords are 
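Since the coordinates decouple, a minimal sketch can simulate the scalar recursion directly — the parameters (eta, alpha, step count) are illustrative, not from the lecture:

```python
# Hedged sketch of the decoupled per-coordinate GD dynamics
#   beta_i <- beta_i - eta * (beta_i^2 - beta_star_i^2) * beta_i.
# With beta_star_i = 1 the iterate should climb toward 1;
# with beta_star_i = 0 it should decay toward 0.

eta, alpha, steps = 0.1, 1e-3, 2000

def run(beta_star_i):
    b = alpha  # small positive initialization, as in the lecture
    for _ in range(steps):
        b = b - eta * (b * b - beta_star_i ** 2) * b
    return b

on_support = run(1.0)   # target beta_star_i = 1: grows, then converges to 1
off_support = run(0.0)  # target beta_star_i = 0: only ever shrinks
assert abs(on_support - 1.0) < 1e-3
assert 0.0 < off_support <= alpha
print(on_support, off_support)
```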
So the target is what differs. When i is in the support of beta*, which is denoted S, the update is basically

beta_i ← beta_i − eta * (beta_i^2 − 1) * beta_i

(recall beta*_i = 1 on the support; I'm dropping the subscript t for notational simplicity). And if i is not in the support of beta*, then beta_i is just updated to

beta_i ← beta_i − eta * beta_i^3.

[00:53:04] And you can see this all makes intuitive sense. Suppose we are in the first case, and beta_i is between zero and one — say that's its current value. Then (beta_i^2 − 1) is negative and beta_i is positive, so the whole correction term −eta * (beta_i^2 − 1) * beta_i is positive: the update is trying to increase beta_i.
Basically, the first update is trying to increase beta_i if it has not yet reached one, and the second update is doing the reverse: it says that as long as your beta_i is bigger than zero, you are trying to decrease beta_i. So basically the first update encourages beta_i to go to one, and the second encourages beta_i to go to zero. And that makes sense, because one is beta*_i in the first case and zero is beta*_i in the other case.

[00:54:06] Okay, so now let's try to do a more detailed calculation to see what happens in each of these cases — call them case one and case two.

Case one: here the update is trying to increase beta_i until it reaches one, but there are still two separate sub-cases. The first sub-case: suppose beta_i, at some time t, is less than a half — so you're only half done with your work.
[00:54:40] Then you can see what the change is:

beta_{t+1, i} = beta_{t, i} − eta * (beta_{t, i}^2 − 1) * beta_{t, i}.

We argued that this is trying to increase beta_i, and we can see by how much: it increases beta_i by the multiplicative factor (1 + eta * (1 − beta_{t, i}^2)). So in some sense you're multiplying beta_i to make it bigger, and how much bigger depends on the value of beta itself. But if we know that beta is not too big — here, less than a half — then we can bound this:

beta_{t+1, i} ≥ beta_{t, i} * (1 + eta * (1 − 1/4)) = beta_{t, i} * (1 + (3/4) * eta).

So we have exponential growth.
[00:56:02] If beta_i is already bigger than a half, let's see what happens. Now the growth rate might slow down, because if beta is close to one, then this constant (1 − beta^2) becomes close to zero, so your growth rate slows down. That's true, but what you can do is analyze how far you are from one — from your target. If you look at the distance to one, you get the following recursion:

1 − beta_{t+1, i} = 1 − beta_{t, i} − eta * (1 − beta_{t, i}^2) * beta_{t, i}.

Let's reorganize this a little bit. Using the fact that beta_{t, i} is bigger than a half, this is at most

1 − beta_{t, i} − eta * (1 − beta_{t, i}^2) * (1/2),

and now you can factor out (1 − beta_{t, i}) to get

(1 − beta_{t, i}) * (1 − eta * (1 + beta_{t, i}) / 2).
[00:58:01] You might feel this is a little bit unnatural, but if you see the final target, it's actually not super difficult to guess the intermediate steps. Now I'm going to use the fact that beta_{t, i} is bigger than zero, so that (1 + beta_{t, i}) / 2 ≥ 1/2, to get

1 − beta_{t+1, i} ≤ (1 − beta_{t, i}) * (1 − eta / 2).

So my point is: if you look at the final outcome, you see that you are no longer growing exponentially, but you are converging to one exponentially fast — you are decreasing your distance to one exponentially fast.
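The two bounds just derived — the growth factor (1 + (3/4) eta) below a half, and the (1 − eta/2) contraction of the distance above a half — can be spot-checked numerically. This is a hedged sketch with an illustrative eta:

```python
# Check on a grid, for the in-support update beta' = beta - eta*(beta^2 - 1)*beta:
#   growth:      0 < beta < 1/2   =>  beta' >= beta * (1 + (3/4)*eta)
#   contraction: 1/2 <= beta < 1  =>  1 - beta' <= (1 - beta) * (1 - eta/2)

eta = 0.05

def step(beta):
    return beta - eta * (beta * beta - 1.0) * beta

for k in range(1, 50):          # grid over (0, 1/2)
    beta = k / 100.0
    assert step(beta) >= beta * (1 + 0.75 * eta)

for k in range(50, 100):        # grid over [1/2, 1)
    beta = k / 100.0
    assert 1 - step(beta) <= (1 - beta) * (1 - eta / 2)

print("both regime bounds hold on the grid")
```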
[00:58:51] So in some sense the behavior of these dynamics has two regimes: when you are small, you are growing very fast, and then when beta becomes bigger, the growth rate slows down, but you are converging to one exponentially fast.

[00:59:09] So if you combine these two regimes — and you can also see that the dynamics maintain beta_i < 1: if you were less than one before, you will also be less than one afterwards — then the behavior, if you summarize, is this. In O(log(1/alpha) / eta) iterations you are in the first regime, where beta_{t, i} grows to a half exponentially fast. You only need this number of iterations because initially the entry is alpha and you want it to grow to a half (technically there's also a factor of two in there if you want), and you have the learning rate eta. Basically, each step multiplies by the growth factor (1 + (3/4) * eta); suppose T_1
is the number of steps in this phase: this factor is your per-step growth, you raise it to that power, and you want to grow by at least a factor of (1/2)/alpha — that's how you solve for T_1. Okay, so that's the first part.

[01:00:48] And then, in another O(log(1/epsilon) / eta) iterations, beta_{t, i} converges to 1 − epsilon. This is because you start from a half — from distance a half to the target — and want to get down to epsilon, and each step decreases the distance by a factor of (1 − eta/2); that's why you have to pay this number of iterations.

[01:01:20] I guess there is one small thing I skipped, about how you derive how many iterations you need: if you want (1 + eta)^T to be bigger than some number R, then T needs to be bigger than roughly
log(R) / eta, since log(1 + eta) is on the order of eta for small eta. This is just a rule of thumb burned into my head, but you can derive it yourself as well. Okay, cool.

[01:02:00] All right, so that's what happens with the coordinates that you want to converge to one. And you also have case two, and you can do the same thing; I don't want to bore you with all the derivations, but the derivation here is easy, because you're just trying to say that beta_i is decreasing, and at what speed.

[01:02:40] And here, interestingly, if you look at this literally, it's actually saying that beta_i eventually decreases all the way to zero — but somehow we care about something weaker. (Am I missing a plus sign here? No, I think I'm not missing a plus.) Let me make sure I don't make any mistake here.
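As an aside, the iteration counts above rest on the rule of thumb that (1 + c*eta)^T ≥ R needs T ≈ log(R) / (c*eta). A quick check, with illustrative constants:

```python
# Rule of thumb: (1 + c*eta)^T >= R  =>  T ~ log(R) / (c*eta),
# since log(1 + c*eta) ~ c*eta for small eta. Constants are illustrative.
import math

eta, c, R = 0.01, 0.75, 500.0   # e.g. growing from alpha to 1/2 with R = 1/(2*alpha)
T_exact = math.ceil(math.log(R) / math.log(1 + c * eta))
T_rule = math.log(R) / (c * eta)

assert (1 + c * eta) ** T_exact >= R          # T_exact steps really suffice
assert abs(T_exact - T_rule) / T_rule < 0.05  # the heuristic is within ~5% here
print(T_exact, round(T_rule))
```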
[01:03:37] Okay. So I have some derivation here — a small claim that is particularly useful for the empirical case, which I'm not sure I can get into, so I'm going to skip this part. At least for now, it's sort of trivial to see that beta_i is decreasing: if you start with alpha and you keep being smaller than alpha, that sounds trivial to see, so maybe let's just leave it there. This is enough for us to deal with the population case.

[01:04:16] So basically our conclusion is: you converge to something close to one in this number of iterations — the iteration count is something logarithmic times 1/eta — and you also have the property that all the entries are always less than one.
And also, the small entries are never growing. So basically your beta_t, at any time, looks like this: there are a bunch of entries, the ones in S, which are growing, and in the complement of S all of the entries are less than alpha forever. In the S coordinates you are potentially growing. So you can see that this is still always approximately r-sparse, because you have at most r big nonzero entries and all the other entries are very small. So approximately, the iterate always stays in X_r, just because the small entries keep being small.

[01:05:37] Okay, so now let's talk about the empirical case a little bit. The full analysis probably wouldn't fit within 15 minutes, but I think I can give you some idea about it.
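A minimal simulation of the population dynamics just summarized (dimension, sparsity, and step size are illustrative, not from the lecture): starting from alpha times the all-ones vector, the support entries climb toward one while the off-support entries never exceed alpha, so the iterate stays approximately r-sparse throughout.

```python
# Population GD on L(beta) = (1/4)*||beta*beta - beta_star*beta_star||^2
# (elementwise squares), from the small initialization beta_0 = alpha * ones.

d, r, alpha, eta, steps = 20, 3, 1e-3, 0.1, 1500
beta_star = [1.0] * r + [0.0] * (d - r)   # support S = {0, ..., r-1}
beta = [alpha] * d

for _ in range(steps):
    beta = [b - eta * (b * b - s * s) * b for b, s in zip(beta, beta_star)]
    # invariant claimed in the lecture: off-support entries stay below alpha
    assert all(beta[i] <= alpha for i in range(r, d))

assert all(abs(beta[i] - 1.0) < 1e-3 for i in range(r))  # support entries converged
print(beta[:r], max(beta[r:]))
```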
[01:05:59] Actually, I'm going to only do the case r = 1, because the case r > 1 is a little bit complicated.

[01:06:22] So, for some delta, in the case r = 1, you basically only need a logarithmic number of examples, and then GD converges. This is just a simplification of the theorem we have already stated — and actually, I guess it's also weaker. Maybe I should say it's a weaker theorem: weaker not only in the sense of simplification, but genuinely weaker. So: weaker and simplified. You get this number of iteration steps, and the error is less than something like a small power (a square root) of alpha.

[01:07:29] Here is why this is weaker than what we said before: before, the error could go to zero as long as you take alpha to be small enough. So this is weaker, because the error
doesn't go to zero. Before, we could make the error go to zero as alpha goes to zero; now, if you want to prove that, it depends on something like the number of examples. This is just a technicality: if you want to prove the case where the error goes to zero, you have to do actual work, which is probably too much for this course.

[01:08:13] So how do I do this? Maybe let me give the proof idea — it's pretty intuitive, given the figure we have drawn. You are just trying to show that L-hat(beta) is close to L(beta), and that's something you can prove very easily. So, step one: you try to prove that L-hat(beta) is close to L(beta) for every beta that is — I guess technically you have to say — approximately sparse,
because you can never be exactly sparse, as we discussed. So that's something that is relatively easy. And step two: you want to say that beta under GD on the empirical loss — the empirical trajectory — never leaves this set X_r very far; it never leaves it significantly.

[01:09:43] So how do we show step two? Basically you are trying to say that you stay close to the population trajectory — you basically want the error to not blow up. What does that mean? Maybe let's draw something here. You're trying to show that two trajectories stay close to each other forever. So you have a trajectory, the purple one — this is the population gradient descent — and what happens is: after you take the first step, you have some error here. So now these two trajectories are not doing the same thing anymore:
same thing anymore right initially you are taking the Grid in [01:10:21] initially you are taking the Grid in that same plot and now this purple one [01:10:23] that same plot and now this purple one is taking green at this point [01:10:25] is taking green at this point and and the black ones think moving at [01:10:27] and and the black ones think moving at this point [01:10:28] this point so you have this error in in not in [01:10:31] so you have this error in in not in terms of the the griding not different [01:10:33] terms of the the griding not different but also in terms of the the difference [01:10:35] but also in terms of the the difference at of the points where you are [01:10:37] at of the points where you are evaluating your gradients so it's [01:10:39] evaluating your gradients so it's empirical versus population that's one [01:10:41] empirical versus population that's one difference and the other thing is that [01:10:43] difference and the other thing is that you are evaluating the empirical and [01:10:45] you are evaluating the empirical and population growth at different places [01:10:47] population growth at different places and that could introduce a bigger error [01:10:50] and that could introduce a bigger error and then it will be introduced even [01:10:52] and then it will be introduced even bigger error so so if you don't do this [01:10:54] bigger error so so if you don't do this carefully then it's possible that [01:10:57] carefully then it's possible that eventually you go this way and the other [01:10:58] eventually you go this way and the other one goes the other way that's because [01:11:00] one goes the other way that's because the error keep blowing keep being bigger [01:11:03] the error keep blowing keep being bigger and bigger so basically you have to kind [01:11:05] and bigger so basically you have to kind of control [01:11:06] of control uh how the error control how the error [01:11:10] uh how the error control how the error changes 
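The coupling described here, an empirical gradient descent trajectory tracking its population counterpart, can be simulated. Below is a minimal sketch that swaps in plain least squares for the lecture's quadratically parametrized model, so the population gradient has the closed form w − w*; the dimension, step size, and sample sizes are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lr = 5, 200, 0.1
w_star = np.zeros(d)
w_star[0] = 1.0  # target, playing the role of beta* = e1

def max_gap(n):
    """Run gradient descent on the empirical and population least-squares
    losses from the same initialization, and return the largest distance
    between the two trajectories over T steps."""
    X = rng.standard_normal((n, d))
    y = X @ w_star                                      # noiseless labels
    w_emp = np.zeros(d)
    w_pop = np.zeros(d)
    gap = 0.0
    for _ in range(T):
        w_emp = w_emp - lr * X.T @ (X @ w_emp - y) / n  # empirical gradient
        w_pop = w_pop - lr * (w_pop - w_star)           # population gradient (closed form)
        gap = max(gap, float(np.linalg.norm(w_emp - w_pop)))
    return gap

# More samples -> empirical gradients concentrate -> trajectories stay closer.
gaps = {n: max_gap(n) for n in (20, 200, 2000)}
```

With more samples the empirical gradient concentrates around the population one, so the two trajectories separate less; this gap is exactly the quantity the inductive argument has to control.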
[01:11:14] So that's the key part, and it boils down to a lot of, at least on the surface, seemingly very boring calculations. If you really want to do all these calculations well, you have to understand a little bit about what each term means, and that does require some extra work. At the first level, this whole thing is a simplification of one of the papers I wrote a few years back, and when we did it, the first thing we tried was just to do the calculation: you try to understand which term is problematic, which term may cause a bigger blow-up, and then you pay more attention to that term, try to understand it a little better, and then maybe apply some inequalities. But basically, below this level it becomes quite technical.

[01:12:13] So I think I'm going to spend another five minutes on a few more things. To control this error, there is one thing we realized is useful, and it is actually an important, semi-conceptual point: it is good to represent your iterate in a convenient way. What does that mean? Take beta star; we already assumed its norm R is one, so let's assume beta star is just e1, that is, (1, 0, 0, ..., 0). You just want to say that you converge to this vector. One of the useful things we did is to write beta_t in the following way.
[01:13:08] We write beta_t = r_t * e1 + zeta_t, where zeta_t is an error vector; essentially, you write the iterate as a multiple of beta star plus some error. In some sense, beta star is here, you start from zero, and you measure how far you are from this line: r_t * e1 is how you represent where you are at time t. And the plan is to show that r_t goes to 1, because eventually you want to go to e1, and that zeta_t, the error term, always stays small; I think we prove it stays smaller than something on the order of alpha, for every t.

[01:14:07] That's the next level of the plan. And then basically what you have to do is derive a recursion for r_t and for zeta_t; we derive the recursion for both of these, and you can always keep in mind what happens in the population case, because the recursion for r_t has a population counterpart of the same form.

[01:14:45] So let me see which one I can talk about easily; let me simplify from my notes, I think I had some backup plan, yes, here. Basically, if you look at the recursion for r_t, it looks like

r_{t+1} = r_t - eta * (r_t^2 - 1) * r_t - (some term that depends on zeta_t).

I have all of these formulas written here, but I don't want to show all the details. And if you look at this, it is very similar to the thing we had before. Let me also change the superscript notation here.
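The noiseless part of this recursion (dropping the zeta_t-dependent term) can be iterated directly to see r_t going to 1. The step size and initialization below are illustrative choices, not values from the lecture.

```python
def iterate_r(r0=0.1, lr=0.05, steps=400):
    """Iterate r_{t+1} = r_t - lr * (r_t**2 - 1) * r_t, the population part
    of the lecture's recursion with the zeta_t-dependent term dropped."""
    r = r0
    traj = [r]
    for _ in range(steps):
        r = r - lr * (r * r - 1.0) * r
        traj.append(r)
    return traj

traj = iterate_r()
```

Starting from a small r_0 (a small initialization), r_t first grows roughly geometrically and then settles at the stable fixed point r = 1; the inductive argument sketched in the lecture shows the dropped zeta_t term is too small to change this behavior.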
[01:16:12] Okay, in my notes it is a superscript, so let me change everything so that the superscript convention is the same as here. If you look at this one, this part is the same as the update for beta that we had. Where is the update for beta? Here. This is the case where you look at the coordinate where beta star has entry one, and this is the update; if you just replace beta_i by r_t you get the same formula, so r_t satisfies the same recursion. So basically this part is what the population gradient does, and you have already analyzed that part. The only thing you have to deal with is how the error term affects you, and you inductively show that the error is small: under the assumption that the error is small, you can show that the update for r_t is basically doing the same thing as the update for beta did before.

[01:17:28] So that's how you deal with r_t. But how do you know that zeta_t is small? That becomes even more complicated, because zeta_t also has its own recursion, and I don't even see a simple way to write it. It is something like

zeta_{t+1} = zeta_t - (some matrix M_t) zeta_t - (some vector rho_t) * zeta_t,

something like this; I'm not going to define rho_t. And what you do is notice that this is somewhat similar to the beta_i recursion for the coordinates i not in the support. Okay, so what was that recursion? The beta_i update was something like beta_i^{t+1} = beta_i^t minus eta times beta_i^t cubed; I think if you really look at the derivation, it is something like a (beta_i^t)^2 term times beta_i^t. If you really try to match the terms, these two match, and these two match, and something here also matches if you look at the details, to some extent, not exactly. There is no way to match everything exactly, but you use the beta_i recursion as a reference, and you know that some terms already match. What remains is to compare this rho_t, which I didn't define, to this beta_t squared term, and you do some kind of concentration argument to show that they are similar. Exactly which concentration bound you show also depends on the exact terms. So somehow you relate the zeta_t recursion to the beta recursion under the hood, so that you can show zeta_t does not grow eventually, because you know beta_t does not grow; that is what we proved easily. And once you relate zeta_t to beta_t, you can also show that zeta_t does not grow eventually.

[01:20:06] I think that's pretty much the best I can do in a short amount of time, and the details are in the notes. Any questions?

[01:20:28] [Student question.] Sorry, yes; I think it was r_{t+1}. When I changed the superscript to a subscript I forgot that one. Yeah, thanks.

[01:21:10] [Student question: last lecture we saw two examples where gradient descent converges to the solution closest to the initialization; why, empirically, do you still have to use explicit regularization like weight decay?] So, I would like to argue that empirically weight decay is actually not very strong, so it is not even clear whether weight decay is really doing that much regularization.
[01:21:45] Because, with the same weight decay, a network can actually memorize the training data; you can even memorize training data with random labels. Suppose you permute your labels, so there is no pattern, just random labels: you can still use the same weight decay, train your network with the same weight decay, and find a zero-training-error solution. So it seems to us that weight decay is not really doing that much regularization, at least not as strong as the theoretical setting would suggest. For example, in this case, or in the previous case, suppose you explicitly find the minimum-norm solution, that is, you use a strong regularizer to say you want a solution with small norm; then you cannot fit random labels anymore.

[01:22:39] And another tricky thing is that weight decay in practice also has some other effects, which, for example, regulate how batch normalization is working. If you have batch normalization, then the model becomes scale-invariant: if you multiply all the weights by two, technically you don't change anything, but somehow you want to regulate that in some way, because in certain cases it changes the optimization. So basically, okay, this is a good question and I don't have a very concrete answer, but what we believe is that weight decay is not actually doing strong work in terms of the standard regularization of the norm, and we also somehow suspect that weight decay has some other effects to some extent. And sometimes weight decay is not even important: if you remove the weight decay, you still get pretty good results in certain cases. I guess that's the best we know for now. Yeah, any other questions?

[01:24:03] Okay, sounds good. I guess I'll see you on Wednesday.

================================================================================
LECTURE 015
================================================================================
Stanford CS229M - Lecture 16: Implicit regularization in classification problems
Source: https://www.youtube.com/watch?v=mham4hHpo7A
---
Transcript

[00:00:05] Okay, hi everyone. Yeah, let's get started. So today we are going to continue to talk about implicit regularization. Last time we talked about the implicit regularization of initialization, and today, this is the last lecture; actually, in the last two lectures we have talked about the implicit regularization of initialization.
[00:00:43] And today we are going to have two parts. One part is that we continue with implicit regularization, and this is a better characterization in certain cases, as I will describe more. And the other is that we are going to talk about the classification problem. In all the past examples we were talking about regression problems, and it turns out that for classification problems the behavior is a little bit different: instead of converging to some minimum-norm solution, you converge to a max-margin solution, which is in some sense similar, but not exactly the same as, the regression case. I guess with this lecture we are going to conclude the discussion of the implicit regularization of initialization, and in the next lecture we are going to talk about stochasticity; that will be the last lecture about implicit regularization.

[00:01:46] So okay, so today we are going to have two parts, number one and number two. The first part is a more precise characterization, in a certain case, of the implicit regularization effect of initialization: you can see exactly how the initialization influences the regularizer. And as preparation for today's lecture, we are going to talk about so-called gradient flow. I was trying to avoid this notion in the past, but I think its spirit has shown up in the past as well. Basically, this is gradient descent with an infinitesimal learning rate.

[00:02:43] And the reason this is useful is that, in certain cases, with an infinitesimal learning rate you can ignore the second-order effects from the learning rate.
[00:02:55] This just makes the analysis much simpler: you don't have to say how small the learning rate is, and you don't have to deal with the second-order effect, because the second-order effect is literally zero. And it is actually also a pretty clean formulation of optimization, even though it is in continuous time. So what you do is the following. Say you have a loss function L(w). If you do gradient descent, then you take w(t+1), and now I am using parentheses for time because I am going to use the same notation for continuous time:

w(t+1) = w(t) - eta * grad L(w(t)).

This is what we do with gradient descent.

[00:03:50] And now suppose you rescale time by eta. What do I mean? Currently, when you do gradient descent, every update increases the step counter by one: before the step the time is t, and afterwards it is t+1. Suppose instead you change the time scale and say that every update advances the time counter only by eta instead of by one. What you get is

w(t + eta) = w(t) - eta * grad L(w(t)),

and these two processes are effectively the same; only the unit of time has changed, by a factor of eta (or 1 / eta).

[00:04:32] So now that you have rescaled time, you can take eta to zero, and this becomes a differential equation, a kind of continuous process. You can write it as

w(t + dt) = w(t) - dt * grad L(w(t)),

where, depending on which convention you come from, you replace eta by dt; this is how you take eta to zero. And this is effectively saying that the derivative of w with respect to t, which we denote by w-dot(t), satisfies

w-dot(t) = - grad L(w(t)),

where w-dot(t) is just the derivative of w with respect to the time t.

[00:05:35] And in some sense this allows us to ignore the eta-squared terms, because the eta-squared term becomes dt squared, which is zero compared to dt. That is why this will be useful for us. In some sense this is mostly to simplify the equations: all the technical meat is the same, it just makes the analysis cleaner. For the next two examples, in both of them, I am going to use this gradient flow formulation of gradient descent.
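This limit can be checked numerically. For the quadratic loss L(w) = w^2 / 2, the gradient flow has the exact solution w(t) = w(0) * exp(-t), and discrete gradient descent run for t / eta steps approaches it as eta goes to zero; the particular loss and eta values below are illustrative choices.

```python
import math

def gd_at_time(t, eta, w0=1.0):
    """Discrete gradient descent on L(w) = w**2 / 2 (so grad L(w) = w),
    run for t/eta steps so that the rescaled time equals t."""
    w = w0
    for _ in range(int(round(t / eta))):
        w = w - eta * w
    return w

t = 2.0
flow = math.exp(-t)  # exact gradient flow solution w(t) = w(0) * exp(-t)
errs = [abs(gd_at_time(t, eta) - flow) for eta in (0.1, 0.01, 0.001)]
```

The gap shrinks roughly linearly in eta; it is exactly the second-order (eta-squared per step) effect that the continuous-time formulation discards.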
[00:06:18] Okay, so now let's talk about the model we're going to discuss. The model is a variant of the one from the last lecture, and there are some reasons for changing the model a little bit — I'm going to discuss that, but it's not super important. The model we're going to use is a quadratically parametrized linear model, in some sense. It has two parts; let me write it down:

f_w(x) = (w_+^{⊙2} − w_−^{⊙2})ᵀ x,

where we use the notation x^{⊙2} to mean x ⊙ x, and ⊙ is the elementwise product. So it's (w_+ squared minus w_− squared, elementwise) transpose x, where w_+ and w_− are both vectors in R^d, and you can write w as the concatenation of w_+ and w_−; that's the parameter.

[00:07:37] So basically this is very similar to what we did last time. Last time we had something like f_β(x) = (β ⊙ β)ᵀ x. Basically now you have a negative term in it instead of just a positive one, and there are two reasons — two benefits compared to last time. They are not super important, but let me mention them. One is expressivity: the model from last time can only represent nonnegative linear combinations of the coordinates of x, because the elementwise product β ⊙ β is always nonnegative; now f_w(x) can represent any linear model, since the coordinates can also be negative.

[00:08:57] And the other benefit is that if you initialize w_+(0) and w_−(0) to be equal, then, as you can see, f_{w(0)}(x) = 0 for every x, just because the positive part cancels with the negative part. I guess you have seen this kind of thing before; this is mostly for convenience. It will make the analysis even more convenient, because the initialization has zero functionality — we used this for the NTK as well — and this will be useful in our analysis. Actually, what we're going to see today is that if you change the initialization, you get a different regularization, and you can precisely characterize how the regularization depends on the initialization. In one of the cases it will be the NTK regime; in the other case it will be similar to what we discussed last time.
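A quick sanity check of the two properties just mentioned (my own sketch, not code from the lecture; the names `f`, `alpha` are just for illustration): the symmetric initialization makes the model identically zero, and, unlike the (β ⊙ β) model, negative coefficients are representable.

```python
import numpy as np

def f(w_plus, w_minus, x):
    # quadratically parametrized linear model: the effective linear
    # coefficients are theta = w_plus**2 - w_minus**2 (elementwise squares)
    theta = w_plus**2 - w_minus**2
    return theta @ x

d, alpha = 5, 0.3
w_plus = alpha * np.ones(d)            # symmetric initialization w_+(0) = w_-(0)
w_minus = alpha * np.ones(d)

x = np.random.randn(d)
print(f(w_plus, w_minus, x))           # 0.0: positive part cancels negative part

# negative coordinates are now representable:
w_plus = np.array([1.0, 0.0])
w_minus = np.array([0.0, 1.0])
print(w_plus**2 - w_minus**2)          # theta = [ 1., -1.]
```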
[00:10:17] Okay, so let me continue with the setup. The loss function L̂(w) will be the squared loss, L̂(w) = Σᵢ (f_w(xᵢ) − yᵢ)². And we consider the initialization as we discussed: w_+(0) = w_−(0), so that you get zero initialization — the functionality of the initial model is zero. Also, for simplicity, we choose this to be α times the all-ones vector, and α is the thing we're going to change: we're going to see how the implicit regularization effect depends on the scale α. When α is small it gives you one thing; when α is big it gives you something else. And this all-ones vector is, as in the previous lecture, chosen somewhat for convenience — you can still do it with other initializations, it's just that the formulas will be a little bit more complicated.

[00:11:28] We can also look at the θ space. So θ corresponds to w: let's say θ is defined to be

θ = w_+^{⊙2} − w_−^{⊙2}.

This is the actual linear function you compute — the linear coefficients of w. And we are interested in what kind of linear model we learn eventually, right? So let w(∞) be the limit at time infinity — this is where gradient flow converges to — and let θ_α(∞) be the corresponding θ; this is basically the model corresponding to w(∞).

[00:12:36] So this is the coefficient vector that we care about: when you converge, at time infinity, what's the corresponding θ you get, and what are its properties? Sometimes, for simplicity, we just write θ_α and omit the infinity, but this is where we converge to. And for the sake of simplicity of the lecture, we assume everything has a limit and so forth — all the regularity conditions are assumed to be met. Okay, and also, just to set up some notation, let X ∈ R^{n×d} be the data matrix, and let ȳ be the label vector.
[00:13:36] Okay, so now here is the theorem that characterizes the implicit regularization; let me write it down and then interpret it. For any α, assume that gradient flow converges to a feasible solution — a solution that fits the data — in the sense that X θ_α = ȳ. If this is satisfied, it means that we fit the data exactly. I'm using a purple color for this because I don't feel this necessarily has to be an assumption; you can prove it. The paper did assume this in the theorem, but I don't think you have to. Actually, I checked with one of the authors two days ago, and he also thinks that you don't need it. But because this is not formally stated in the theorem, I'll assume it — I do strongly believe that you can prove you converge to such a feasible solution, you don't necessarily have to assume it, but anyway, let's assume it so that we are consistent with the paper.

[00:15:17] By the way, this is the paper by Woodworth et al., 2020 — I guess I'll probably add the link — and the title is something like "Kernel and Rich Regimes in Overparametrized Models". It's a pretty recent paper; it just showed up one or two years ago.

[00:15:57] So suppose you have this. Then we know θ_α is not only a feasible solution — it's actually the minimum-complexity solution, according to the following complexity measure:

θ_α = argmin_{θ : Xθ = ȳ} Q_α(θ).

So among all the feasible solutions, you find the one of minimum complexity, where the complexity is defined by Q_α. And what is Q_α?
[00:16:33] Q_α is a function of α, so the complexity measure does change as you change α. It equals α² — the α² doesn't really matter, because it's a scalar when you take the argmin — times a sum of a function of each of the coordinates:

Q_α(θ) = α² · Σᵢ q(θᵢ / α²).

And what is this little q? It's a one-dimensional function q : R → R (actually mapping R to the nonnegative reals, I think), given by

q(z) = 2 − √(4 + z²) + z · arcsinh(z/2),

something that I don't expect you to interpret directly — but we're going to look at special cases which we can interpret. This arcsinh — I guess it's pronounced "arc-sinch" in the U.S., or "shine"; I think in the U.K. it's called "shine", I don't know why — is the inverse hyperbolic sine, applied to z/2.
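A small sketch I wrote (not lecture code) of this complexity measure, using the explicit q from the Woodworth et al. (2020) paper, with numerical checks of its two limiting behaviors: for small z, q(z) ≈ z²/4 (an ℓ₂²-type penalty), and for large z, q(z) ≈ |z|·(log|z| − 1) (an ℓ₁-type penalty up to log factors).

```python
import numpy as np

def q(z):
    # scalar complexity function from Woodworth et al. (2020)
    return 2.0 - np.sqrt(4.0 + z**2) + z * np.arcsinh(z / 2.0)

def Q(theta, alpha):
    # Q_alpha(theta) = alpha^2 * sum_i q(theta_i / alpha^2)
    return alpha**2 * np.sum(q(theta / alpha**2))

# small z (the large-alpha regime): q(z) ~ z^2 / 4
z = 1e-3
print(q(z) / (z**2 / 4))               # ratio ~ 1

# large z (the small-alpha regime): q(z) ~ |z| * (log|z| - 1)
z = 1e6
print(q(z) / (z * (np.log(z) - 1)))    # ratio ~ 1
```

These two asymptotics are exactly what drive the α → ∞ and α → 0 limits discussed in the lecture.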
[00:17:51] So the first-order bit is that even though you didn't minimize this complexity measure in your algorithm — you only ran gradient descent — somehow you find the minimum-complexity solution, and the complexity is defined by something like this. Now let's try to interpret this abstract theorem. The important thing is the limits. In particular, when α goes to infinity — so if you have very large initialization — then q(θᵢ/α²) is something like (θᵢ/α²)², that is, roughly θᵢ²/α⁴, which means

Q_α(θ) ≈ (1/α²) · ‖θ‖₂².

So basically, if α goes to infinity, then this so-called complexity measure Q_α is essentially the ℓ₂ norm of θ.

[00:19:10] And if α goes to zero, what's the complexity measure? Then this regularization effect — this q(θᵢ/α²) — is roughly (|θᵢ|/α²) · log(1/α²). I don't expect you to verify this limit here, because you have to do some kind of Taylor expansion to see it, but this is the thing. So this means that

Q_α(θ) ≈ ‖θ‖₁ · log(1/α²),

in some sense the ℓ₁ norm of θ, up to a factor depending on α — but the constant doesn't really matter, because it doesn't change the ordering of the different θ's.

[00:20:24] So, in summary: when α goes to infinity, you get the minimum ℓ₂-norm solution in the θ space, which is a minimum ℓ₄-norm solution for the w's — because θ is the (signed) square of w. And when α goes to zero, this is similar to what we discussed in the last lecture: you get the minimum ℓ₁-norm solution in θ, which is the minimum ℓ₂-norm solution in the w's. So this second regime is what we have seen in the last lecture, with a very similar model. But this theorem characterizes the whole regime, and between α = 0 and α = ∞ you basically have some kind of interpolation between ℓ₁ and ℓ₂ regularization.

[00:21:39] So that's why this is more precise than before. Of course, for any particular α this scalar function q is a little bit complicated, but it's just a sum of something like a power of θᵢ in some sense — where the power is kind of between one and two. It's not exactly a power, but you can roughly think of it like that.
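The interpolation between the two regimes can be seen in a tiny experiment. This is my own sketch with made-up toy data, not the lecture's numbers: run discretized gradient flow on the squared loss of the quadratically parametrized model from initialization α·𝟙, for a small and a large α. Small α should land near the minimum ℓ₁-norm interpolant of the data; large α near the minimum ℓ₂-norm one.

```python
import numpy as np

def run(alpha, eta, steps, X, y):
    d = X.shape[1]
    wp = alpha * np.ones(d)            # w_+(0) = w_-(0) = alpha * ones
    wm = alpha * np.ones(d)
    for _ in range(steps):
        theta = wp**2 - wm**2
        r = X @ theta - y              # residuals
        g = X.T @ r                    # gradient of the loss wrt theta (up to 2)
        wp -= eta * 4 * g * wp         # chain rule through the elementwise squares
        wm += eta * 4 * g * wm
    return wp**2 - wm**2

X = np.array([[1.0, 2.0]])             # one example, two features (underdetermined)
y = np.array([3.0])                    # feasible set: theta_1 + 2*theta_2 = 3
theta_small = run(alpha=0.01, eta=1e-3, steps=20000, X=X, y=y)
theta_large = run(alpha=5.0, eta=1e-4, steps=20000, X=X, y=y)
print(theta_small)                     # near the min-L1 solution [0, 1.5]
print(theta_large)                     # near the min-L2 solution [0.6, 1.2]
```

Both runs fit the single equation exactly, but they pick different points on the feasible line, matching the two limits of Q_α.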
[00:22:05] [Student question.] Oh, okay — say it again? Yes, yes, α is the scale; this is the only thing in the algorithm that depends on α. We said the initialization is α times the all-ones vector. And this all-ones vector can actually be changed as well: I think if you change it to an arbitrary vector, then for one of the regimes you actually don't change anything, and for the other regime I think it changes a little bit, because the ℓ₂ norm becomes weighted by the initialization. If it's the all-ones vector, the weighting is just the same for all coordinates; if it's not the all-ones vector, you have different weights. Those details can be found in the paper; I'm just simplifying and not showing the exact weighting.
[00:23:08] Right, so here are some interpretations of this interpolation. In some sense — just to connect to what we have — you can view this as a unification of what we have discussed in the past few lectures. When α is small, this is small initialization, so this is basically similar to the previous case — the same intuition as the previous lecture. But there is a small thing to note: this is only exact when you go to the limit. You get the minimum ℓ₁-norm solution in the limit, but when α is not exactly zero, the regularization effect is not exactly "the closest solution to the initialization". I think the paper shows there are some tiny differences in some sense; only when α goes to zero can you basically say this is the closest solution to the initialization. But generally this is something we have kind of discussed last time.

[00:24:44] And when α goes to infinity, this is indeed the NTK regime. Why is this the NTK regime? I'm going to show you — this is similar to what we have discussed before, but let me just do it again. Recall that in the NTK regime we had these two parameters, σ and β: β was the smoothness — the Lipschitzness of the gradient — and σ was the condition-number-type quantity. And recall that we had this discussion that if β/σ² goes to zero, then you are in the NTK regime — you can approximate by the linearization. So now let's compute what σ and β are in this case.
[00:25:45] The gradient with respect to the parameters at the initialization: take w(0) to be α times the all-ones vector, so w_+(0) and w_−(0) are both α·𝟙. Then we can compute the gradient at initialization. There are two sets of parameters, and if you compute, the gradient for w_+ is something like this, and the gradient for w_− is analogous — sorry, this is an elementwise product; you can do this easily just by the chain rule for every dimension:

∇_{w_+} f_w(x) = 2 w_+ ⊙ x = 2α x,  and  ∇_{w_−} f_w(x) = −2 w_− ⊙ x = −2α x

(I guess there's a factor of two here which I oftentimes drop), because these two are just α times the all-ones vector. So the gradient feature is 2α(x, −x).

[00:27:03] And now you can see that σ and β both depend linearly on α. What is σ? σ comes from the gradient matrix — the feature matrix that consists of the gradient at every data point — and it scales linearly in α. And β is the Lipschitzness, which also scales linearly in α, because α is multiplied in front of the gradient. So both of these scale linearly, and that's why β/σ² converges to zero as α goes to infinity: in the denominator you have a degree-2 dependency on α, and in the numerator you only have a linear dependency on α. That's why this whole thing goes to zero as α goes to infinity.
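The gradient computation above is easy to verify numerically. A sketch (my own, not from the lecture) checking by central differences that at the symmetric initialization w_+ = w_− = α·𝟙 the gradient of f_w(x) with respect to w_+ is exactly 2αx:

```python
import numpy as np

def f(wp, wm, x):
    return (wp**2 - wm**2) @ x

d, alpha = 4, 0.7
wp = alpha * np.ones(d)
wm = alpha * np.ones(d)
rng = np.random.default_rng(0)
x = rng.standard_normal(d)

# numerical gradient wrt w_+ via central differences
eps = 1e-6
num_grad = np.array([
    (f(wp + eps * np.eye(d)[i], wm, x) - f(wp - eps * np.eye(d)[i], wm, x)) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(num_grad - 2 * alpha * x)))   # ~0: matches 2*alpha*x
```

The gradient with respect to w_− is the same check with the sign flipped, giving −2αx, so the combined feature is 2α(x, −x) as on the board.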
[00:28:13] now the XR V of X is this this thing [00:28:18] the XR V of X is this this thing this feature map is really just [00:28:20] this feature map is really just literally this right [00:28:22] literally this right this is just a literally the trio [00:28:24] this is just a literally the trio feature because the only thing you did [00:28:25] feature because the only thing you did is that you flipped the X which doesn't [00:28:27] is that you flipped the X which doesn't really make any difference essentially [00:28:29] really make any difference essentially so so [00:28:31] so so um so basically you got the minimum Norm [00:28:33] um so basically you got the minimum Norm solution so you got [00:28:35] solution so you got so so if you believe in the ntk [00:28:37] so so if you believe in the ntk perspective you should get the minimum [00:28:39] perspective you should get the minimum Norm solution so the ntk perspective [00:28:42] Norm solution so the ntk perspective will also tell you that you get minimum [00:28:43] will also tell you that you get minimum Norm solution [00:28:48] um uh [00:28:49] um uh like according to the features right you [00:28:52] like according to the features right you know Norm solution [00:28:53] know Norm solution minimum L2 Norm solution [00:28:55] minimum L2 Norm solution using [00:28:57] using the feature [00:28:59] the feature basically x minus X [00:29:01] basically x minus X and x minus X is in terms of a feature [00:29:04] and x minus X is in terms of a feature that is not very different from X itself [00:29:06] that is not very different from X itself so so basically you just get essentially [00:29:08] so so basically you just get essentially the minimum L2 Norm solution for the [00:29:10] the minimum L2 Norm solution for the linear model [00:29:11] linear model so so that's the same as we um [00:29:14] so so that's the same as we um we discussed like the same conclusion as [00:29:16] we discussed like the same conclusion as 
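The claim above that the gradient at initialization is the feature map 2α(x, −x) can be sanity-checked numerically. This is a minimal sketch, assuming the quadratically parameterized model f(x) = ⟨w_+^{⊙2} − w_−^{⊙2}, x⟩ discussed in the lecture; the dimensions and data are arbitrary choices of mine:

```python
import numpy as np

# Quadratically parameterized linear model from the lecture:
# f(x) = <w_plus^2 - w_minus^2, x>, squares taken elementwise.
def f(w_plus, w_minus, x):
    return (w_plus**2 - w_minus**2) @ x

d, alpha = 5, 3.0
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
w_plus = alpha * np.ones(d)   # initialization w_+ = alpha * 1
w_minus = alpha * np.ones(d)  # initialization w_- = alpha * 1

# Central finite-difference gradients at the initialization.
eps = 1e-6
g_plus = np.array([(f(w_plus + eps*np.eye(d)[i], w_minus, x)
                    - f(w_plus - eps*np.eye(d)[i], w_minus, x)) / (2*eps)
                   for i in range(d)])
g_minus = np.array([(f(w_plus, w_minus + eps*np.eye(d)[i], x)
                     - f(w_plus, w_minus - eps*np.eye(d)[i], x)) / (2*eps)
                    for i in range(d)])

# Lecture's claim: at init the gradient is 2*alpha*x for w_+ and -2*alpha*x for w_-.
assert np.allclose(g_plus, 2 * alpha * x, atol=1e-5)
assert np.allclose(g_minus, -2 * alpha * x, atol=1e-5)
```

Since f is quadratic in each parameter block, the central difference here is exact up to floating-point roundoff.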
we discussed before. [00:29:23] Any questions so far?

[Student question, inaudible.]

[00:29:40] I think this is just because we do the kernel method. NTK tells you that you are doing the kernel method with a certain feature — so NTK means kernel method, and the feature just turns out to be this trivial feature — and the kernel method with that feature just gives you the minimum-norm solution. That's what the kernel method does. When you don't have enough data — when your feature dimension is bigger than the number of examples — in a kernel method you are learning the minimum-norm solution for the features, because otherwise you have to define something else; in the kernel method everything is L2, so you are minimizing the L2 norm. That's implicit in the kernel method, and it doesn't depend on your initialization, because it's a convex problem. You use a particular algorithm when you do the kernel method, and that algorithm gives you the minimum-norm solution.

[Student question, inaudible.]

[00:31:40] Yep — so, in some sense repeating the question and also answering it: when α goes to infinity, yes, your problem will be ill-posed in some sense — the optimization landscape will be very bad, just because your function will not be very smooth. And this part is hidden here because you are using gradient flow, that is, an infinitesimally small learning rate; that's why it's swept under the rug.
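The claim just above — that with more features than examples the kernel method implicitly picks the minimum L2-norm interpolant, independent of which convergent algorithm you run — can be illustrated in the linear case. A sketch on toy data of my own choosing (not from the lecture): gradient descent from zero lands on the pseudoinverse solution, which is exactly the minimum-norm interpolant.

```python
import numpy as np

# Overparameterized least squares (d > n): gradient descent from zero
# converges to the minimum L2-norm interpolant, i.e. pinv(X) @ y.
rng = np.random.default_rng(1)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta = np.zeros(d)
lr = 0.01
for _ in range(20000):
    theta -= lr * X.T @ (X @ theta - y)  # gradient of 0.5*||X theta - y||^2

theta_min_norm = np.linalg.pinv(X) @ y   # minimum-norm solution of X theta = y
assert np.allclose(X @ theta, y, atol=1e-6)           # interpolates the data
assert np.allclose(theta, theta_min_norm, atol=1e-6)  # and is the min-norm one
```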
[00:32:17] And in practice you also don't necessarily want to use a very large learning rate. One reason is the optimization, and the other reason is that maybe the L2-norm solution is also not good — you want the L1-type solution, at least for this particular setting — so that's another reason why you don't want to use a very large learning rate. Sorry — a very large initialization.

[00:32:39] And another thing is that in practice — this is about the empirical setup — people sometimes do use large initialization, but people don't use infinitesimally small learning rates. So then you still cannot get into the NTK regime. But that's a good thing, because you don't want to go into the NTK regime. That's why at the beginning some people had confusions: in the very first NTK paper, I think they claim that the initialization scheme they are studying is actually what people do in practice. And that's kind of true — it's very close to the Kaiming, uh, or the Xavier initialization in terms of the scale — but because the theoretical setup requires a very, very small learning rate, and empirically you don't use those small rates, and also the theoretical setup doesn't have the stochasticity — all of this together makes the theoretical setup different from the empirical setting. And that's a good thing, because the theoretical setup says that you don't really do anything super different from kernels.

[00:33:58] Okay, so now let's discuss the proof of this theorem. This proof is kind of
interesting in the sense that it is similar to the linear regression proof, but not similar to what we discussed last time. Not similar to the last lecture — and you would probably guess it should be, because the last lecture had almost the same model as this one, and it was only doing a sub-case of this, when α goes to zero — but it turns out the proof is very similar to the linear regression one. And you have these two steps. Step one: you find the invariant maintained by the algorithm, by the optimizer. [00:35:04] Recall that for linear regression the invariant was that θ is in the span of the x_i's. This was probably two or three lectures ago, when we analyzed the implicit regularization effect of initialization for linear regression: we said that because you initialize at zero and use gradient descent, you are always in the span of the data. Here we're going to find a different invariant, which is more complicated and even harder to express — but we will find an invariant, and then we use it. [00:35:43] Step two: characterize — I guess "characterize" is a very weak term — the solution using the invariant. You can use the invariant as additional information to pin down which solution you converge to. In some sense the difficulty is that without any additional information, you just know that you converge to a zero-loss solution; you don't know which one you converge to. The invariant tells you which one you converge to, and the invariant depends on α.

[00:36:21] And note there is nothing about population versus empirical here — everything is empirical. I didn't even define where the data come from; I only told you that this is the minimum-norm solution such that the empirical error is zero. I don't have to care about the population at all.

[00:36:47] How does this kind of technique compare with the technique we discussed last time, where you used the fact that the empirical loss concentrates around the population loss in certain regions and you do some kind of control of the dynamics? I don't know — it's kind of hard to compare; these are two different approaches. One good thing about this kind of approach is that it doesn't require the population — that sounds like a good thing.
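The linear-regression invariant recalled above — gradient descent from zero initialization keeps θ in the span of the data points — can be checked in a few lines. A sketch on toy data of my own choosing: every gradient step adds a vector of the form Xᵀr, which lies in the row span of X, so θ never leaves it.

```python
import numpy as np

# Invariant for least squares: starting from theta = 0, every gradient
# X^T (X theta - y) lies in the row span of X, so theta stays in that span.
rng = np.random.default_rng(2)
n, d = 4, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta = np.zeros(d)
for _ in range(500):
    theta -= 0.01 * X.T @ (X @ theta - y)

P = np.linalg.pinv(X) @ X  # orthogonal projector onto the row span of X
assert np.allclose(P @ theta, theta, atol=1e-8)  # theta is its own projection
```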
[00:37:16] But the bad thing about this approach seems to be that it's very hard to find invariants for harder, for more complex models — you will see the invariant is a little bit magical somehow. Then again, for more complex models even the previous approach, the one we discussed last time, wouldn't work either, so it's hard to say. Anyway, let's proceed to see how the proof works.

[00:37:46] We need a little bit of notation to simplify our exposition. Let X̃ be the extended data matrix: you concatenate X and −X so that you get an n × 2d matrix. This is just so we can write everything in matrix notation and don't have to carry the minus sign around. And we will take w_t to be the concatenation of w_{+,t} and w_{−,t}; this is of dimension 2d. And let w_t^{⊙2} denote the entrywise square of w_t. With this notation, X̃ w_t^{⊙2} = X w_{+,t}^{⊙2} − X w_{−,t}^{⊙2}, and you can verify this is really just what the model computes — this is the model output on the data points. I just want this so that we have the matrix notation.

[00:39:14] And now you can compute the derivative: ẇ_t = −∇L(w_t), because we are doing gradient flow. And what's the gradient of L at w_t? The loss function of w_t can now be written as L(w_t) = ½‖X̃ w_t^{⊙2} − y‖² — that's because I vectorized everything. Then, taking the gradient, you can use the chain rule, and — if you believe that I got the calculation right — this is ∇L(w_t) = 2 (X̃ᵀ r_t) ⊙ w_t, where r_t = X̃ w_t^{⊙2} − y is the residual vector. If you are familiar with linear regression, you will realize that X̃ᵀ r_t is kind of like what you get there: if this were linear regression, this term would be the gradient. But now it's not linear regression — you have the quadratically parameterized model — and that's why you also have to use the chain rule on the square of w_t; the extra ⊙ w_t is there because the parameterization is quadratic.
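The gradient formula just derived can be verified against finite differences. This is a sketch with toy data of my own choosing, using the loss L(w) = ½‖X̃ w^{⊙2} − y‖² as defined above:

```python
import numpy as np

# Check: grad L(w) = 2 * (X_tilde^T r) ⊙ w  for  L(w) = 0.5*||X_tilde w^{⊙2} - y||^2,
# where r = X_tilde w^{⊙2} - y and ⊙ is the entrywise product.
rng = np.random.default_rng(3)
n, d = 4, 3
X = rng.standard_normal((n, d))
Xt = np.hstack([X, -X])           # extended data matrix X_tilde, n x 2d
y = rng.standard_normal(n)
w = rng.standard_normal(2 * d)

def L(w):
    return 0.5 * np.sum((Xt @ w**2 - y) ** 2)

r = Xt @ w**2 - y                 # residual vector
grad_formula = 2 * (Xt.T @ r) * w # the chain-rule formula from the lecture

eps = 1e-6                        # central finite differences
grad_fd = np.array([(L(w + eps*np.eye(2*d)[i]) - L(w - eps*np.eye(2*d)[i])) / (2*eps)
                    for i in range(2 * d)])
assert np.allclose(grad_formula, grad_fd, atol=1e-4)
```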
[00:41:05] Anyway, that is just one way to think about why this is true; the formal verification would just be that you write out the chain rule and apply it to everything. (Oh, sorry — there should be a two here, which means my loss function should have a half. Let me see... where did I define the loss function... okay, here — right, that works out. My brain just automatically removes all the constants, so it's very hard for me to deal with this.) Okay, cool.

[00:42:28] All right, so now — we said we want to have some invariant for this. In some sense we're going to solve this differential equation, but you cannot really solve it exactly. I'm not an expert on solving differential equations, but I think this is beyond — this is something for which you cannot really have a closed-form solution. But the interesting thing is that you can get something without solving it exactly. So we claim — actually, in the paper they just say the claim is easy to verify — that w_t satisfies the following:

w_t = w_0 ⊙ exp(−2 X̃ᵀ ∫₀ᵗ r_s ds).

[00:43:33] Okay, why is this the case? First of all, this is not a solution — depending on what you mean by "solution", this is not a closed-form solution by my definition, because r_s is still a function of w. But it's going to be something very useful for us. And why is it true? It's actually relatively simple; here is the reason. Suppose you have a differential equation, something like u̇_t = v_t · u_t. I'm trying to abstract it a little bit so that I can give a clean analysis — and you can see that this is a good abstraction of what we had before, because before, on the left-hand side you have the derivative of w, and on the right-hand side you have something times w itself; so w plays the role of u. And given such a thing, you can always do the following: you can say u̇_t / u_t = v_t — that's always true — and the left-hand side (this is part of the magical thing in many cases) is the derivative of log u_t. I think I've seen this in other contexts too, like policy gradients. And then you can integrate both sides: you get log u_t − log u_0 = ∫₀ᵗ v_s ds. Now remove the log by exponentiating, and you get u_t / u_0 = exp(∫₀ᵗ v_s ds). And now, if you map u to a coordinate of w, and v to the corresponding coordinate of −2 X̃ᵀ r_t, then you can apply this and you get the desired result. And by the way, I need to make a remark here: this is an entrywise application of the exponential. X̃ᵀ is a matrix, times a vector this becomes a vector, you take the entrywise exponential, and you take the entrywise product with w_0.

[00:46:53] Okay, any questions so far?
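The claimed invariant can be sanity-checked by simulating the gradient flow with small Euler steps. A sketch on toy data of my own choosing — the step size and iteration count are arbitrary, and the check holds only up to discretization error; the second assertion uses the sinh form of θ that follows directly from the invariant with w_0 = α·1:

```python
import numpy as np

# Euler-simulate the gradient flow  w' = -2 (X_tilde^T r) ⊙ w  and check
#   w_t = w_0 ⊙ exp(-2 X_tilde^T ∫_0^t r_s ds)
# plus its consequence  theta_t = 2*alpha^2 * sinh(-4 X^T ∫_0^t r_s ds).
rng = np.random.default_rng(4)
n, d, alpha = 3, 6, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Xt = np.hstack([X, -X])            # extended data matrix
w0 = alpha * np.ones(2 * d)        # w_0 = alpha * all-ones
w = w0.copy()
lr = 5e-5                          # tiny step to approximate gradient flow
int_r = np.zeros(n)                # running estimate of ∫_0^t r_s ds

for _ in range(20000):
    r = Xt @ w**2 - y              # residual r_t
    int_r += lr * r
    w -= lr * 2 * (Xt.T @ r) * w   # one Euler step of the flow

w_pred = w0 * np.exp(-2 * Xt.T @ int_r)
theta = w[:d]**2 - w[d:]**2
theta_pred = 2 * alpha**2 * np.sinh(-4 * X.T @ int_r)
assert np.allclose(w, w_pred, rtol=1e-2)
assert np.allclose(theta, theta_pred, rtol=5e-2, atol=1e-2)
```

Note the invariant holds along the entire trajectory, not just at convergence, so the check does not require the loss to have reached zero.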
opinion I [00:47:05] a little bit magical in my opinion I don't have a [00:47:07] don't have a [Music] [00:47:07] [Music] um [00:47:08] um you know conceptual I think this is fun [00:47:10] you know conceptual I think this is fun but I think the the the the proof on a [00:47:12] but I think the the the the proof on a proof level is somehow there's a little [00:47:14] proof level is somehow there's a little kind of like [00:47:16] kind of like either you can call it coincidence on [00:47:18] either you can call it coincidence on Magic so so there turns out that this is [00:47:21] Magic so so there turns out that this is all you need to to verify this is a good [00:47:23] all you need to to verify this is a good solution uh this is the minimizer of the [00:47:26] solution uh this is the minimizer of the on of the solution so first of all first [00:47:29] on of the solution so first of all first of all we turn this into something about [00:47:30] of all we turn this into something about stata right so now we have a craft [00:47:32] stata right so now we have a craft position for w and let's turn it into [00:47:35] position for w and let's turn it into something about stata so recall that so [00:47:37] something about stata so recall that so and also we simplify this a little bit [00:47:39] and also we simplify this a little bit we call it w [00:47:40] we call it w plus 0 is Alpha o1 vector W minus 0 is [00:47:45] plus 0 is Alpha o1 vector W minus 0 is Alpha times over one vector so that [00:47:48] Alpha times over one vector so that means that W 0 is also all one Alpha [00:47:51] means that W 0 is also all one Alpha times all Vector this this is in 2D [00:47:53] times all Vector this this is in 2D dimension because W is a concatenation [00:47:56] dimension because W is a concatenation of W plus and W minus [00:47:58] of W plus and W minus so that's why this saying w0 is [00:48:01] so that's why this saying w0 is basically not important you just have [00:48:03] 
basically not important; you just have alpha. [00:48:04] So theta at time t, theta_t, is w_plus(t)^2 minus w_minus(t)^2, entrywise. Okay, let's use this formula — let's call it equation one. So the w_0 doesn't matter, right? The only thing it contributes is alpha, so we get a factor of alpha squared.

[00:48:45] Maybe I'll just do this small preparation here. X-tilde transpose is (X transpose, minus X transpose) — the data stacked with its negation. So if you take exponential of minus 2 x-tilde transpose times some vector v — and v will be this integral of the residuals — and then take this to the power 2, the 2 in the exponent will become 4, because it's an exponential: you get exponential of minus 4 x transpose v from the first block, and exponential of plus 4 x transpose v from the second block.

[00:49:56] This small derivation is just to deal with this, so you know that this thing to the power 2 will be something like that. And then the first block corresponds to w_plus and the second block to w_minus; that's why you get something like here: w_plus squared will be alpha^2 times exponential of minus 4 X transpose v, and w_minus squared will be alpha^2 times exponential of plus 4 X transpose v.

[00:50:42] I guess what I'm doing here is just trying to make you believe that this derivation is true — it should be a trivial derivation, there is nothing difficult. So, okay, this is the characterization of theta, and you can see that —
[00:51:03] this exponential of minus 4 X transpose v, minus the exponential of the same thing with the opposite sign — you can write this more succinctly as the sinh, the hyperbolic sine: theta equals 2 alpha^2 times sinh(minus 4 X transpose v). This is just the definition of sinh: sinh(t) = (e^t minus e^{minus t}) / 2, something like that.

[00:51:26] So okay, basically we have a characterization of theta. And you know that theta_alpha, right, is equal to theta_t at infinity. So this is equal to 2 alpha^2 times sinh(minus 4 X transpose times the integral from 0 to infinity of r_s ds) — that's what the final point satisfies. Maybe let's call this equation two. And we also know that X theta_alpha equals y, because we assume — or we can prove; I guess we discussed that we can prove — that you converge to a feasible solution; call this equation three. So I'm claiming that these two equations, two and three, turn out to be the optimality conditions.
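This characterization is easy to sanity-check numerically. Here is a small sketch (my own, not from the lecture; the sizes, seed, and step size are made up): run plain gradient descent on the reparameterized least-squares loss with w_plus = w_minus = alpha at initialization, accumulate the integral of the residuals with an Euler sum, and compare theta against the sinh formula of equation two.

```python
import numpy as np

# Sketch (not the lecture's code): gradient descent on
#   L(w) = 0.5 * ||X (w_plus^2 - w_minus^2) - y||^2,  w_plus = w_minus = alpha at init.
# Gradient flow gives theta_t = 2 alpha^2 * sinh(-4 X^T integral_0^t r_s ds),
# entrywise, where r_s = X theta_s - y is the residual at time s.
rng = np.random.default_rng(0)
n, d = 3, 6                       # underdetermined: more parameters than equations
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

alpha, lr, steps = 0.7, 5e-4, 120_000
w_plus = np.full(d, alpha)
w_minus = np.full(d, alpha)
nu = np.zeros(n)                  # Euler approximation of integral_0^t r_s ds

for _ in range(steps):
    r = X @ (w_plus**2 - w_minus**2) - y
    g = X.T @ r
    w_plus = w_plus * (1 - 2 * lr * g)    # flow: dw_plus/dt = -2 (X^T r) * w_plus
    w_minus = w_minus * (1 + 2 * lr * g)  # flow: dw_minus/dt = +2 (X^T r) * w_minus
    nu += lr * r

theta = w_plus**2 - w_minus**2
pred = 2 * alpha**2 * np.sinh(-4 * X.T @ nu)   # equation (2)
print(np.max(np.abs(X @ theta - y)))           # feasibility, equation (3)
print(np.max(np.abs(theta - pred)))            # sinh formula, up to Euler error
```

The agreement is only up to the step-size discretization, since the exact identity holds for the continuous-time gradient flow.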
[00:52:30] These two things turn out to be the optimality conditions of the program — of the program that we called one. One is this argmin thing; I guess it's far, far away on the board, so here, let's call this program "program one".

[00:52:57] Right, so you want to say that theta_alpha is the minimizer of this program one. And it turns out that theta_alpha satisfies these two equations, two and three, and these two equations are the optimality conditions of that optimization program one. And that optimization problem only has one solution, because it's convex. So that's why theta_alpha is the solution. That's the plan. Right, sounds good. So by optimality conditions I really mean the KKT conditions. I'm not sure whether all of you are familiar with the KKT conditions, so this is just a small bit of background about them.

[00:53:44] So these are optimality conditions for constrained optimization problems. To be honest, I never really remember exactly what the KKT conditions are in many cases, so what I'm going to show you is one way to think about it, which is probably not exactly the same as what you can read in a book, but it's going to be very similar. So suppose you have this kind of thing, an optimization program like this — minimize Q(theta) subject to X theta = y — and Q(theta) is convex. So first of all, the KKT condition is the following: it says that the gradient of Q(theta) needs to be equal to X transpose v for some v — in dimension, I think this dimension is n — and X theta needs to be equal to y. So this is the KKT condition for this kind of program. And one thing you can do is just look up a book and invoke a theorem which says that this is the optimality condition.
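To make the statement concrete, here is a sketch (my own toy example, not the lecture's) checking both KKT conditions on a program we can solve in closed form: minimize Q(theta) = ||theta||^2 subject to X theta = y, whose minimizer is the minimum-norm interpolant.

```python
import numpy as np

# Toy check of the stated KKT conditions for: min ||theta||^2  s.t.  X theta = y.
# The minimum-norm interpolant is theta* = X^T (X X^T)^{-1} y, and the
# conditions are (i) grad Q(theta*) = X^T v for some v in R^n, (ii) X theta* = y.
rng = np.random.default_rng(1)
n, d = 3, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta_star = X.T @ np.linalg.solve(X @ X.T, y)
grad_Q = 2 * theta_star                       # gradient of ||theta||^2

# (ii) feasibility:
print(np.max(np.abs(X @ theta_star - y)))
# (i) grad Q lies in the row span of X: solve X^T v = grad_Q by least squares
v, *_ = np.linalg.lstsq(X.T, grad_Q, rcond=None)
print(np.max(np.abs(X.T @ v - grad_Q)))
```

Here v = 2 (X X^T)^{-1} y works exactly, so both residuals are at machine precision.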
A theorem from a book will tell you that this is the optimality condition. [00:54:58] The way I think about it is the following, if you're interested. The way I remember this is that I re-derive it every time I need it, as follows. So I think the insight is that optimality at least means there is no first-order local improvement: if you perturb your solution locally a little bit, you shouldn't get a nonzero first-order improvement. But you also have to satisfy the constraints, so you only allow local improvements that keep the constraint satisfied — and "keep the constraint satisfied" also only up to first order, because you may not be able to do better than that.

[00:56:03] So what does this mean in this case? It means that we consider a perturbation delta theta — right, this is a perturbation. So how do you satisfy the constraint? To satisfy the constraint, the perturbation needs to be orthogonal to the row span of X, because if it's not orthogonal to the row span of X and you perturb, you may change X theta, and then you don't satisfy the constraint anymore. So this is the way to satisfy the constraint: you require X delta theta = 0. And now let's look at theta plus delta theta, the local perturbation — this still satisfies the constraint — and let's see what the value of Q is. So Q(theta plus delta theta) is equal, up to first order, to Q(theta) plus the gradient of Q inner product with delta theta — [00:58:32] [A student interrupts:] Hi, professor —
we cannot hear you. [00:58:39] [Professor] Maybe let's try this — can you hear me now? Thanks for letting me know. Yes. Is the audio good now? Okay, so I'm using my laptop's own microphone, so maybe let me turn it some way so that — yeah, thanks for letting me know. So maybe I'll rewind a little bit; I don't know how long ago you lost me.

[00:59:10] So I guess I'll just briefly go through the steps that we have discussed. I was saying that if the perturbation delta theta is orthogonal to the row span of X, then you always satisfy the constraint. And we want to figure out under what condition this perturbation can never improve your function — that is, can only make the function
bigger — because if some perturbation makes the function smaller, it means that this point is not optimal. So that's why you look at a Taylor expansion of this Q, and you find that the first-order change is this term, and you want this term to be always non-negative, because if it's negative then it violates the optimality assumption. So a necessary condition is that this term is always non-negative. But it's very easy to flip the sign of this term, because you can flip delta theta, so that basically means that for every delta theta in the orthogonal complement of the row span of X, this term has to be just literally zero — because if it's not zero, you can flip delta theta to make it negative.

[01:00:28] So that's why we are saying that for every delta theta orthogonal to the row span of X, this term is zero. And that really just means that this vector — because every delta theta in the orthogonal complement of the row span has zero inner product with it — this vector is in the complementary subspace of that subspace. So that's why this vector, the gradient of Q(theta), needs to be in the row span of X: so that for every vector delta theta orthogonal to the row span, the inner product is zero. And that's why it can be written as X transpose times v — X transpose v is the representation of a vector in the row span of X. So that's how we derive the KKT conditions; right, the KKT conditions were that the gradient of Q at theta has to be in the row span of X, and it also has to be a feasible solution. [01:01:47] Okay, cool. So this was a small digression about the KKT
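The perturbation argument itself can be checked numerically. A sketch (my own example, again on the closed-form program min ||theta||^2 s.t. X theta = y): a random direction in the null space of X is a feasible perturbation, has zero first-order effect on the objective, and can only increase it.

```python
import numpy as np

# Sketch of the perturbation argument at the constrained optimum theta* of
# min ||theta||^2 s.t. X theta = y: for any feasible perturbation delta
# (i.e. X delta = 0), the first-order change <grad Q(theta*), delta> is zero.
rng = np.random.default_rng(2)
n, d = 3, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta_star = X.T @ np.linalg.solve(X @ X.T, y)

# Build a random direction in the null space of X (orthogonal to the row span):
z = rng.standard_normal(d)
delta = z - X.T @ np.linalg.solve(X @ X.T, X @ z)   # project out the row span
print(np.max(np.abs(X @ delta)))                    # feasible: X delta = 0

grad_Q = 2 * theta_star
print(abs(grad_Q @ delta))                          # zero first-order change

# And the objective can only go up along a feasible perturbation
# (here the change is exactly eps^2 * ||delta||^2):
eps = 1e-3
Q = lambda t: t @ t
print(Q(theta_star + eps * delta) - Q(theta_star))
```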
conditions; if you're not familiar with them, then the only important thing is that this is the characterization of the optimal solution — I'll just assert that. And now it's just pattern matching, right? So this corresponds to this, obviously — the feasibility condition X theta = y corresponds to equation three — and this one, the gradient condition, really just corresponds to equation two. Okay, that's what I'm going to claim; it's not obvious to you yet, but let's see.

[01:02:25] So KKT tells you that the gradient of Q(theta) needs to be something like X transpose v, and the differential equation tells us — let me just rewrite it — that theta_alpha is equal to 2 alpha^2 times sinh(minus 4 X transpose times the integral of r_s ds). All right, so first of all, to simplify this, let's write it as 2 alpha^2 times sinh(minus 4 X transpose v'), maybe call it v-prime; the exact v doesn't matter. And then — I
guess let's also work on Q — the gradient of Q. You can compute this. In some sense, when you derive the Q, what you actually have to do is reverse-engineer it, do this in the other direction; but if you're just given a Q and you want to verify the proof, you can find that the derivative of Q is just arcsinh(theta / (2 alpha^2)), entrywise. This makes sense because Q is a sum of some function of each coordinate theta_i, and the derivative of Q is, in each entry, the same function of theta_i.

[01:04:05] And so then you can see that if you plug theta_alpha in here — arcsinh(theta_alpha / (2 alpha^2)) — the arcsinh undoes the sinh, and this is equal to just minus 4 X transpose v'. Right, so that's why the gradient of Q at theta_alpha is equal to minus 4 X transpose v', and this satisfies the KKT condition — the minus 4 doesn't matter, because v can be any vector. So that's why theta_alpha satisfies the KKT conditions.

[01:04:56] So is it the global minimum? I guess, you know, the last step: satisfying the KKT conditions means global minimum — this requires convexity of the program. The constraint is linear, so convex; and the objective, you can verify, is still convex — it's something between the ell-1 norm and the ell-2 norm, and convex.

[01:05:56] So if there are no questions, I'm going to move on to the next thing, which is about the classification problem. [01:06:08] Yeah — as many of you are saying, this proof style... I don't know, the plan sounds very intuitive, right? How do you prove something is the minimizer of a convex optimization problem? You have to verify the optimality conditions.
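This verification is easy to replay numerically. A sketch (my own; the closed form for q below is an assumption consistent with the stated derivative q'(t) = arcsinh(t / (2 alpha^2)), not something written in the lecture, and it is checked here by finite differences):

```python
import numpy as np

# The lecture only gives q'(t) = arcsinh(t / (2 alpha^2)) for the per-coordinate
# potential; one antiderivative consistent with that (an assumption, verified
# below) is q(t) = t * arcsinh(t/c) - sqrt(c^2 + t^2) + c, with c = 2 alpha^2.
alpha = 0.5
c = 2 * alpha**2
q = lambda t: t * np.arcsinh(t / c) - np.sqrt(c**2 + t**2) + c   # q(0) = 0

t = np.linspace(-3.0, 3.0, 13)
h = 1e-5
fd = (q(t + h) - q(t - h)) / (2 * h)                 # finite-difference derivative
print(np.max(np.abs(fd - np.arcsinh(t / c))))        # q' = arcsinh(t / c)

# If theta_alpha = 2 alpha^2 sinh(-4 X^T v), then entrywise
# grad Q(theta_alpha) = arcsinh(theta_alpha / (2 alpha^2)) = -4 X^T v,
# which has exactly the KKT form "X^T times some vector":
rng = np.random.default_rng(3)
X = rng.standard_normal((3, 6))
v = rng.standard_normal(3)
theta_alpha = c * np.sinh(-4 * X.T @ v)
print(np.max(np.abs(np.arcsinh(theta_alpha / c) - (-4) * X.T @ v)))
```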
It has to satisfy the KKT conditions — I guess that's probably more or less the only way to do it, if you want to show something is the optimizer of some optimization program. But it's kind of magical why it just happens to satisfy the KKT conditions. So, of course, there is something that we can choose: we can choose the Q to make it satisfy the gradient condition — that's something you can choose. But the magical thing is that the other things all match up, like the form "X transpose times something" — all of those things match up. And also, in some sense, you can always work with each coordinate independently in this special case, so that's also something that's maybe a little special to this particular model we're considering.

[01:07:12] All right, okay. So now let's move on to the classification problem, and we are looking at separable data, as we
always do for classification problems. And here we are going to only discuss one result, which says that if you do gradient descent, it converges to a max-margin solution. And this actually doesn't require any regularization — it works for any initialization. The only things you need are really gradient descent and some loss function, which I will define, and no regularization: you just run gradient descent on the loss function, you run for a long time, and you're going to converge to the max-margin solution.

[01:08:00] So I will again start with the setup. So now we have a dataset (x_i, y_i), i from 1 to n, and the x_i are in R^d, and y_i is a binary label, plus one or minus one. All right.

[01:08:38] [A student asks a question; the audio is garbled.] [01:08:48] Okay, yeah, so the question is about the previous thing, and the
question is about what happens if you don't use w squared — you use w to the power k. And this is a very good question; actually, this is exactly what the paper studies in its more technical part. So the short answer is that everything can still go through, but the eventual Q would be different: the form of your Q would be something that's no longer the ell-1/ell-2 interpolation — I think it depends on the power. So if the power is p — I don't exactly remember, but I think it's something like a 2/p norm when alpha is close to zero; and when alpha is going to infinity, I think everything is still the same — the NTK regime is not sensitive to this. So okay, and technically, why does everything go through? I think the reason is that, roughly speaking, you are only playing with this single-dimensional function, in some sense. So it won't be sinh anymore — probably there will be some constants, some other function — but this "X transpose times something" is still there; I think it's just changed. So eventually you just have to engineer a different Q to make everything work. And the Q is still, in some sense, coordinate-wise: it only depends on the coordinates, right? You do something on each coordinate and you take the sum; the Q still has that structure, so that's why it's still doable.

[01:10:27] Okay, cool. So going back to the classification problem: this is our setup, and here we are only going to do the linear model, even though some of this theory still works for nonlinear models, with roughly similar techniques and similar conclusions. And here we're going to have a loss function. So the loss function will be L-hat(w) — let's say we do the cross-entropy loss, the logistic loss —
[01:11:08] times h w x defined [01:11:20] and where this lost is this logistic [01:11:23] and where this lost is this logistic loss which is log [01:11:25] loss which is log of one class exponential [01:11:28] of one class exponential density [01:11:33] Okay cool so [01:11:36] Okay cool so um [01:11:37] um and the first thing is that to to kind [01:11:39] and the first thing is that to to kind of like there's some intuition so first [01:11:42] of like there's some intuition so first of all we have multiple [01:11:44] of all we have multiple Global mean [01:11:47] if a separable data [01:11:52] so this is a premises for any implicit [01:11:55] so this is a premises for any implicit requisition Factor if you don't have one [01:11:57] requisition Factor if you don't have one Global mean then and you can convert the [01:11:59] Global mean then and you can convert the global I mean there's no impressive [01:12:01] global I mean there's no impressive organization class [01:12:03] organization class this is just because you can [01:12:07] this is just because you can um [01:12:07] um you can always have like a infinite [01:12:10] you can always have like a infinite number of like separators pretty much [01:12:12] number of like separators pretty much unless in a very extreme case you just [01:12:14] unless in a very extreme case you just happen to get stuck at the exactly so so [01:12:16] happen to get stuck at the exactly so so for example let's I think it's probably [01:12:18] for example let's I think it's probably easy to draw something [01:12:19] easy to draw something so suppose you have some data points [01:12:20] so suppose you have some data points like this [01:12:22] like this and [01:12:23] and you have so many different possible [01:12:25] you have so many different possible separators [01:12:27] separators as long as you have one you perturb a [01:12:29] as long as you have one you perturb a little but it's still a separate so so [01:12:31] little but it's 
still a separate so so they are so there is this the infinite [01:12:36] my name w [01:12:39] my name w such that [01:12:42] W transpose x i y i [01:12:45] W transpose x i y i is bigger than 0 for every r [01:12:48] is bigger than 0 for every r so you have so many separators and for [01:12:51] so you have so many separators and for every w [01:12:53] every w for maybe let's say for infant number of [01:12:55] for maybe let's say for infant number of w bar such that where W bar is in it [01:12:59] w bar such that where W bar is in it the real bar is unit vector [01:13:03] this statement doesn't really depend on [01:13:04] this statement doesn't really depend on the log so you can always scale it so [01:13:06] the log so you can always scale it so for every W bar for any W bar such that [01:13:09] for every W bar for any W bar such that this [01:13:10] this you can scale it so if you look at our [01:13:14] you can scale it so if you look at our hat Alpha w [01:13:17] hat Alpha w bar [01:13:18] bar so we'll go to zero as [01:13:21] so we'll go to zero as Alpha goes to Infinity so any scaling of [01:13:25] Alpha goes to Infinity so any scaling of this unit separator [01:13:28] this unit separator if you scale it extremely [01:13:30] if you scale it extremely then you are gonna get uh the loss goes [01:13:33] then you are gonna get uh the loss goes to zero so basically you have so many [01:13:35] to zero so basically you have so many directions in it so you can go to [01:13:37] directions in it so you can go to Infinity in different directions and [01:13:39] Infinity in different directions and still converts to uh uh zero loss so [01:13:43] still converts to uh uh zero loss so basically in some sense if you are a [01:13:46] basically in some sense if you are a little sloppy about so all of the this [01:13:48] little sloppy about so all of the this infinity times W bar [01:13:50] infinity times W bar are Global minimum [01:13:54] are Global minimum of this loss function 
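This point can be checked numerically. A minimal sketch (the toy dataset, the particular unit-norm separator, and the helper names are illustrative assumptions, not from the lecture): the empirical logistic loss of a scaled separator α·w̄ shrinks toward zero as α grows.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Empirical logistic loss: (1/n) * sum_i log(1 + exp(-y_i w^T x_i)).
    np.logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

# Toy linearly separable data (assumed for illustration); labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_bar = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit-norm separator
assert np.all(y * (X @ w_bar) > 0)            # it indeed separates the data

# Scaling the separator: L(alpha * w_bar) -> 0 as alpha -> infinity.
losses = [logistic_loss(a * w_bar, X, y) for a in [1.0, 5.0, 25.0]]
assert losses[0] > losses[1] > losses[2]
```

So every scaled-up copy of any fixed separator is an approximate global minimum, matching the "infinitely many global minima at infinity" picture above.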
Just because the loss function goes to zero at infinity. [01:14:01] Maybe I should also draw this: the loss ℓ(t) looks like this — as t goes to infinity, you get close to zero loss. [01:14:13] And what's inside — what is t? t is yᵢ wᵀxᵢ, and this will go to infinity as you scale the norm of the classifier. [01:14:26] So you have so many directions you can find — there are so many global minima. The question is: which direction do you find [01:14:39] by gradient descent? If you just invoke a theorem about optimization, you know that gradient descent will find a solution with loss close to zero, but you don't know which direction it is. You still have a bunch of flexibility — going to infinity along many directions gets the loss to zero. [01:15:04] So that's the question we're actually going to address. [01:15:08] And we're going to say that this actually converges to the max-margin solution. [01:15:14] So let's define — the answer is the max-margin solution — [01:15:23] so I guess let's first define the margin and the normalized margin. [01:15:33] We have defined the margin in this class in many cases: the margin is the minimum, γ(w) = minᵢ yᵢ wᵀxᵢ. [01:15:47] And we also always assume the data are linearly separable — this quantity is only meaningful for cases where the data are linearly separable. And the normalized margin [01:16:10] is defined by normalizing this by the norm of w: γ̄(w) = minᵢ yᵢ wᵀxᵢ / ‖w‖, [01:16:16] because otherwise you can make the margin arbitrarily big or arbitrarily small by rescaling. [01:16:26] Okay, so the max-margin solution [01:16:33] is defined to be the w which, over all w, gives you the maximum normalized margin. [01:16:45] Let w* be the maximizer — this is the direction of the max-margin solution, with unit norm, because [01:17:02] if you only look at this objective, it doesn't depend on the scale; the scale is already normalized. [01:17:10] So we define w* to be the maximizer with unit norm. Okay, so basically we're going to prove that if you do gradient descent, you're going to go to infinity — but you will go to infinity only along the direction of w*. [01:17:30] So: gradient flow. [01:17:34] Here we're talking about gradient flow just because, you know, it's convenient, as we discussed. The claim is that gradient flow converges to [01:17:44] the direction of the max-margin solution, [01:17:55] in the sense that — I think we don't exactly show convergence in direction; we only show convergence in the [01:18:04] sense of the value of the margin.
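A small numerical sketch of these definitions (the toy data and the brute-force scan over unit directions are illustrative assumptions, not the lecture's method): the normalized margin is scale-invariant, and the max-margin direction w* can be approximated by scanning unit vectors in 2D.

```python
import numpy as np

def normalized_margin(w, X, y):
    """gamma_bar(w) = min_i y_i w^T x_i / ||w||_2.
    Only meaningful (positive) when w separates the data."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Toy separable data (assumed for illustration); labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([1.0, 1.0])
# Scale invariance: gamma_bar(c * w) == gamma_bar(w) for any c > 0.
assert np.isclose(normalized_margin(3.7 * w, X, y), normalized_margin(w, X, y))

# Crude search over unit directions for the max-margin direction w*.
angles = np.linspace(0.0, 2 * np.pi, 2000)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
margins = np.array([normalized_margin(c, X, y) for c in candidates])
w_star = candidates[np.argmax(margins)]
assert margins.max() > 0   # the data are separable
```

The scan is only for intuition in 2D; the lecture's point is that gradient flow finds this w* direction without any explicit search.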
If you really want to show exact convergence in direction, it requires a little more work. [01:18:13] So what we say is that the margin of your iterate will converge to the maximum possible margin γ̄: [01:18:25] γ̄(wₜ) → γ̄(w*) as t goes to infinity, where wₜ is the iterate at time t. [01:18:50] Why is this working? [01:18:53] The intuition for why this is working, and how we prove it, is some mixture of both, and then in the next lecture I guess I will prove the thing more rigorously. [01:19:06] So why is this working? The intuition has a few steps. Step one: this loss L-hat(wₜ) [01:19:19] is going to zero, [01:19:21] by a standard optimization argument, [01:19:27] which is not covered by this course — but I think you can quickly believe it: if your optimization is working, then your loss should go to zero. [01:19:42] Second — so that was observation one, and observation two: [01:19:53] this loss function, which we defined to be the logistic loss, ℓ(t) = log(1 + exp(−t)), [01:20:03] is actually close to the exponential loss exp(−t) [01:20:14] for large t. [01:20:17] This is just because you do a Taylor expansion: log(1 + x) is approximately x for small x — that's how you get rid of the log and the one. [01:20:31] And this is actually an interesting thing: you call it the logistic loss, but it's close to the exponential loss. [01:20:48] So in most of the proof — sorry, I'm going to use the exponential loss: in the proof I'm going to just assume the loss is exponential, [01:21:00] even though these small differences can be dealt with relatively easily. [01:21:07] And the third observation is that, because of observation one, the norm ‖wₜ‖ has to go to infinity. [01:21:19] The reason is that if the norm doesn't go to infinity, you can never make the loss close to zero. This is just because: [01:21:27] if ‖wₜ‖ is bounded — let's say it's bounded by B, suppose it's always bounded — then [01:21:37] you can always bound yᵢ wᵀxᵢ by something like B times the norm of xᵢ, so this is bounded. [01:21:50] And then your loss L-hat(wₜ) [01:21:56] is bounded below by something like [01:21:59] exp(−B · maxᵢ ‖xᵢ‖). [01:22:06] This is bounded away from zero, which contradicts observation one. [01:22:15] Right — if your norm is always bounded, then your loss is bounded below by some number; the number is very close to zero, but it's still bounded below by some positive number, which contradicts the convergence to zero. [01:22:28] So now it comes to the most important thing. With all of this preparation, we know that the norm goes to infinity. Then let's only look at the final case, the late regime, where ‖wₜ‖ [01:22:44] is very big — call this norm Q, with Q very big. [01:22:59] Then let's try to simplify the loss function and see what it looks like. [01:23:06] Maybe let's drop the t just for simplicity: suppose we look at some w such that ‖w‖ is very big. L-hat(w) is the average of [01:23:19] this logistic or exponential loss — we're not distinguishing them for now — [01:23:32] which is roughly equal to (1/n) Σᵢ exp(−yᵢ wᵀxᵢ). [01:23:44] And because the loss will be very close to zero, it's actually more informative to look at it in log space. [01:23:49] So if you take the log of L-hat(w), this is roughly equal to log Σᵢ exp(−yᵢ wᵀxᵢ), up to constants. [01:24:06] So this is a log-sum-exp — I'm not sure whether this is familiar to some of you — this is basically a softmax. [01:24:14] So I'm going to claim that this log-sum-exp is close to the max, maxᵢ (−yᵢ wᵀxᵢ). [01:24:35] Why is this the case? Let's do some abstract derivation again. [01:24:46] Okay — sorry, I think I should have another step first. [01:24:55] This will be a log-sum-exp, but I also want to use the fact that w has a large norm, so let's [01:25:03] pull the norm Q = ‖w‖ out in front: you get log Σᵢ exp(−Q · yᵢ w̄ᵀxᵢ), where w̄ = w/‖w‖ is the normalization of w. [01:25:14] Okay, so now I'm going to claim that this is close to [01:25:18] maxᵢ (−Q yᵢ w̄ᵀxᵢ). [01:25:25] Why is this the case? Those of you who are familiar with this know that log-sum-exp is kind of like a soft max. [01:25:33] So if you look at the log-sum-exp of [01:25:38] Q times uᵢ — [01:25:42] I'm trying to abstract it a little bit: you have a Q that is very large and uᵢ that are fixed — I claim that this is roughly [01:25:54] Q times maxᵢ uᵢ, [01:25:59] plus something like o(Q) — [01:26:01] something that doesn't depend on Q, or grows much more slowly, as Q goes to infinity. [01:26:07] So when Q is very big, this is really computing the max. This is kind of like the temperature in the softmax: if you make it big, the soft max becomes a hard max. [01:26:26] And if you want to prove this, just take the sum Σᵢ exp(Q uᵢ) and bound it. The upper bound: [01:26:35] replace each of these terms by the biggest one, giving the log of [01:26:41] n times exp(Q · maxᵢ uᵢ), [01:26:46] which is just log n [01:26:49] plus Q [01:26:52] times maxᵢ uᵢ. [01:26:57] And the log n is small compared to Q, because Q goes to infinity and n is something fixed in this abstraction. On the other hand, for the lower bound, you just keep only the term with the max and you get Q · maxᵢ uᵢ — drop all the other terms, there's no sum anymore, and the log cancels with the exponential. [01:27:18] So basically [01:27:21] the log-sum-exp is close to the max, up to an [01:27:25] additive factor of log n, and this factor log n will be negligible as Q goes to infinity. [01:27:32] And that justifies this step.
So what's going on here? This is saying that you are minimizing the loss, so you're minimizing the log of the loss as well. [01:27:49] So minimizing the loss is kind of like [01:28:06] minimizing the quantity maxᵢ (−Q yᵢ w̄ᵀxᵢ), [01:28:14] which means it's the same as maximizing — [01:28:22] so you are minimizing, but you're maximizing [01:28:26] the min of Q yᵢ w̄ᵀxᵢ. [01:28:36] Right — you just flip the sign. [01:28:43] Minimizing the max of this is the same as maximizing the min; this is just literally the same thing — it's not like you're switching a min and a max, it's really just the sign: [01:28:58] the max of −(·) is equal to minus the min, maxᵢ (−Q yᵢ w̄ᵀxᵢ) = −minᵢ Q yᵢ w̄ᵀxᵢ, [01:29:09] and then you can pull this minus through the minimization as well. [01:29:13] Okay, so [01:29:16] basically you are maximizing the margin — that's what this is saying. So if you do this [01:29:22] approximation, then you have maximized the margin as Q goes to infinity. [01:29:26] And next time we're going to make this more formal, with essentially the same intuition, but the proof will be more clean — it's not exactly like this; it's not like you're dealing with these error terms. [01:29:40] Okay, it's going to be a very clean proof. [01:29:44] Okay, I think that's — yeah, that's all for today. Thanks. ================================================================================ LECTURE 016 ================================================================================ Stanford CS229M - Lecture 17: Implicit regularization effect of the noise Source: https://www.youtube.com/watch?v=60GqpISCtCU --- Transcript [00:00:05] Uh, okay, cool, let's get started. [00:00:12] Today we're going to talk about implicit regularization of the noise. [00:00:28] And the plan for today:
today is that um [00:00:31] um because this is a pretty challenging [00:00:33] because this is a pretty challenging topic and and I think the [00:00:36] topic and and I think the the research Community is still in some [00:00:38] the research Community is still in some sense doing research on this [00:00:41] sense doing research on this um so we have some results [00:00:43] um so we have some results um it's pretty complicated so what I'm [00:00:45] um it's pretty complicated so what I'm gonna do is I'm going to somewhat um uh [00:00:48] gonna do is I'm going to somewhat um uh using a relatively heuristic approach [00:00:51] using a relatively heuristic approach so I'm going to try to convey the main [00:00:53] so I'm going to try to convey the main idea without doing the actual [00:00:56] idea without doing the actual uh rigorous statement so so in this [00:00:59] uh rigorous statement so so in this lecture I don't even think I have a [00:01:02] lecture I don't even think I have a formal statement to State because it's [00:01:05] formal statement to State because it's just a little bit too complicated and [00:01:06] just a little bit too complicated and unnecessary right so if I really [00:01:08] unnecessary right so if I really approved of the formal version of the [00:01:10] approved of the formal version of the ethereum they'll probably take two [00:01:12] ethereum they'll probably take two lectures or three lectures [00:01:14] lectures or three lectures so that's why instead I'm trying to kind [00:01:16] so that's why instead I'm trying to kind of at least convey the main intuition [00:01:19] of at least convey the main intuition why the noise is useful with some still [00:01:22] why the noise is useful with some still with some math because without the mask [00:01:24] with some math because without the mask you don't even see the intuition [00:01:25] you don't even see the intuition sometimes so but but the knife wouldn't [00:01:27] sometimes so but but the 
knife wouldn't be always rigorous and I want to know [00:01:29] be always rigorous and I want to know where it is kind of like a not rigorous [00:01:32] where it is kind of like a not rigorous and also some part [00:01:34] and also some part cannot be made rigorous without [00:01:35] cannot be made rigorous without additional assumption and that would be [00:01:37] additional assumption and that would be I will clarifying that as well some part [00:01:40] I will clarifying that as well some part is really just like for convenience I [00:01:42] is really just like for convenience I ignore some kind of like the dragons but [00:01:44] ignore some kind of like the dragons but they can be fixed by just a little more [00:01:46] they can be fixed by just a little more careful mess and some part is actually [00:01:48] careful mess and some part is actually fundamental challenges and you have to [00:01:50] fundamental challenges and you have to really use additional assumptions or [00:01:51] really use additional assumptions or maybe even change the problem setting to [00:01:53] maybe even change the problem setting to go through those steps uh regressive [00:01:56] go through those steps uh regressive so [00:01:58] so um so I guess um the the main portion of [00:02:01] um so I guess um the the main portion of the the talk the lecture is actually not [00:02:05] the the talk the lecture is actually not about any particular loss function it's [00:02:07] about any particular loss function it's about generic loss function we're going [00:02:09] about generic loss function we're going to make some special uh simplification [00:02:11] to make some special uh simplification for them but but you only even need to [00:02:13] for them but but you only even need to really think about parametrization uh in [00:02:16] really think about parametrization uh in most of this lecture so so the setup is [00:02:19] most of this lecture so so the setup is that we have a loss function [00:02:22] 
that's for for this function G Theta and [00:02:25] that's for for this function G Theta and I'm also going to use x as the variable [00:02:27] I'm also going to use x as the variable as certain cases so and the the the [00:02:30] as certain cases so and the the the stochastic descent algorithm by noise I [00:02:33] stochastic descent algorithm by noise I really mean the noise in ICD [00:02:37] the stochastic algorithm will analyze is [00:02:39] the stochastic algorithm will analyze is something like this [00:02:41] something like this so Theta T is equal to 30 plus 1 is [00:02:44] so Theta T is equal to 30 plus 1 is equal to T minus Some Noise ingredient [00:02:48] equal to T minus Some Noise ingredient so we have the [00:02:50] so we have the full gradient plus some stochastic noise [00:02:54] full gradient plus some stochastic noise so where [00:02:57] so where the expectation of this quasi T is zero [00:03:01] the expectation of this quasi T is zero so this is really a mean zero noise but [00:03:05] so this is really a mean zero noise but a distribution [00:03:07] a distribution the city you know in the most General [00:03:08] the city you know in the most General case [00:03:11] case in most General case [00:03:20] the distribute distribution of the city [00:03:25] the distribute distribution of the city can depend on sterility [00:03:29] right so the noise is distribution [00:03:31] right so the noise is distribution depends on which point you are [00:03:34] depends on which point you are evaluating right right so you can see [00:03:37] evaluating right right so you can see this formulation at least so far at a [00:03:40] this formulation at least so far at a very general level does capture for [00:03:42] very general level does capture for example stochastic as you usually know [00:03:45] example stochastic as you usually know like the mini batch sarcastic will be [00:03:47] like the mini batch sarcastic will be set because suppose you take a mini [00:03:50] 
set because suppose you take a mini batch gradient with a few samples then [00:03:52] batch gradient with a few samples then is indeed can be written as something [00:03:54] is indeed can be written as something like the full gradient plus a stochastic [00:03:57] like the full gradient plus a stochastic variable which has means zero [00:03:59] variable which has means zero but we are not going to analyze that [00:04:01] but we are not going to analyze that particular version because then the [00:04:03] particular version because then the noise it becomes too complicated in some [00:04:06] noise it becomes too complicated in some sense right so we're going to analyze [00:04:07] sense right so we're going to analyze much simpler noise in most of the cases [00:04:09] much simpler noise in most of the cases like something like a gaussian noise [00:04:11] like something like a gaussian noise so this is a so so strictly speaking [00:04:14] so this is a so so strictly speaking this is more about noisy gradient [00:04:17] this is more about noisy gradient descent than stochastic mini batch stock [00:04:19] descent than stochastic mini batch stock has definitely designed but they do [00:04:21] has definitely designed but they do share you know a lot of similarities [00:04:25] share you know a lot of similarities okay so and what do I try to do is we're [00:04:28] okay so and what do I try to do is we're going to gradually build up our [00:04:29] going to gradually build up our intuition about how does this noise [00:04:31] intuition about how does this noise affect our optimization algorithm so so [00:04:35] affect our optimization algorithm so so we're going to start with you know a [00:04:37] we're going to start with you know a virus like we're gonna have several [00:04:39] virus like we're gonna have several levels warm-up so the first Walmart is [00:04:41] levels warm-up so the first Walmart is that what if you have a quadratic loss [00:04:43] that what if you have a 
[00:04:45] A quadratic loss pretty much means that you have a linear model under the hood, but here I don't even have a model parameterization — I only have a loss function. So say we have a quadratic loss function with Gaussian noise, and it's one-dimensional: θ is one-dimensional. [00:05:10] From now on I'm going to use x as my variable, just to make it more consistent with the optimization literature, and let's just assume g(x) = (1/2)x². This 1/2 doesn't really matter; it just makes the gradient cleaner. [00:05:27] So what's the update rule for this case? You are basically optimizing a quadratic function whose global minimum is at zero, but you are using gradient descent with noise: x_{t+1} = x_t − η(g′(x_t) + σξ_t), where ξ_t is a standard Gaussian — mean zero and standard deviation one — so the noise σξ_t has standard deviation σ. [00:06:05] Now let's compute the gradient: the gradient of (1/2)x² is just x, so x_{t+1} = x_t − η(x_t + σξ_t) = (1 − η)x_t − ησξ_t. [00:06:35] So what's happening here is that (1 − η)x_t is a contraction, meaning that it moves x_t toward the minimum — it makes x_t smaller by a factor of 1 − η — and −ησξ_t is the stochastic term, which may make x bigger or smaller depending on whether you are lucky or not.
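To see the contraction term and the stochastic term at work, here is a quick simulation of exactly this update (my own illustration; the values of η, σ, and the starting point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

eta, sigma = 0.1, 1.0   # arbitrary illustrative values
x = 10.0                # start far from the minimum at 0
traj = [x]
for t in range(500):
    # Noisy gradient descent on g(x) = x^2 / 2:
    # x_{t+1} = (1 - eta) * x_t - eta * sigma * xi_t
    x = (1 - eta) * x - eta * sigma * rng.normal()
    traj.append(x)

# Early on the contraction dominates and |x| shrinks geometrically;
# late iterates just bounce around 0 at a scale set by the noise.
print(traj[0], np.mean(traj[200:]), np.std(traj[200:]))
```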
[00:07:02] So the interesting thing is this. When x_t is large, the noise is not dominating — the contraction, the shrinking, is dominating. Because suppose your x_t is way out here: what happens is that you first contract it to somewhere closer, by multiplying by 1 − η, and then you add some stochastic noise, so maybe at the end you end up somewhere nearby. But still, largely speaking, you are moving toward zero because of the contraction, the shrinking — the shrinking is doing most of the work. So when x_t is large, the contraction dominates. [00:08:00] However, when x_t is small — or x_t is zero, for simplicity — the noise dominates the process. Suppose you start somewhere very, very close to zero: then when you shrink it, it doesn't really change much, because 1 − η times a small number is still a small number, and the noise will put you somewhere either on the left-hand side or the right-hand side. So the noise becomes the dominating part when x_t is small. [00:08:42] And eventually you basically converge to this second case, to some extent: if x_t is larger, you move toward zero, and what happens is that eventually x_t becomes somewhat small and the noise is kind of governing the whole process. So eventually you are just bouncing around the global minimum, at a certain level. You cannot bounce around at some very, very high values, because there your contraction is too large — you wouldn't be able to keep bouncing around at that level very long. So eventually you will be bouncing around at a certain level depending on the noise level. [00:09:33] It's kind of like what happens if you drop a ball into a concave valley without any friction — it's not exactly the same, because there you don't really have additive noise, but you still see this bouncing around, just because you can overshoot a little bit. Maybe that's not exactly the right analogy, but anyway, eventually you bounce around the valley at a certain level. [00:10:02] By the way, this so far has nothing really to do with implicit regularization, because whatever you do, you always stay close to the global minimum — there is no second minimum to prefer.
[00:10:17] But the intuition is very useful for the future, when we move away from this setting — so this is indeed important. And let's try to be more precise; this is actually a case where we can be precise, because we can solve the recurrence. [00:10:40] What happens is: x_{t+1} = (1 − η)x_t − ησξ_t. Then you plug in the same expression for x_t again: x_{t+1} = (1 − η)²x_{t−1} − (1 − η)ησξ_{t−1} − ησξ_t. And if you do this for another level, you get x_{t+1} = (1 − η)³x_{t−2} − (1 − η)²ησξ_{t−2} − (1 − η)ησξ_{t−1} − ησξ_t. [00:11:33] If you keep doing this more and more, eventually what you're going to get is x_{t+1} = (1 − η)^{t+1} x_0 − ησ · ∑_{k=0}^{t} (1 − η)^k ξ_{t−k}. So the sum is a linear combination of the Gaussians ξ_{t−k}, where the coefficient in front of each one is some power of 1 − η. [00:12:04] From this you can see there are actually several interesting things about this formula which can give you some intuition. One thing is that the first term, (1 − η)^{t+1}x_0, is a very strong contraction — this is the contraction part.
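One way to sanity-check the unrolled formula is to compare it numerically against just running the recurrence. A small sketch (parameter values are arbitrary, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
eta, sigma, T = 0.1, 2.0, 50   # arbitrary illustrative values
x0 = 5.0
xi = rng.normal(size=T)        # xi[t] is the noise added at step t

# Run the recurrence x_{t+1} = (1 - eta) * x_t - eta * sigma * xi_t.
x = x0
for t in range(T):
    x = (1 - eta) * x - eta * sigma * xi[t]

# Unrolled form:
# x_T = (1-eta)^T x_0 - eta*sigma * sum_k (1-eta)^k xi_{T-1-k}
closed = (1 - eta) ** T * x0 - eta * sigma * sum(
    (1 - eta) ** k * xi[T - 1 - k] for k in range(T)
)
print(abs(x - closed))  # zero up to floating-point error
```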
[00:12:25] In some sense this term comes from contracting the initial value by many factors of 1 − η, so that it basically becomes negligible when ηt is much bigger than one: (1 − η)^t is something like e^{−ηt}, and when ηt ≫ 1 this term becomes super small. [00:13:01] And you can view the other term as an accumulation of the noise. The noises are not just adding up; they are accumulated in a certain way, and maybe it's easiest to see it like this: the noise that you added last, ξ_t, is scaled by ησ, but the noise you added at the second-to-last step has an additional factor of 1 − η. [00:13:42] Where does this 1 − η come from? It comes from the contraction after the second-to-last step: ξ_{t−1} is what you added at the second-to-last step, and then, because you do another gradient step on top of that, you still contract that noise a little bit. The same thing happens further back: the factor (1 − η)² comes from the contractions in the last two steps. [00:14:15] So basically, every time you add noise at some intermediate step, this noise will eventually die out if you run for long enough, just because there is always a contraction applied after the noisy step. That's why the weights multiplied in front of the noises form a geometric series: depending on when you added the noise, the coefficient in front of it becomes smaller and smaller, so you forget about the very, very long history. If you added noise at the very first step — that's the term with k close to t — it doesn't really matter much, because you multiply (1 − η)^k in front of it, and because of the contraction it becomes less and less important. [00:15:13] So that's one thing: the accumulation of the noise prefers the recent history and ignores the long-term history. And another thing is that this is a sum of Gaussians, because each term is Gaussian: by assumption ξ is Gaussian, and a constant times a Gaussian is still Gaussian.
[00:15:41] And you can compute the variance of this sum: Var(x_{t+1}) = η²σ² · ∑_{k=0}^{t} (1 − η)^{2k}. [00:15:57] The point is that if you take t to go to infinity, then you know the limiting variance, the variance at the end: as t → ∞, the variance of x_t is roughly η²σ² · ∑_{k=0}^{∞} (1 − η)^{2k} = η²σ² / (1 − (1 − η)²) — that's how you compute a geometric series — which equals η²σ² / (2η − η²). The η² term in the denominator can be dropped because η is small, so this is approximately ησ²/2, on the order of ησ². [00:17:04] In other words, as t goes to infinity, x_t eventually has a Gaussian distribution with mean zero and variance on the order of ησ². [00:17:21] So far we haven't really talked about implicit bias yet, but I think we already got some intuition about what's happening in the convex case. A small learning rate η means that each iterate will bounce around less — small stochasticity of the final iterate — because the variance of the iterate x_t is smaller. And small noise σ implies the same thing. [00:18:12] So basically what happens here is that the noise only makes it harder to converge.
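The limiting variance can also be checked empirically. A sketch (the chain count and parameters are my own arbitrary choices): run many independent copies of the update long enough that the initialization is forgotten, and compare the empirical variance with ησ²/2.

```python
import numpy as np

rng = np.random.default_rng(3)
eta, sigma = 0.05, 1.0        # arbitrary illustrative values

# Run many independent chains long enough that eta * T >> 1,
# so the (1 - eta)^T x_0 contraction term is negligible.
n_chains, T = 20000, 400
x = np.full(n_chains, 3.0)
for t in range(T):
    x = (1 - eta) * x - eta * sigma * rng.normal(size=n_chains)

# The empirical mean is ~0 and the variance matches eta * sigma^2 / 2.
print(x.mean(), x.var(), eta * sigma**2 / 2)
```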
[00:18:28] So in some sense, if you only care about the quality of the final solution you converge to, the noise is always hurting you, especially if you're willing to take t to infinity. Here you can see that as t goes to infinity, you still never converge to exactly the global minimum: you always have some variance around the global minimum, and you want that variance to be as small as possible, because you want to be as close to the global minimum as possible. So the noise is only a hurdle; it isn't helping anything. [00:19:04] So this is why, in the classical convex optimization setting, when you think about noise, it is only about two things: (a) noisy gradient descent leads to less accurate solutions — that's the bad thing we just discussed — and (b) noisy gradients are faster to compute. [00:20:17] And the only thing you do is trade off these two factors. That's, I would say, the typical way of thinking about stochastic gradient descent when you think about the convex case: noise is bad because it hurts your final accuracy, but you want to allow some noise in certain cases because you compute faster. If you can trade this off in the right way, you get the fastest algorithm eventually. And you can kind of imagine how you trade this off: at the beginning of the optimization you don't care that much about accuracy — you don't care that much about converging to exactly the global minimum, you just want to get toward the global minimum as fast as possible — so at the beginning you don't care that much about noise, and that's why you use a large learning rate. [00:21:13] Then, when you're already close, your goal changes: now your goal is literally to go to the global minimum, period, so you can't allow much noise, and that's why you have to decay your learning rate. That's why there's always this kind of learning-rate decay, depending on your version of the algorithm. [00:21:37] Also, another thing — just a side remark which will be useful for comparison later: for any fixed η and σ, the expectation E[x_t] is always converging to zero as t goes to infinity. So even though there's stochasticity, there's this bouncing around, your average is always at zero. This is saying that there's no bias introduced by the stochasticity; you only introduce, in some sense, some fluctuation.
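The trade-off above suggests a concrete experiment (my own illustration, not from the lecture): run the 1D quadratic with a constant learning rate versus a schedule that decays partway through, and compare the final mean squared distance to the minimum.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, T, n_chains = 1.0, 2000, 5000   # arbitrary illustrative values

def run(eta_schedule):
    # Noisy GD on g(x) = x^2 / 2, averaged over many chains.
    x = np.full(n_chains, 5.0)
    for t in range(T):
        eta = eta_schedule(t)
        x = (1 - eta) * x - eta * sigma * rng.normal(size=n_chains)
    return np.mean(x**2)   # mean squared distance to the minimum at 0

constant = run(lambda t: 0.1)
decayed = run(lambda t: 0.1 if t < T // 2 else 0.001)
print(constant, decayed)   # the decayed schedule ends much closer to 0
```

The constant-rate run stalls at the variance floor of order ησ²/2, while the decayed schedule first moves fast and then shrinks that floor.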
[00:22:17] Of course the fluctuation is a deviation too, but at least the stochasticity doesn't introduce any systematic bias toward any particular direction. So that's another side remark, which we will compare against later. [00:22:37] And another small remark is that this process actually has a name: it's called the Ornstein–Uhlenbeck process. If you are familiar with this process from some other context, you can see this is actually doing the same thing, and we're going to call it the OU process just for simplicity. This is going to be a kind of basic building block for us to analyze more complex cases. [00:23:08] Okay, so we have kind of understood the quadratic, and now let's do the multi-dimensional quadratic, which is not really much different, but I think we need it, in some sense, for the sake of the future steps.
future like the the future steps so suppose you [00:23:25] like the the future steps so suppose you have a multi-dimensional quadratic [00:23:28] have a multi-dimensional quadratic yeah [00:23:30] yeah supposed to have something like GX is [00:23:32] supposed to have something like GX is equals to a half times x transpose ax [00:23:35] equals to a half times x transpose ax where a is a matrix [00:23:37] where a is a matrix in dimension d by D X is the variable in [00:23:40] in dimension d by D X is the variable in dimension d and a is PSD [00:23:44] dimension d and a is PSD so unless suppose suppose your noise [00:23:47] so unless suppose suppose your noise because CT now or let's not assume it's [00:23:50] because CT now or let's not assume it's just a spherical gaussian let's assume [00:23:52] just a spherical gaussian let's assume it has a covariance [00:23:53] it has a covariance Sigma [00:23:55] Sigma and then your update rule let's say [00:23:58] and then your update rule let's say suppose we care about this process where [00:24:00] suppose we care about this process where you [00:24:04] will descend with this [00:24:09] and then this becomes x t minus ETA the [00:24:13] and then this becomes x t minus ETA the gradient will just be a times x t [00:24:16] gradient will just be a times x t you add the city and this real arranging [00:24:19] you add the city and this real arranging you can I minus E to a [00:24:22] you can I minus E to a times x t minus ETA CT [00:24:26] times x t minus ETA CT so and you can do the same recursion [00:24:29] so and you can do the same recursion recursion as we did before right we [00:24:31] recursion as we did before right we replace x t the definition of x t uh as [00:24:34] replace x t the definition of x t uh as a function of x t minus 1 and you do [00:24:36] a function of x t minus 1 and you do this recursively eventually you if you [00:24:38] this recursively eventually you if you look out you get I minus ETA a [00:24:42] 
[00:24:45] x_{t+1} = (I − ηA)^{t+1} x_0 − η Σ_{k=0}^{t} (I − ηA)^k ζ_{t−k}. And we can see this is still the same kind of intuition. The first part is the contraction — of course now it's a matrix we are multiplying by, with eigenvalues less than one, so you are contracting in a matrix sense — and the second part is how the noise accumulates. The noise in the very far history — suppose you take k close to t, so that ζ_{t−k} is something from the remote history — becomes less important, because there's a contraction term applied after the noise is added. So this becomes a more complicated formula, but you can still somewhat
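As a quick numerical sanity check (a sketch, not from the lecture — the matrices A and Σ here are illustrative choices), iterating x_{t+1} = (I − ηA)x_t − ηζ_t agrees exactly with the unrolled closed form above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, T = 3, 0.1, 50

# Hypothetical PSD matrix A and noise covariance Sigma (illustrative choices).
A = np.diag([1.0, 2.0, 3.0])
L = rng.standard_normal((d, d))
Sigma = L @ L.T / d
zetas = rng.multivariate_normal(np.zeros(d), Sigma, size=T)

# Iterative update: x_{t+1} = (I - eta*A) x_t - eta * zeta_t
x = np.ones(d)
for t in range(T):
    x = (np.eye(d) - eta * A) @ x - eta * zetas[t]

# Closed-form unrolling:
# x_T = (I - eta*A)^T x_0 - eta * sum_k (I - eta*A)^k zeta_{T-1-k}
M = np.eye(d) - eta * A
x_closed = np.linalg.matrix_power(M, T) @ np.ones(d)
for k in range(T):
    x_closed -= eta * np.linalg.matrix_power(M, k) @ zetas[T - 1 - k]

assert np.allclose(x, x_closed)
```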
[00:25:49] do the same thing. Suppose — so you can still do a similar calculation if A and Σ are simultaneously diagonalizable. If they are not simultaneously diagonalizable, you can still do something to simplify this sum, but that's going to be even more complicated, so let's only think about the case where A and Σ are simultaneously diagonalizable. Then in some sense you can just view this as separating your process into one-dimensional problems in the eigen coordinate system, because when you use the eigen coordinate system, A and Σ are both diagonal matrices, and then you are basically just updating as if you are in one direction. And, in some sense more formally,
[00:27:20] what happens is: suppose A = U D Uᵀ, where D is the diagonal matrix which has the eigenvalues d_i of A, and suppose Σ = U diag(σ_i²) Uᵀ. Then as t goes to infinity, x_t roughly comes from a Gaussian with mean zero — because the first part contracts to zero — and you can look at the variance, which looks something like η² Σ_k (I − ηA)^k Σ ((I − ηA)^k)ᵀ. This is just because of computing the variance: for a linear transformation of a Gaussian, if you have some matrix W times ζ, the covariance is Cov(Wζ) = W Cov(ζ) Wᵀ = W Σ Wᵀ. That's how you compute the covariance of each of these terms, and then you
[00:28:42] sum them up; and I − ηA is a symmetric matrix, so it and its transpose are the same. So you can do this, and then you can simplify when you have the eigendecomposition: I − ηA = U diag(1 − η d_i) Uᵀ, and Σ = U diag(σ_i²) Uᵀ. Then you can compute this sum to be something like η² Σ_k U diag(σ_i² (1 − η d_i)^{2k}) Uᵀ — for the k-th power you just put the k in the exponent, because the inner U Uᵀ factors cancel when you multiply the sequence of these matrices.
[00:30:04] Right, that's this matrix — and this is the beauty of eigendecomposition, because everything becomes diagonal. Then you take the sum over k from 0 to infinity: Σ_{k=0}^{∞} (1 − η d_i)^{2k} = 1/(1 − (1 − η d_i)²) ≈ 1/(2 η d_i), so this becomes η² U diag(σ_i² / (2 η d_i)) Uᵀ = (η/2) U diag(σ_i² / d_i) Uᵀ — one factor of η cancels against the η² out front. So you can see that basically you have some noise, and the noise level in the i-th eigenvector direction is on the order of η σ_i² / d_i. And the noise level depends on — maybe let's just be precise — this is the original noise level, the iterate stochasticity,
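As a sketch (the eigenvalues and noise levels below are illustrative, not from the lecture), one can check numerically that the exact geometric series above matches the small-η approximation η σ_i²/(2 d_i):

```python
import numpy as np

# Check that the stationary covariance of x_{t+1} = (I - eta*A) x_t - eta*zeta_t
# matches ~ (eta/2) * sigma_i^2 / d_i per eigendirection, when A and Sigma
# share eigenvectors (hypothetical small example).
eta = 0.01
d_vals = np.array([1.0, 2.0, 4.0])   # eigenvalues d_i of A (illustrative)
sig2 = np.array([1.0, 0.5, 2.0])     # sigma_i^2, eigenvalues of Sigma (illustrative)

# Exact geometric series: eta^2 * sigma_i^2 * sum_k (1 - eta*d_i)^(2k)
exact = eta**2 * sig2 / (1 - (1 - eta * d_vals) ** 2)
approx = eta * sig2 / (2 * d_vals)   # small-eta approximation from the lecture

assert np.allclose(exact, approx, rtol=0.05)
```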
[00:31:40] the fluctuation level — because we are computing the fluctuation of the iterate. The fluctuation level in an eigenvector direction depends on the noise level in that direction, and also on how strong the contraction is. If the contraction is big, you are going to have smaller noise, smaller iterate stochasticity, because the strong contraction doesn't allow a lot of noise to build up; and if the noise is big, eventually you're going to have a larger fluctuation. Right, so — another small remark that is useful to realize is that this matrix U diag(σ_i² / d_i) Uᵀ is always in the span of Σ. So if Σ has some direction where — suppose Σ is low-rank,
[00:32:49] so in some direction there's no noise — then in those directions x_t doesn't have any fluctuation either. That will be something useful for us in the future. And another thing is that, if you think about the rough size of x_t — just the norm of x_t — this is on the order of √η, because this quantity, if you only look at the η-dependency, the norm of the stochasticity, or the fluctuation in the iterate, will be on the order of √η. This is something that's probably good to remember for the moment; it will be used by us in the future as well. Any questions so far? [Student question, inaudible] Right — all of those depend on other things too, for example on how large the σ_i are and how large the d_i are, but in terms
[00:34:05] of the dependency on η, this is on the order of √η — that's what I mean. [Student question] Yeah, I guess I'm only talking about the dependence on η so far. [Student question] Cool — yes, that's like the standard deviation of the stochasticity, essentially, the size of x_t. [Student question] Is this for η sufficiently small? — Yes, and I'm talking about the case when t is infinity. Okay, maybe one way to think about this — I think I kind of sense what your question is. This is the fluctuation in the eventual iterate, as t goes to infinity; it's different from the noise you add at each time. Actually, that's a very good question. If you look at the noise that you add at each time — how large is it? It's on the order of η, if you ignore all other dependencies except η. So each time you
[00:35:19] add some noise on the order of η, and eventually all of this noise builds up — it gets added up together — and it adds up to something on the order of √η. So all the noise kind of accumulates, but it wouldn't accumulate to infinity, just because of the contraction: this contraction term also shrinks the noise to some extent. But still, the noise builds up to something half an order higher in terms of η — it accumulates from η to √η over time. [Pause] So now we have a good understanding of what's happening: basically, eventually the iterate is bouncing around with a radius something like √η — in the value of this quadratic — and also you don't bounce around in those directions where you didn't add noise.
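A 1-d simulation sketch of this √η scaling (the constants a and σ are illustrative, not from the lecture): scaling η by a factor of 4 should roughly double the stationary fluctuation.

```python
import numpy as np

# Sketch: the stationary fluctuation of x_{t+1} = (1 - eta*a)*x_t - eta*zeta_t
# scales like sqrt(eta) (illustrative 1-d simulation).
rng = np.random.default_rng(1)
a, sigma = 1.0, 1.0

def stationary_std(eta, steps=200_000):
    x, xs = 0.0, []
    for t in range(steps):
        x = (1 - eta * a) * x - eta * sigma * rng.standard_normal()
        if t > steps // 2:          # discard burn-in
            xs.append(x)
    return np.std(xs)

s1, s2 = stationary_std(0.04), stationary_std(0.01)
# sqrt(0.04/0.01) = 2, so the fluctuation radii should be in ratio ~2.
assert abs(s1 / s2 - 2.0) < 0.3
```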
[00:36:27] Right, so — now let's look at the — [Student question] So is there any way to map this back — you want to connect back to the world where we have minibatch SGD in neural networks? For the convex case it's not that difficult. Basically what you say is: what is Σ? In our calculation, in our definition, Σ is the covariance of the noise in the gradient. You can compute the covariance of the noise when you use minibatch gradients. That is something that might change over time, but I think you can pretty much say that when you are kind of close to a global minimum, the changes in the covariance of the minibatch gradients are negligible — they are even higher order, so you can
[00:37:44] basically ignore it. So basically, if you want to map this back to the minibatch setting, this Σ will just map to the covariance of the minibatch gradient at the minimizer. So then you can kind of translate it back, but I don't think you get anything super interpretable anyway — so that's why I didn't get into it. [Student question] Yes, exactly — exactly, that's exactly correct. So suppose you have two dimensions — I think this is actually a very good question. Suppose you have one direction which is sharp, and suppose you have another direction which is flat, like this. So the question is: how does the noise affect the fluctuation? And also there's a question about how you evaluate the effect of the noise — what's the metric that you are thinking about? So far I'm thinking about how the noise changes the fluctuation
[00:39:06] in the iterate. So suppose I'm adding noise — the same amount of noise — in both of these cases. I think it's indeed true that the stochastic iterates will fluctuate more in the flat case; actually it would probably look something like this: if you do something stochastic like this, you're going to have a larger radius for bouncing around, and here, in the sharp case, you're going to have a smaller radius — you are going to be closer to the minimizer. However, even if you have a larger radius here, it doesn't necessarily mean that you have a larger effect on the function value: it fluctuates a lot, but the function is flat as well, so it's okay to fluctuate more in some cases. So let's see whether we can compute the fluctuations. Suppose you have η σ_i² / d_i — this is the squared radius of the fluctuation — and what do you multiply? You multiply by d_i, because d_i is the curvature of your objective function. So this gives η σ_i², which is something that doesn't depend on the curvature: d_i is the curvature, and η σ_i² / d_i is kind of like the x² — the fluctuation you have. So if you look at the effect on the function value, it may not depend on the curvature that much — at least not for the quadratic case. Right — does that make sense? Okay, cool. All right, so now let's talk about the non-quadratic function, and this is when things become interesting — but it's interesting only on top of what we have discussed; that's why we needed the warm-up. So: non-quadratic. And so far I'm still doing — you can still think of this as a convex function,
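A sketch of this curvature-independence claim (illustrative numbers; this uses the exact stationary variance of the 1-d process rather than the small-η approximation): the mean function value (d_i/2)·Var(x) barely changes when the curvature changes by a factor of 16.

```python
import numpy as np

# At stationarity of x_{t+1} = (1 - eta*d)*x_t - eta*zeta_t, the mean function
# value E[g(x)] = (d/2)*Var(x) ~ eta*sigma^2/4 is nearly curvature-independent.
eta, sigma = 0.01, 1.0

def mean_g(d):
    var = eta**2 * sigma**2 / (1 - (1 - eta * d) ** 2)  # exact stationary variance
    return 0.5 * d * var

# Two very different curvatures give nearly the same mean function value.
assert abs(mean_g(0.5) / mean_g(8.0) - 1.0) < 0.05
```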
[00:41:15] even a one-dimensional convex one, so far; I'm going to change that a little bit later. And again for simplicity, let's say — without loss of generality — let's assume the global minimizer of this g(x) is just zero. So we still have that zero is the global minimizer; we're still doing something around zero. And — I think I'm using the multi-dimensional notation right now here; oh, I remember, okay, so the reason why I'm using the multi-dimensional notation is that now I don't have to do the two things, the scalar case and the matrix case, separately. But in your mind you can pretty much interpret all of this as scalars. Okay — so I'm also assuming, because zero is the global minimizer, that means that the gradient at zero is zero, ∇g(0) = 0; that's a necessary condition. And it also means that the Hessian ∇²g(0) is PSD. Okay, so —
[00:42:29] and let's also assume — this is the part where it becomes not super rigorous, but we could make this part rigorous; it's just that I wouldn't have time to do all the rigorous stuff, but this part is doable — suppose the iterates are close to zero. So start from somewhere that's close to zero, and do a Taylor expansion around zero. So what you do is: x_{t+1} = x_t − η(∇g(x_t) + ζ_t), and you do a Taylor expansion to approximate the gradient at x_t. How do you do the Taylor expansion? You take the gradient around zero: you're going to get ∇g(0) + ∇²g(0)(x_t − 0) + ½ ∇³g(0)[x_t, x_t], and maybe let's also have higher-order terms, which we are going to ignore heuristically. And then
[00:43:54] I'm also going to get the − η ζ_t term. So I guess, if you're not familiar with the matrix notation, this is really just saying that g′(x_t) ≈ g′(0) + g″(0) x_t + ½ g‴(0) x_t², plus third-order — wait, what am I doing here — so there's no x_t on the first term; then there's x_t, then x_t squared, plus even higher-order terms. Right. But the analogous notation when you do the matrix thing: ∇²g(0) x_t is a matrix–vector product, and ∇³g(0)[x_t, x_t] is a tensor–vector product — let me explain that a little bit. So if you do the multi-dimensional case, ∇³g(0) is a third-order tensor, of shape d × d × d. So suppose you have a T that is a third-order tensor; then I'm using this notation T[x, y], where x and y are vectors, which is defined to be a vector.
[00:45:20] So this is the multiplication of this tensor with two vectors: first of all, it is a vector, and second, the definition of this is that the i-th coordinate of T[x, y] is Σ_{j,k} T_{ijk} x_j y_k. So basically you sum over the remaining indices j and k, and you leave the i alone on the left, and that's the outcome. So this is basically the scalar expansion in multiple dimensions. Okay — so, by the way, just a reminder for the scribe note-takers: for these kinds of small things I write on the side, please also take notes on them, because they are useful for readers as well — if someone doesn't have time to attend the lectures, they would like to know. So I think these small explanations are also useful; you can just have a small remark, or run it in a paragraph. So,
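The tensor–vector product just defined, (T[x, y])_i = Σ_{j,k} T_{ijk} x_j y_k, is a one-liner with `einsum` (a small illustrative sketch; the tensor and vectors are random):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
T = rng.standard_normal((d, d, d))   # a third-order tensor, d x d x d
x, y = rng.standard_normal(d), rng.standard_normal(d)

# (T[x, y])_i = sum_{j,k} T_ijk * x_j * y_k
Txy = np.einsum('ijk,j,k->i', T, x, y)

# Same thing spelled out with explicit loops over the remaining indices j, k.
ref = np.array([sum(T[i, j, k] * x[j] * y[k]
                    for j in range(d) for k in range(d))
                for i in range(d)])
assert np.allclose(Txy, ref)
assert Txy.shape == (d,)   # the result is a vector, as stated
```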
[00:46:34] All right, so now I have the third-order expansion, and we can see that, well, we're expecting something somewhat similar to what we have done before, and indeed you will see that. Because this term is going to be zero, since this is the gradient at zero, and that is zero, basically what we're going to get is x_t minus η times... okay, I guess, let me define, for simplicity, H to be the Hessian at zero. Then you can rewrite this as x_t minus η H x_t, minus η ζ_t, and minus the third-order term. I'm also going to define, let me see what my notation is here, I define T to be the third-order derivative, and then this term is T[x_t, x_t]. The higher-order terms, let's ignore them from now on; either way, we are not going to really formalize this, we just have an approximation here.
[00:48:10] So then this is x_{t+1} = (I − ηH) x_t − η ζ_t − η T[x_t, x_t]. And I guess what I was hoping for you to see is that the third-order term is something new, but the first and second order terms are not new: one is the noise term, and the second-order term is exactly what we had before. If you look at the quadratic case, you have contraction and you have noise; the contraction is linear, and you have the noise. Now the only difference is that you have this additional term from the third-order derivative, and that's expected, because if you ignore the third-order term it becomes just quadratic. That's why we want to expand to third order: we want to really use the fact that this is not a quadratic function. So basically you can think of this as two processes going on.
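To make the update concrete, here is a minimal simulation of this cubic model; H, T, the step size, and the noise are all made-up illustrative values, not anything from the lecture's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 3, 0.1
H = np.diag([1.0, 2.0, 3.0])              # Hessian at the minimum (illustrative)
T = 0.5 * rng.standard_normal((d, d, d))  # third-order derivative tensor (illustrative)

def sgd_step(x, zeta):
    # x_{t+1} = (I - eta*H) x_t - eta*zeta_t - eta*T[x_t, x_t]
    Txx = np.einsum('ijk,j,k->i', T, x, x)
    return (np.eye(d) - eta * H) @ x - eta * zeta - eta * Txx

x = np.zeros(d)
for _ in range(1000):
    x = sgd_step(x, rng.standard_normal(d))  # zeta_t: fresh mean-zero noise each step
```

With these small illustrative values the cubic term is a mild perturbation of the linear contraction-plus-noise process, so the iterate stays bounded near the minimum.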
[00:49:15] One process is this OU process, the basic one for the quadratic, and you have this additional term that makes it a little bit more complex. So how do you proceed here? [00:49:46] So in some sense, okay, this is all heuristic. In certain cases you are tempted to just drop the third-order term, because that may be a small term, so let's try to do that. [00:50:07] So suppose we just drop the third order. Then you have this process where x is updated by x_{t+1} = (I − ηH) x_t − η ζ_t. This is the process we have analyzed, and we know that at convergence x_t will be something on the order of √η; here I'm ignoring all the dependencies except the dependence on η. [00:50:45] And now let's look back at the third-order term.
[00:50:48] So when x_t is on this order, what is η T[x_t, x_t]? It is on the order of η², because each x_t contributes √η and this η in front contributes another factor, so basically you have an η² term, which sounds very small. Why is this very small? Well, η² is much, much smaller than, for example, η ζ_t, which is on the order of η. [00:51:21] But that comparison is probably not fair, because ζ_t is doing some random stuff. However, η² is also much, much smaller than even just η H x_t, which is on the order of η^1.5. [00:51:40] So basically the two other changes of your process are these two terms. And this noise term, you could say that comparing with it is a little unfair, because that term is doing some random stuff.
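Plugging in a concrete step size makes the ordering of the three terms explicit (η = 0.01 is just an example value):

```python
eta = 0.01
noise_term       = eta          # eta * zeta_t,     with zeta_t = O(1)
contraction_term = eta ** 1.5   # eta * H x_t,      with x_t = O(sqrt(eta))
third_order_term = eta ** 2     # eta * T[x_t,x_t], with x_t = O(sqrt(eta))

# The third-order contribution is the smallest of the three by a factor sqrt(eta).
assert third_order_term < contraction_term < noise_term
```

As η shrinks, the gap widens: the η² term is a factor √η below the deterministic contraction and a full factor η below the noise.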
[00:52:00] So maybe you shouldn't compare with the absolute value of the noise, just because eventually the noise terms will subtract, there will be some cancellation. But at least if you compare with the other, deterministic term, η H x_t, this η² term is still much smaller than the deterministic term. So in some sense it's very tempting to say that this final term, this η T[x_t, x_t] thing, is very, very small. [00:52:26] So the conclusion would be that this is kind of negligible. [00:52:40] And indeed it's true: this is indeed negligible, and you need this under one condition, when H, the Hessian, is strictly positive definite. That's when you have contraction in all different directions. [00:53:04] However, when H is not strictly positive definite, for example when it is zero in some direction, things change.
[00:53:12] So basically, in other words, if you think about this: the η H x_t term is only on the order of η^1.5 when H is nonzero, right? If H is zero in some direction, then the η H x_t term is just literally zero in that direction, and then the η² term is what is running the show. So basically, if in some direction H is zero, then η H x_t is just zero in that direction, and then the η² term becomes the largest update. [00:54:06] Well, the ζ_t term is always the largest if you really look at the absolute value: η ζ_t is on the order of just η, so it's always the largest. But I'm trying to argue that comparing with η ζ_t is a little bit unfair, in the sense that ζ_t is doing random stuff: in one step it goes in the positive direction, in another step it goes in the negative direction.
[00:54:35] So eventually, basically what happens is: suppose you have a stochastic, mean-zero term, such that one step is on the order of η. Then eventually it builds up to something on the order of √η. That's kind of like what we discussed in the quadratic case: every stochastic step gives you a tiny noise perturbation, and eventually they build up to order √η. [00:55:16] However, when you have a deterministic term, if one step is something like η, then eventually it will build up to something like η t, with t the number of steps, because the steps won't cancel. Maybe I'm not sure this is the best way to explain it.
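This cancellation argument can be sanity-checked numerically; the step size and horizon below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, t = 1e-3, 10_000

# Mean-zero steps of size eta: random signs cancel, the sum stays around eta*sqrt(t).
stochastic_total = np.sum(eta * rng.choice([-1.0, 1.0], size=t))

# Deterministic steps of size eta: no cancellation, the sum grows to eta*t.
deterministic_total = eta * t

# Here eta*sqrt(t) = 0.1 while eta*t = 10: two orders of magnitude apart.
assert abs(stochastic_total) < 1.0 < deterministic_total
```

So a term whose single step is smaller can still dominate in the long run if it never cancels, which is exactly the worry about the deterministic η² third-order contribution in flat directions.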
[00:55:49] So how do I say this? Either way, this is all heuristic, because it requires a little bit more work if you want to formalize it. But what I'm saying is basically this: locally, the largest update is of course the η ζ_t noise, but this one will have cancellation over time, because in the future you can move in different directions. That's why it's probably good to also compare with the deterministic change, which is the η H x_t term. And compared with that, typically the deterministic change is bigger than the η² term from the third-order derivative. But when H is zero in some direction, that is no longer true.
[00:56:44] So you can sometimes prove this: when H is strictly positive, not zero, then the third-order term is negligible, and otherwise it becomes tricky; if H has a completely flat direction, it becomes tricky. [00:57:00] So I think here is a good point, maybe let's just continue with this. When this is the case, in this case the third order will introduce some bias, but very small; small in the sense that as η goes to zero, this becomes negligible. And I think I have some figures here. [00:57:44] Yes, so I have this figure; let me just see whether you can see it. This is a little bit small, I guess, so maybe this way. The function is a one-dimensional function, a convex one, so I'm in the case where H is strictly bigger than 0.
[00:58:06] Because it's one dimension, this is a strictly convex function. So this is the function, but it's not a quadratic; I think it's quadratic on both sides, but not with the same kind of curvature: the left-hand side is more flat and the right-hand side is more sharp. [00:58:26] And if you run the stochastic gradient descent, I guess probably the only thing important is this: if you look at this, this is after you take many steps, and you can see that the iterate is bouncing around. This is the distribution of the iterate, the distribution of x_t, when t is 1024.
[00:59:01] So 1024 is pretty big, it can be considered to be infinity, right? And you can see that it's bouncing around zero; zero is the global minimum, but the mean is no longer zero anymore, because you have the third-order derivative, and the mean is something to the left of zero. [00:59:19] In some sense you prefer the left-hand side a little more than the right-hand side, because it's easier to stay on the left-hand side: the left-hand side is flatter, so it's easier to stay there, because the contraction is weaker. And the right-hand side is sharper: you add some noise, you kind of contract, and you go back to zero more quickly. So that's why the bias is towards the left-hand side, where you have flatter curvature. But the bias is relatively small; you can see that you can even say it is negligible,
because at least, if you take a random point, you're going to get something between maybe −0.05 and 0.05, and the bias is only a very, very small number; your fluctuation is bigger than the bias for sure. [01:00:05] So that's why in the classical optimization settings, people didn't really pay too much attention to this, though there are some papers.
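The flat-left, sharp-right picture is easy to reproduce; the curvatures, step size, and noise scale below are made-up illustrative values, not the lecture's actual experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 4.0          # curvature: flatter on the left (x < 0), sharper on the right
eta, steps, burn_in = 0.1, 200_000, 10_000

def grad(x):
    # Piecewise-quadratic convex loss: f(x) = a*x^2/2 for x < 0, b*x^2/2 for x >= 0.
    return a * x if x < 0 else b * x

x, total = 0.0, 0.0
for t in range(steps):
    x -= eta * (grad(x) + rng.standard_normal())  # SGD step with additive noise
    if t >= burn_in:
        total += x

mean_iterate = total / (steps - burn_in)
# The long-run mean sits to the left of the minimum at 0: a small negative bias,
# much smaller than the typical fluctuation of the iterate itself.
assert -0.3 < mean_iterate < 0.0
```

Swapping the two curvatures (a > b) flips the sign of the bias, which matches the intuition that the iterate prefers the flatter side.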
[01:00:31] I guess there's this paper from 2017, "Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains". This paper characterizes this effect for the convex case, and you can see from the title of the paper that it's talking about constant step size. Why do you have to talk about constant step size? Because if you decay the step size, then this bias effect will be even smaller, and it will be negligible, just completely gone eventually. So to make this even somewhat visible, so that it can make a difference, you have to make the step size not go to zero. That's why in the convex case people think we don't care about this that much. In some other cases, you know, you care about this a little bit; this figure is from one of my recent papers with some students.
[01:01:41] The reason why we cared about this there is because you have multiple machines, and for some other reasons you have to care about it; but typically you wouldn't really care that much about it, just because the bias is small. [01:01:51] Okay, so now let's move on. Finally we are moving to the implicit regularization effect, the more complex case. Am I heading too fast? I'll go through it, I guess. [01:02:35] So these are the cases where both H and Σ are not full rank: your Hessian and your noise covariance are both not full dimensional. And this is not something to be super surprised by; this comes from over-parametrization. I think it's easier to think about the Hessian: if you have a manifold of global minima, then along the direction of the manifold your Hessian will be zero.
so as long as you have a lot of [01:03:22] so so as long as you have a lot of different Global minimum then your high [01:03:24] different Global minimum then your high system will be flat is will be zero in [01:03:28] system will be flat is will be zero in certain directions [01:03:30] so [01:03:32] so um and let's say suppose [01:03:34] um and let's say suppose so possibly let me not discuss when this [01:03:37] so possibly let me not discuss when this can happen exactly because you need some [01:03:39] can happen exactly because you need some calculations uh so forth but suppose [01:03:41] calculations uh so forth but suppose let's say Asian Sigma are both [01:03:45] in some Subspace in a Subspace [01:03:50] K and the Subspace K is low dimensional [01:03:54] K and the Subspace K is low dimensional or this is not full dimensional [01:03:56] or this is not full dimensional and [01:03:58] and um and if the loss is quadratic [01:04:03] for the moment just let's still think [01:04:05] for the moment just let's still think about the laws it's quadratic I guess we [01:04:07] about the laws it's quadratic I guess we have computers [01:04:08] have computers we said that [01:04:12] we said that the iterate will have zero something [01:04:16] the iterate will have zero something like this [01:04:17] like this recall that this is our calculation [01:04:19] recall that this is our calculation Sigma Square D I [01:04:22] Sigma Square D I U transpose [01:04:24] U transpose and so my kind of the picture I think is [01:04:27] and so my kind of the picture I think is that [01:04:29] that uh [01:04:31] uh but [01:04:35] um [01:04:36] um there's no [01:04:38] there's no so so basically you have no noise [01:04:43] so so basically you have no noise and no construction [01:04:46] and no construction nothing [01:04:47] nothing in the in the in the property the [01:04:50] in the in the in the property the perpendicular space of height so in some [01:04:53] perpendicular space of height 
[01:04:55] In some sense, I think the function looks like this. Suppose you have some direction of K, this is the direction of K, and then this is the direction of K^⊥, and suppose your function is quadratic in the direction of K, something like this. I'm not sure whether you can tell, my drawing is too bad, so imagine a valley: I'm drawing a valley like this, but the value is completely oblivious to the K^⊥ dimension. So this thing is the middle of the valley, the bottom of the valley. So basically what happens is that if you start somewhere here, everything happens in the direction of K and nothing happens in the direction of K^⊥. You're basically bouncing along the direction of K: maybe you go here, and here, and do some bouncing, something like this, but you never move in the direction of K^⊥.
[01:06:00] So in K^⊥, you just know nothing, or rather, you don't move at all. But that's still not quite what I would call implicit bias or implicit regularization, because here the implicit regularization comes from what? It comes from the initialization: if you start at this point, then you're going to stay in this part, but if you start here, then you will bounce around here. And this is exactly what happens when you have an over-parametrized linear model: because of the over-parametrization, you never leave the subspace; you can never leave a certain subspace, and the other directions never move. But this is not the most important thing about the noise, because the noise doesn't really do much here; it's really just that you cannot leave a certain subspace.
quadratic when your loss is not [01:06:52] is quadratic when your loss is not quadratic then [01:06:54] quadratic then the third dollar term is is going to [01:06:57] the third dollar term is is going to measure [01:06:59] measure so this is the the main thing I want to [01:07:01] so this is the the main thing I want to kind of like [01:07:03] kind of like Conway today but unfortunately just [01:07:05] Conway today but unfortunately just because this is complicated so I [01:07:07] because this is complicated so I probably wouldn't be able to do all the [01:07:08] probably wouldn't be able to do all the everything rigorously [01:07:10] everything rigorously so I just really can't do everything [01:07:12] so I just really can't do everything regressing so what happens is that [01:07:16] regressing so what happens is that if you the loss is not quadratic the [01:07:18] if you the loss is not quadratic the recall that what happens is you have x t [01:07:20] recall that what happens is you have x t plus one is equal to one minus E to H [01:07:27] minus ETA t x t [01:07:31] minus ETA t x t x t [01:07:33] x t plus High auditor [01:07:36] and this is happening [01:07:42] so this [01:07:44] so this is working [01:07:45] is working [Music] [01:07:48] [Music] in K because the I assume that H is [01:07:51] in K because the I assume that H is working hey and the noise is always in [01:07:54] working hey and the noise is always in pain in the Subspace so this main part [01:07:56] pain in the Subspace so this main part is we're always working hey your boss [01:08:00] is we're always working hey your boss are running head and and this is working [01:08:04] in paper [01:08:06] in paper and that's that makes them kind of [01:08:08] and that's that makes them kind of completely separate so there's nothing [01:08:09] completely separate so there's nothing you can control the the sort of term the [01:08:12] you can control the the sort of term the third term can build up [01:08:13] 
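To make this picture concrete, here is a small numerical sketch (my own toy construction, not from the lecture): a two-dimensional valley L(x1, x2) = ½·c(x2)·x1², where x1 plays the role of the K direction (curved, noisy) and x2 the K-perp direction. When c is constant the loss is exactly quadratic and SGD never moves in x2; when c(x2) = exp(−x2) varies along the valley, the cubic coupling slowly drifts the iterate toward the flatter region.

```python
import math
import random

def run_sgd(steps, eta=0.05, sigma=1.0, flat=False, seed=0):
    """SGD on L(x1, x2) = 0.5 * c(x2) * x1**2 with additive noise in x1 (the K direction).

    flat=True  -> c(x2) = 1          (exactly quadratic: x2 never moves)
    flat=False -> c(x2) = exp(-x2)   (curvature varies along the valley)
    """
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    for _ in range(steps):
        c = 1.0 if flat else math.exp(-x2)
        dc = 0.0 if flat else -math.exp(-x2)             # c'(x2)
        g1 = c * x1                                      # dL/dx1
        g2 = 0.5 * dc * x1 * x1                          # dL/dx2: the third-order coupling
        x1 -= eta * (g1 + sigma * rng.gauss(0.0, 1.0))   # bouncing in K
        x2 -= eta * g2                                   # slow drift in K-perp
    return x1, x2

# Quadratic valley: all the action is in x1, and x2 stays exactly at 0.
_, x2_quad = run_sgd(5000, flat=True)
# Non-quadratic valley: the drift accumulates and x2 moves toward flatter regions.
_, x2_cubic = run_sgd(5000, flat=False)
print(x2_quad, x2_cubic)
```

With these arbitrary constants the drift in x2 is roughly η²σ²/4 per step, so after 5000 steps the non-quadratic run has moved a few units along the valley while the quadratic control has not moved at all: bouncing in K, creeping in K-perp.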
third term can build up for long and long a very long time [01:08:16] for long and long a very long time so [01:08:18] so um so maybe this is the one let me see [01:08:25] so basically let's see so [01:08:31] I probably will go to this figure [01:08:32] I probably will go to this figure multiple times [01:08:36] right so this is what's happening here [01:08:38] right so this is what's happening here so [01:08:40] so I don't think I can [01:08:44] I don't think I can draw anything here [01:08:46] I don't think I can draw anything here so but the [01:08:50] so but the maybe maybe you first watch it and then [01:08:52] maybe maybe you first watch it and then I'm going to go to a static figure so [01:08:54] I'm going to go to a static figure so that you can [01:08:56] that you can uh I can I can annotate [01:08:59] uh I can I can annotate so this is a sarcastic we've been inside [01:09:01] so this is a sarcastic we've been inside on this on this in this Valley [01:09:05] then you can see that it's moving in [01:09:07] then you can see that it's moving in this Valley so now let's look at the [01:09:08] this Valley so now let's look at the static figure I think I have one [01:09:11] static figure I think I have one somewhere [01:09:12] somewhere thank you [01:09:14] thank you so [01:09:16] so in our life you know the mathematical [01:09:18] in our life you know the mathematical language so this direction [01:09:24] let's see a different color [01:09:27] let's see a different color so this is the direction of the k [01:09:30] so this is the direction of the k perk [01:09:32] perk and this is the direction [01:09:35] of K so this direction is K [01:09:39] of K so this direction is K okay and so but here this is not a [01:09:42] okay and so but here this is not a quadratic because it's at least this is [01:09:44] quadratic because it's at least this is another project because you're at least [01:09:47] another project because you're at least the the K perfect direction does 
[01:09:49] It matters to some extent: you can see that in the K-perp direction, if you go from here to here, you go to a flatter and flatter region, right? So what happens is that most of your movement is in the K direction, you are just bouncing around in the K direction, but there is some third-order term that drives you in the K-perp direction, and that can build up eventually, over a long time. Recall that you start from here, you do a lot of bouncing around, but after you bounce for a long enough time you move in the K-perp direction, just because the third-order term has been accumulating for a long time, until you get to the flatter region. [01:10:28] So the main term is doing the bouncing, and the third-order term is accumulating along the direction of the valley. [01:10:42] Any questions so far?
[01:10:45] I'll go back to this figure once more in just a second, with a little bit more math. [01:10:55] [Student question, inaudible] [01:10:59] Yeah, that's a great question. So the question is: if you know this is what's happening, why not just do something more expensive to make it faster? I think there are several things here. This is a good question, but it's not something super new; people have thought about it, and I have thought about it. There are multiple constraints we have to respect. I still think this is a feasible direction to go, but it's not easy, and I don't think there is an existing paper that really achieves this very well. So one of the questions is: how do you get to the valley in the first place? [01:11:52] Getting to the valley, I think, is not too hard, but it's not trivial either,
that's not too hard because you have to use [01:11:55] have to use but not tribute because to go to the [01:11:57] but not tribute because to go to the value you have to [01:11:58] value you have to either Decay already rate or make your [01:12:00] either Decay already rate or make your back size bigger so that you have a [01:12:01] back size bigger so that you have a smaller noise [01:12:02] smaller noise but that requires putting more compute [01:12:06] but that requires putting more compute because you want to be more accurate so [01:12:08] because you want to be more accurate so that you but sometimes you want to be [01:12:10] that you but sometimes you want to be more accurate in the Creator so if you [01:12:11] more accurate in the Creator so if you require small countries [01:12:13] require small countries so so that's one small segment so [01:12:15] so so that's one small segment so whether you you really can afford those [01:12:18] whether you you really can afford those compute to really go to the body in the [01:12:19] compute to really go to the body in the first place I think you can probably in [01:12:22] first place I think you can probably in most cases you can [01:12:24] most cases you can um but it's not like a for free so you [01:12:26] um but it's not like a for free so you do have to consider the cost [01:12:27] do have to consider the cost so and then you go to the valley and [01:12:30] so and then you go to the valley and then you do the you do this right you [01:12:33] then you do the you do this right you move in okay [01:12:35] move in okay but the problem becomes that the real [01:12:37] but the problem becomes that the real picture is not just one single [01:12:41] picture is not just one single it's actually this is only a local View [01:12:44] it's actually this is only a local View so once you go to here maybe [01:12:46] so once you go to here maybe you know if locally it sounds great but [01:12:48] you know if locally it sounds great 
you're heading to a better place, but maybe this function actually has a lot of other parts, so you actually have to travel really far away, somewhere else. Then you have to take this local view again, and do it again, and so on and so forth. So you also have to find a new valley and then find the K-perp direction again, and finding the K-perp direction is also not cheap, because it requires computing the third-order derivative. Computing the third-order derivative on one example is still okay: it costs you a constant factor more than computing the first derivative. This is a very interesting thing about deep learning: computing a higher-order derivative against a vector takes almost the same time as computing the first derivative, using automatic differentiation.
[01:13:40] But you do have to pay a constant factor, sometimes two or three times more, in compute. And you also have to do this for the K-perp direction: to get it exactly, you also have to take a full-batch derivative of the full function, of the full empirical function. If you use a minibatch estimate, then maybe you won't get the K-perp direction very accurately. So there is a bunch of decisions here, which makes it complicated; we don't even know exactly which one is the bottleneck, so it's a little bit tricky. But that's a great question. We have been trying quite hard to do something like this for quite a while already. [01:14:38] Okay, all right, so now let's see. [01:14:50] I'll do a little more math, just to give you a feeling for how we proceed to analyze this.
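As a rough illustration of the constant-factor claim (my own sketch; the lecture presumably has automatic differentiation in mind, but the same accounting shows up with finite differences): the third-order directional term T(x)[v, v] can be obtained from just three gradient evaluations, a central second difference of the gradient along v.

```python
def grad(x):
    """Gradient of the toy loss L(x1, x2) = x1**3 + x1 * x2**2."""
    x1, x2 = x
    return (3.0 * x1 * x1 + x2 * x2, 2.0 * x1 * x2)

def third_directional(x, v, eps=1e-3):
    """T(x)[v, v] via a central second difference of the gradient along v:
    (grad(x + eps*v) - 2*grad(x) + grad(x - eps*v)) / eps**2,
    i.e. three gradient calls: a constant factor over one gradient."""
    xp = tuple(xi + eps * vi for xi, vi in zip(x, v))
    xm = tuple(xi - eps * vi for xi, vi in zip(x, v))
    gp, g0, gm = grad(xp), grad(x), grad(xm)
    return tuple((a - 2.0 * b + c) / eps**2 for a, b, c in zip(gp, g0, gm))

# For this loss, analytically T[v, v] = (6*v1**2 + 2*v2**2, 4*v1*v2).
x, v = (0.7, -1.3), (0.5, 2.0)
approx = third_directional(x, v)
exact = (6 * v[0] ** 2 + 2 * v[1] ** 2, 4 * v[0] * v[1])
print(approx, exact)
```

Autodiff (for instance, differentiating a Hessian-vector product once more) gives the same quantity at a similar small-constant-factor cost. The expensive part in the discussion above is not this per-example cost but that pinning down the K-perp direction accurately needs full-batch quantities.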
[01:14:54] The way to analyze this is that you somehow view it, as I said, as two things. You first define a companion process, which is easier to analyze:

u_{t+1} = (1 − η_t H) u_t − η_t ξ_t.

This is what we did before: this is the case where you are doing optimization on the quadratic approximation, and we have analyzed this already. [01:15:28] Then you characterize the difference between them: define r_t = x_t − u_t. So basically the main question is what r_t is doing, and we can compute the recursion for r_t. [01:15:50] You plug in the definitions of x_{t+1} and u_{t+1}, and you get

r_{t+1} = (1 − η_t H)(x_t − u_t) − η_t T(x_t, x_t) + higher-order terms
        = (1 − η_t H) r_t − η_t T(x_t, x_t) + higher-order terms.
[01:16:18] So r_{t+1} = (1 − η_t H) r_t − η_t T(x_t, x_t), plus higher-order terms. And the interesting thing is that you still have the contraction, and this T term is the bias, the regularization effect, but there is no noise anymore; no stochasticity. Well, there is still a lot of stochasticity inside x_t, but at least you don't have the ξ_t term that we added intentionally; it cancels just because you are taking a difference with the stochastic process. [01:16:48] And you can actually remove the x_t as well, because you can basically claim that this is close to the version where you plug in u_t:

r_{t+1} ≈ (1 − η_t H) r_t − η_t T(u_t, u_t).

This is just because x_t and u_t are somewhat similar. Of course you want to understand the exact difference, but at this level, especially because there is a multiplier η_t here, the remaining differences become even higher-order terms that you can drop.
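Here is a tiny numerical check of this decomposition (my own construction, with made-up constants): run the true process x_t with a cubic term and the quadratic companion u_t with the same noise realization, and verify that r_t = x_t − u_t follows the noise-free recursion r_{t+1} = (1 − ηH) r_t − η T(x_t, x_t) while the K-perp component quietly accumulates.

```python
import random

eta, h = 0.05, 1.0          # step size and curvature in the K (first) direction
c1, c2 = 0.3, 0.2           # toy third-order term T(x, x) = (c1*x1**2, c2*x1**2)
rng = random.Random(1)

x = [0.5, 0.0]              # true process: quadratic part + cubic term + noise
u = [0.5, 0.0]              # companion process: quadratic part + the SAME noise
r = [0.0, 0.0]              # r_t tracked via the noise-free recursion
max_gap = 0.0
for _ in range(2000):
    xi = rng.gauss(0.0, 1.0)
    t1, t2 = c1 * x[0] ** 2, c2 * x[0] ** 2            # T(x_t, x_t)
    # noise-free recursion: r_{t+1} = (1 - eta*H) r_t - eta * T(x_t, x_t)
    r = [(1 - eta * h) * r[0] - eta * t1, r[1] - eta * t2]
    x = [(1 - eta * h) * x[0] - eta * xi - eta * t1, x[1] - eta * t2]
    u = [(1 - eta * h) * u[0] - eta * xi, u[1]]
    max_gap = max(max_gap, abs((x[0] - u[0]) - r[0]), abs((x[1] - u[1]) - r[1]))

print(max_gap)              # the xi terms cancel algebraically in x - u
print(r[1], u[1])           # r accumulates in K-perp; u never leaves K
```

The gap between x − u and the noise-free recursion stays at floating-point level, and the second (K-perp) coordinate of r drifts steadily while u's second coordinate never moves: exactly the separation of "bouncing" and "bias" described above.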
[01:17:29] Now, what happens if you look at this difference in the subspace K, which is the span of H? There, this is still a contraction: you have some additional bias terms, but the biases will be corrected by the contraction eventually. [01:18:00] However, for the K-perp subspace, the contraction is gone: if you project everything onto the K-perp subspace, then H doesn't have any effect anymore, because H has nothing to do with the K-perp direction; you just project it out. So now the picture is really simple. In the K-perp subspace you are basically just taking the previous r_t, projected onto K-perp, plus something new; you are just adding up, without any contraction at all. So if you do this recursively, you get

P_{K⊥} r_t = P_{K⊥} r_0 − Σ_{s<t} η_s · P_{K⊥} T(u_s, u_s).
[01:19:09] So now the question becomes: how do you understand this sum of third-order terms? And by the way, I never told you where the third-order term is going, right? I only claimed that there is a third-order term; I didn't really say where it's going. So now the question is where these third-order terms go on average, in the long run, over time. [01:19:31] We can kind of ignore the projection, which is just a restriction to the subspace, and look at the sum of where the third-order term is going. [01:19:50] So first of all, and this is a heuristic, let's assume that the limit Σ := lim_{k→∞} E[u_k u_k^T] exists, the covariance of u_k as k goes to infinity, and also assume that u_k mixes, like a Markov chain. I'm not sure whether you are familiar with Markov-chain mixing.
just [01:20:25] this Max from gym mixing but you just assume that UK is kind of like doing the [01:20:27] assume that UK is kind of like doing the boss right you is that doing the [01:20:29] boss right you is that doing the bouncing around [01:20:30] bouncing around you assume that it's really just doing [01:20:31] you assume that it's really just doing that it's kind of like a gaussian then [01:20:34] that it's kind of like a gaussian then an ice is the covalents of the gaussian [01:20:36] an ice is the covalents of the gaussian and then this one [01:20:38] and then this one you can rewrite this as you know [01:20:42] you can rewrite this as you know uh T of [01:20:47] in some sense this is like [01:20:50] in some sense this is like it goes to loyalty times roughly because [01:20:53] it goes to loyalty times roughly because Little T times [01:20:54] Little T times that she was the expectation for you [01:20:59] that she was the expectation for you the UK transport or maybe with us [01:21:04] so I guess maybe [01:21:08] so I guess maybe so what I'm doing here is that suppose [01:21:10] so what I'm doing here is that suppose you have some variable U that is drawn [01:21:13] you have some variable U that is drawn from us [01:21:14] from us from gaussian Luis [01:21:16] from gaussian Luis converts as the [01:21:18] converts as the the expectation of T EU [01:21:23] the expectation of T EU is equals to T of [01:21:26] is equals to T of this is [01:21:28] this is expectation [01:21:30] expectation okay let's look at the ice coordinate [01:21:32] okay let's look at the ice coordinate then this is sum of J [01:21:35] then this is sum of J t i j k u j u k and you can export it [01:21:40] t i j k u j u k and you can export it sorry guys okay to this [01:21:44] sorry guys okay to this and then you switch the sum with the [01:21:46] and then you switch the sum with the expectation you have j k t i j k [01:21:52] expectation UJ [01:21:56] you okay [01:21:58] you okay and this 
is sum over j k t i j k [01:22:03] and this is sum over j k t i j k expectation uu transpose we take the JK [01:22:07] expectation uu transpose we take the JK coordinate [01:22:08] coordinate and this is you know if you know this by [01:22:12] and this is you know if you know this by T of [01:22:14] T of expectation uu transpose [01:22:18] expectation uu transpose so so you can also apply the tensor on a [01:22:20] so so you can also apply the tensor on a matrix and the definition is really just [01:22:22] matrix and the definition is really just this [01:22:23] this so the definition of t applied on The [01:22:26] so the definition of t applied on The Matrix is [01:22:28] Matrix is that you have some [01:22:30] that you have some of t i j k [01:22:33] of t i j k thanks Jake [01:22:34] thanks Jake all right so I think this might be a [01:22:37] all right so I think this might be a little bit too [01:22:38] little bit too too much for this course but another way [01:22:41] too much for this course but another way you can basically at the end of the day [01:22:43] you can basically at the end of the day what you have is that you build [01:22:45] what you have is that you build I guess there's a detail here of my body [01:22:47] I guess there's a detail here of my body so this t comes from you have multiple [01:22:49] so this t comes from you have multiple times right you have three steps that's [01:22:51] times right you have three steps that's where you got T and ETA is what you got [01:22:54] where you got T and ETA is what you got from this ETA and this is something like [01:22:57] from this ETA and this is something like you apply the tensor to the average [01:22:59] you apply the tensor to the average covariance the mixed state the mixed [01:23:02] covariance the mixed state the mixed covers uh thing so basically the [01:23:04] covers uh thing so basically the question becomes like what is X [01:23:06] question becomes like what is X if you don't ask then you 
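A quick Monte Carlo sanity check of the identity E[T(u, u)] = T(E[u u^T]) (all numbers here are made up for the check): pick an arbitrary 2×2×2 tensor T and a Gaussian u with covariance Σ = A·Aᵀ, and compare the sampled mean of T(u, u) against the contraction Σ_{j,k} T_{ijk} Σ_{jk}.

```python
import random

# Arbitrary 2x2x2 tensor T[i][j][k] and a Gaussian u = A z with z ~ N(0, I),
# so that Sigma = A A^T.
T = [[[0.5, -1.0], [2.0, 0.3]],
     [[1.5, 0.7], [-0.2, 1.0]]]
A = [[1.0, 0.0], [0.5, 1.0]]
Sigma = [[1.0, 0.5], [0.5, 1.25]]          # A @ A^T

rng = random.Random(7)
n = 100_000
mc = [0.0, 0.0]
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    u = (A[0][0] * z1, A[1][0] * z1 + A[1][1] * z2)    # u = A z
    for i in range(2):                     # accumulate T(u, u)_i
        mc[i] += sum(T[i][j][k] * u[j] * u[k] for j in range(2) for k in range(2))
mc = [s / n for s in mc]

# The tensor applied to the covariance matrix: (T(Sigma))_i = sum_{j,k} T[i][j][k] * Sigma[j][k]
exact = [sum(T[i][j][k] * Sigma[j][k] for j in range(2) for k in range(2))
         for i in range(2)]
print(mc, exact)
```

The Monte Carlo mean agrees with the contraction up to sampling error, which is the whole content of the switch-the-sum-and-expectation step above.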
[01:23:08] If you know Σ, then you know which direction you are going, and you know how far you are going: you go by about t times this direction, because you take t steps. So the final question is what this T(Σ) is, and this part is very informal, not even exactly correct; to fix it you would need something a little more careful. So this bias direction is −T(Σ). Remember what T is: T is the third derivative, and Σ is the covariance of the iterates; Σ is just a matrix for the moment. And you can rewrite this. If you think about what this is,

T(Σ) = ∇_x ⟨∇²L(x), Σ⟩,

the gradient of the inner product of the Hessian with Σ; right, this is the equation. And in some sense you can argue the following, though this is a heuristic argument, and actually not even 100% correct.
[01:24:21] So this −T(Σ) term is trying to make ⟨∇²L(x), Σ⟩ smaller, because you are moving along its negative gradient. Let's define R(x) := ⟨∇²L(x), Σ⟩; then you are moving in the negative ∇R(x) direction, and that's why you can argue that you are trying to make that function smaller. So this bias, this −T(Σ) term, is trying to make R(x) smaller by moving along the negative gradient of R(x). [01:25:06] And eventually, if you work out all of these details, with a lot of other constants and fixes and assumptions, and I'm not going to go through all of this since we are already quite deep into the fine-grained part, you can more or less prove something in certain cases. Let me just write down what you can finally prove.
[01:25:39] So you can prove something like the following, with so-called label noise. I didn't tell you what label noise means; it doesn't matter, it's one particular kind of noise, some additional noise on the labels. SGD with label noise converges to a stationary point of the regularized loss

L̂(θ) + λ·R(θ),   where R(θ) ≈ tr(∇²L̂(θ)),

the trace of the Hessian of the loss. [01:26:43] I guess there is no need to understand every detail here; there are some other subtleties, there are other assumptions, and so on and so forth. I just want to give you a taste of what kind of theorems you may hope to prove. So basically you are saying: this is SGD on the original loss, right, so if you run a certain kind of SGD on the unregularized loss, it will converge to a stationary point of the regularized loss. So that's why you get this regularizer for free.
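Here is a toy experiment in the spirit of that statement (my own construction; the model and all constants are made up, and the actual theorem has more conditions): fit the product θ1·θ2 to the target 1 with Gaussian label noise. Every point with θ1·θ2 = 1 is a global minimum, and on that manifold tr(∇²L) is proportional to θ1² + θ2², so the trace-of-Hessian regularizer prefers the balanced solution |θ1| = |θ2| = 1. Label-noise SGD indeed drifts there from an unbalanced start.

```python
import random

rng = random.Random(42)
eta, sigma = 0.05, 0.3
th1, th2 = 2.0, 0.5          # unbalanced start on the minimum manifold th1*th2 = 1

for _ in range(20_000):
    y = 1.0 + sigma * rng.gauss(0.0, 1.0)   # noisy label around the target 1
    e = th1 * th2 - y                        # residual against the noisy label
    g1, g2 = 2 * e * th2, 2 * e * th1        # gradient of (th1*th2 - y)**2
    th1, th2 = th1 - eta * g1, th2 - eta * g2

print(th1, th2, th1 * th2)
```

The product θ1·θ2 keeps hovering near 1 (the fit stays good), while the imbalance θ1² − θ2² contracts by a factor (1 − 4η²e²) every step, so among all interpolating solutions SGD settles on the one with the smallest θ1² + θ2², i.e. the smallest trace of the Hessian.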
[01:27:14] And what does the regularizer tell us? Here the regularizer is the trace of the Hessian, which is something about flatness, the flatness of the loss landscape, right? The Hessian is the curvature, and the trace of the Hessian is about the flatness at that point. So you are implicitly encouraging flatness of the loss function. [01:27:37] But there are a lot of things hidden here, and actually I think I'm missing a few important assumptions; I'm not writing down some of the important assumptions just because it would take too much time to write them out. But this is the kind of thing we may hope to prove in some cases. [01:28:00] Okay, any questions? [01:28:13] [Student question, partially inaudible] [01:28:19] That's a great point. Just to rephrase the question:
fourth-order derivative, the fourth-order gradient, would influence the implicit bias. [01:28:34] I think on the conceptual level, if your third-order term is not zero, then I think the fourth-order one wouldn't matter that much; and if the third-order term is zero, then indeed the fourth-order term would have an effect. But so far we are assuming that the third-order term is doing something, so our analysis deals with a nonzero third-order term. [01:29:17] [Inaudible question about why, in the last theorem, the result is a stationary point of the regularized loss.] [01:29:35] Oh, I think, yeah, so the question is why the regularizer is the trace of the Hessian. This is because when the regularizer is a second-order term, the second derivative, then the direction you want to move in is the gradient of the regularizer. So when you have a regularizer, what does it really mean? It
means that you should move in the direction of the gradient of the regularizer; that's how they match up. So actually, the direction you really move in depends on the third-order derivative of the loss. [01:30:30] So I guess there are two views. One view is that you look at it at the level of the regularizer; then it's the second-order term, the second-order derivative of the loss. And another view is that in the actual iterate space, the direction of movement is the third-order derivative; it's about the third derivative of the loss. And suppose that in the iterate space the third derivative vanishes; then you have to talk about the fourth order, and in that case the regularizer probably will be about a third-order derivative of the loss, because your regularizer is always one order up
compared to the direction of the movement. [01:31:15] Does that make some sense? [01:31:31] [Inaudible question] [01:31:35] Yeah, so why is a flat stationary point better, right? So whether it's better or not, I think I'm going to talk about that immediately at the beginning of the next section. The answer is that we do believe it is generally better; it kind of relates to the Lipschitzness of the models, but I'll discuss more on Wednesday.
================================================================================ LECTURE 017 ================================================================================
Stanford CS229M - Lecture 18: Unsupervised learning, mixture of Gaussians, moment methods
Source: https://www.youtube.com/watch?v=4xDEsLUkdG4
---
Transcript
[00:00:05] Okay, so I guess let's get started. So today, this lecture, we are going to discuss a few small things that are left over from previous lectures, and then we are going to move on to unsupervised learning.
So I guess the first thing is that, [00:00:32] recall that last time we talked about the implicit regularization of the noise. And we mentioned that in certain cases you can prove that noisy SGD, noisy GD, prefers a smaller value of this quantity R(θ), which is defined to be something like the trace of the Hessian. [00:01:14] And in the first part of this lecture I'm going to spend probably 10 to 15 minutes to briefly discuss why this is a reasonable thing to try to minimize, or to regularize; why the choice of the Hessian is a meaningful quantity. But this part won't be exactly rigorous, because you have to do some approximations and so forth; I'm just going to give a somewhat heuristic derivation to justify why something like the
Hassan would be useful for us to [00:01:45] the Hassan would be useful for us to recognize [00:01:47] recognize so I I guess so the thing is that how do [00:01:50] so I I guess so the thing is that how do we uh what is the heising right what is [00:01:53] we uh what is the heising right what is the Hyacinth and maybe actually I'll [00:01:54] the Hyacinth and maybe actually I'll actually write I will hide this is the [00:01:56] actually write I will hide this is the empirical the housing on the empirical [00:01:58] empirical the housing on the empirical loss so maybe uh for Simplicity let's [00:02:02] loss so maybe uh for Simplicity let's only consider [00:02:04] only consider um let's only consider [00:02:08] one data point [00:02:12] so and let's say suppose f [00:02:15] so and let's say suppose f as the node I have to be f [00:02:18] as the node I have to be f comma Theta this is the I guess maybe I [00:02:22] comma Theta this is the I guess maybe I should say [00:02:24] I have to say like X [00:02:26] I have to say like X um without model output [00:02:31] and let [00:02:35] um [00:02:36] um l f y be the loss function [00:02:45] um then what you can do is that you can [00:02:48] um then what you can do is that you can compute so then L Theta in this case is [00:02:51] compute so then L Theta in this case is just the L of [00:02:52] just the L of you know F Theta X Y right [00:02:56] you know F Theta X Y right so in this case then we can compute what [00:02:58] so in this case then we can compute what the hessing is so the hessing [00:03:01] the hessing is so the hessing maybe let's call this L hat just [00:03:04] maybe let's call this L hat just to be consistent in terms of notation [00:03:06] to be consistent in terms of notation The Hyphen is the gradient of the [00:03:09] The Hyphen is the gradient of the gradient now what's the gradient if you [00:03:11] gradient now what's the gradient if you use chain rule what you're going to get [00:03:13] use chain rule 
what you're going to get is ∂ℓ/∂f times ∂f/∂θ, that is, ∇L̂(θ) = (∂ℓ/∂f) · ∇_θ f_θ(x). So ℓ is a function of f, and it's a very simple function, right, because ℓ is a scalar and f is a scalar; so ∂ℓ/∂f is a scalar, and it multiplies the gradient of f_θ at x. [00:03:42] So now you are taking the gradient of a product of two quantities, one a scalar and the other a gradient, and you can apply the chain rule again. What you get by differentiating the first factor is ∂²ℓ/∂f², the second derivative of ℓ with respect to f, and the chain rule brings in the gradient with respect to θ, times ∇_θ f_θ(x)
transpose. So the first term is (∂²ℓ/∂f²) · ∇_θ f_θ(x) ∇_θ f_θ(x)^T: in some sense, one ∇f comes from differentiating ∂ℓ/∂f, and the other is copied from the original expression. This is something you can verify offline, if you look at all the coordinates and do all the calculation. [00:04:39] Then you can also apply the chain rule to the other factor, and what you get is ∂ℓ/∂f times the second-order derivative of the model, ∇²_θ f_θ(x), a matrix of dimension p by p, if p is the number of parameters. [00:05:00] And ∂²ℓ/∂f² is a scalar, ∂ℓ/∂f is a scalar, and ∇_θ f_θ(x) is a vector, so each term is a p-by-p matrix. So altogether,
∇²L̂(θ) = (∂²ℓ/∂f²) · ∇_θ f_θ(x) ∇_θ f_θ(x)^T + (∂ℓ/∂f) · ∇²_θ f_θ(x).
This is a general formula which is rigorously true. Now suppose the loss function ℓ(f, y)
equals, for example, (1/2)·(y − f)². What is the second-order derivative of this loss function with respect to f? The loss is a quadratic function of f, and the leading term is (1/2)f², so the second derivative with respect to f is 1. So the formula becomes 1 times ∇_θ f_θ(x) ∇_θ f_θ(x)^T; and the first derivative with respect to f is (f − y), times the Hessian of f_θ at x:
∇²L̂(θ) = ∇_θ f_θ(x) ∇_θ f_θ(x)^T + (f_θ(x) − y) · ∇²_θ f_θ(x).
[00:06:08] So what you can see is that the first term is a PSD term; it is PSD because it is the outer product of a vector with itself, and the scalar in front (here 1) is non-negative. And the second term is not necessarily PSD. So the Hessian may not be PSD in general, of course,
right, because you have a non-convex function; but one of the terms is PSD. [00:06:43] And in general, if you have a convex loss function, the first term... so this decomposition is called, I don't know why it's called this, but it's called the Gauss-Newton decomposition. I think it must have something to do with those two famous people at some point. And in general the first term is always PSD for a convex loss function; by loss function I really mean literally the quadratic loss or the cross-entropy loss, which are all convex, right? So in almost all the cases we study, the first term, with ∂²ℓ/∂f² ≥ 0, is PSD.
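This Gauss-Newton decomposition is exact for the squared loss, and it is easy to verify numerically with finite differences. The model f_θ(x) = θ₀·tanh(θ₁·x) below is my own toy choice:

```python
import numpy as np

# Numerical check (toy model, my own choice) of the Gauss-Newton
# decomposition for the squared loss l(f, y) = 0.5*(f - y)^2:
#   Hessian_theta L = grad_f grad_f^T + (f - y) * Hessian_theta f
# with the model f_theta(x) = theta0 * tanh(theta1 * x).

def f(theta, x):
    return theta[0] * np.tanh(theta[1] * x)

def loss(theta, x, y):
    return 0.5 * (f(theta, x) - y) ** 2

def num_grad(fun, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
    return g

def num_hess(fun, theta, eps=1e-4):
    h = np.zeros((len(theta), len(theta)))
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        h[:, i] = (num_grad(fun, theta + e, eps) - num_grad(fun, theta - e, eps)) / (2 * eps)
    return h

theta = np.array([1.3, -0.7]); x, y = 0.9, 0.4
H_loss = num_hess(lambda t: loss(t, x, y), theta)   # Hessian of the loss
gf = num_grad(lambda t: f(t, x), theta)             # gradient of the model
Hf = num_hess(lambda t: f(t, x), theta)             # Hessian of the model
H_gn = np.outer(gf, gf) + (f(theta, x) - y) * Hf    # PSD term + residual term
print(np.max(np.abs(H_loss - H_gn)))                # agree up to finite-diff error
```

The two matrices match to finite-difference precision, confirming that the decomposition is an identity, not an approximation; the approximation only enters later, when the residual term is dropped.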
And in parallel, people have found that the second term is small in most cases; there could be multiple reasons for this. So empirically, [00:07:57] the second term, (f_θ(x) − y) · ∇²_θ f_θ(x), is generally smaller. [00:08:10] And one of the reasons could be that, at least when you are at a global minimum, this term is zero: when θ is a global minimum, [00:08:26] meaning f_θ(x) = y, the global minimum fits the data exactly, this term is literally zero, because f_θ(x) − y = 0. So this could be one reason why empirically the second term is relatively small. Of course this is not always true; you cannot always fit the data at every point, but somehow people have found that the second term is somewhat smaller than the first term. [00:08:57] So if you don't care about very nuanced properties, the first term is a reasonable approximation of the Hessian. Of course, you
know, in certain cases you do care about the nuances: for example, when you care about whether the function is convex or not, even a single negative eigenvalue would make it non-convex, right? So then the second term becomes important. But if you just want a rough characterization, the second term is not that important. [00:09:25] So that's the rough intuition. Now suppose we ignore the second term; this is a big assumption, but suppose we ignore it. [00:09:43] Then we can see what the trace of the Hessian is. You can ignore the second term for whatever reason, for example because it's empirically small, or because you are at a global minimum; but suppose you ignore it. Then the trace of the Hessian, tr(∇²L̂(θ)), is approximately equal to ∂²ℓ/∂f², which is exactly 1 if you have the square loss,
times the trace of ∇_θ f_θ(x) ∇_θ f_θ(x)^T, which is equal to some scalar times the squared L2 norm of the gradient, ‖∇_θ f_θ(x)‖². [00:10:27] So you can see that by minimizing the trace of the Hessian, you are minimizing the L2 norm of the gradient of the model with respect to the parameters. So minimizing the trace of the Hessian is heuristically somewhat similar to minimizing the Lipschitzness of the model with respect to θ. [00:11:05] And why is minimizing the Lipschitzness of the model with respect to θ useful? First of all, it is indeed useful if you just explicitly minimize it; people have found empirically that this is useful. And why is it useful? If you allow some heuristics, you can also say that this is very similar to minimizing the Lipschitzness of the model output with respect to the hidden variables.
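This step can be checked numerically as well. The tiny model below is my own choice, and the data point is chosen so that f_θ(x) = y exactly, so the residual term of the decomposition vanishes and the trace of the loss Hessian should equal the squared gradient norm of the model:

```python
import numpy as np

# Check (toy model of my own choosing): with the squared loss and a parameter
# that fits the point exactly (f_theta(x) = y, so the residual term vanishes),
#   tr(Hessian_theta L) = ||grad_theta f||^2.

def f(theta, x):
    return theta[0] * np.tanh(theta[1] * x)

def num_grad(fun, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
    return g

def num_hess_diag(fun, theta, eps=1e-4):
    d = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        d[i] = (fun(theta + e) - 2 * fun(theta) + fun(theta - e)) / eps ** 2
    return d

theta = np.array([1.0, 0.5]); x = 1.0
y = f(theta, x)                              # choose y so the point interpolates
loss = lambda t: 0.5 * (f(t, x) - y) ** 2

trace_H = num_hess_diag(loss, theta).sum()   # tr(Hessian of the loss)
grad_f = num_grad(lambda t: f(t, x), theta)  # gradient of the model
print(trace_H, grad_f @ grad_f)              # the two quantities agree
```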
I think this is something we discussed probably a few weeks ago, when we talked about the all-layer margin. [00:11:50] So recall that if you have a network which consists of, for example, a bunch of layers, suppose you have a deep network with a lot of weight matrices; then the derivative of the model with respect to the weight matrix of layer i is equal to the derivative of the model with respect to the layer above it, times the layer input: something like ∂f/∂W_i = (∂f/∂h'_{i+1}) · h_{i−1}^T. So this is the so-called, okay, I guess now I remember, it's called the Hebbian rule, [00:12:26] but it's actually just the chain rule, layer by layer, a simple chain rule. In neuroscience this is called the Hebbian rule, but really it's just the chain rule: you want to take the derivative with respect to the parameter, and the parameter comes into play in a way that depends on h'_{i+1} and h_{i−1}; sorry, this
should be h_i, so ∂f/∂W_i = (∂f/∂h'_{i+1}) · h_i^T. [00:12:48] So what is h? Here h'_{i+1} is W_i times h_i; h_i is the i-th layer's activation, and h'_{i+1} is the pre-activation of the next layer. I guess technically I should write the prime so that I can distinguish it from the post-activation, but I guess you get the point. The point is that if you take the derivative with respect to the weight matrix, it is very closely related to the derivative with respect to a hidden variable, and to the norm of the hidden variable h_i. [00:13:31] This means that the Frobenius norm of the gradient with respect to the matrix factorizes: ‖∂f/∂W_i‖_F = ‖∂f/∂h'_{i+1}‖ · ‖h_i‖. So minimizing the Lipschitzness with respect to the parameters is similar to minimizing the Lipschitzness with respect to the hidden variables. [00:13:55] I think this is something we discussed before, when we did the all-layer margin.
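The norm identity is easy to verify numerically. The two-layer network below is my own toy example; since the gradient with respect to a weight matrix is the outer product of the backpropagated signal g and the layer input h, its Frobenius norm factorizes as ‖g‖·‖h‖:

```python
import numpy as np

# Check (toy two-layer net, my own choice): the gradient of the scalar output
# with respect to a weight matrix W is an outer product,
#   d f / d W = g h^T,  with  g = d f / d (W h)  and  h the layer input,
# so its Frobenius norm factorizes: ||df/dW||_F = ||g|| * ||h||.

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3)); v = rng.normal(size=4); h = rng.normal(size=3)

def f(Wmat):
    return v @ np.tanh(Wmat @ h)       # pre-activation W h, output v^T tanh(W h)

# finite-difference gradient of f with respect to each entry of W
eps = 1e-6
G = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        E = np.zeros_like(W); E[j, k] = eps
        G[j, k] = (f(W + E) - f(W - E)) / (2 * eps)

g = v * (1 - np.tanh(W @ h) ** 2)      # backprop through tanh: df/d(pre-activation)
print(np.linalg.norm(G - np.outer(g, h)))                        # ~0: outer product
print(np.linalg.norm(G), np.linalg.norm(g) * np.linalg.norm(h))  # equal norms
```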
So when you talk about the derivative with respect to the hidden variable, this is kind of like the all-layer margin; I guess you're maximizing the all-layer margin, because the all-layer margin is bigger if you have a more Lipschitz model: you have a larger all-layer margin. [00:14:16] So, none of these steps can be made 100% rigorous. Some of the intermediate equations that I've written are exactly true, but I don't think all of these steps can be made completely rigorous. But sometimes this is probably the nature of neural networks, right: you cannot be 100% precise, just because things don't match exactly. But I think the intuition is really just that the Hessian relates to the Lipschitzness of the model with respect to the parameters, and the Lipschitzness of the model with respect to the parameters relates to the Lipschitzness of the model with respect to the hidden variables, which is roughly what the all-layer margin captures. [00:14:57] Any questions? [00:15:25] So this is the first thing
about, [00:15:31] these are the remaining remarks from the last lecture about the implicit regularization of the noise. And there's another thing I want to discuss, which is in some sense my omission: I forgot to provide the proof for one of the theorems that we discussed, I think two weeks ago, about the implicit regularization effect in the classification case. [00:15:54] There, basically at the end of that lecture, we were only able to give the theorem and the basic intuition, but we weren't able to really show the proof. The proof is very simple and short, just one page; I think it's a very nice proof, so I really want to show it to you. So maybe let's discuss that in the next part. [00:16:15] So I will remind you what the theorem was about. So I guess this was
two, I think two lectures ago. [00:16:29] So two lectures ago we showed the following theorem. The theorem was something like this. The context is that we have a linear model, for classification, [00:16:47] and we run gradient flow, that is, gradient descent with an infinitesimal learning rate, and we want to understand the implicit bias of the algorithm in this case. [00:17:01] And the theorem that we had was that gradient flow converges to the direction of the max-margin solution, in the sense that the margin of your iterate converges to the max margin: γ(W_t) → γ̄ as t → ∞. [00:17:38] Here W_t is the iterate of gradient descent, of gradient flow, at time t; γ is the normalized margin; and γ̄ is the max normalized margin. [00:18:13] I think at the end of that lecture I discussed the main intuition.
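The statement can be illustrated with a small simulation (the dataset, learning rate, and step counts below are my own choices, and plain gradient descent stands in for gradient flow): on linearly separable data, the normalized margin min_i y_i⟨w, x_i⟩/‖w‖ of the logistic-loss iterate improves over training toward the max margin.

```python
import numpy as np

# Toy illustration (my own setup) of the implicit bias of gradient descent on
# the logistic loss with linearly separable data: ||w|| grows without bound,
# while the normalized margin  min_i y_i <w, x_i> / ||w||  increases toward
# the max-margin value.

X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, -1.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])        # linearly separable labels

def norm_margin(w):
    return np.min(y * (X @ w)) / np.linalg.norm(w)

w = np.array([0.1, 0.0])
margins = []
for t in range(20_000):
    m = y * (X @ w)
    # gradient of sum_i log(1 + exp(-y_i <w, x_i>))
    grad = -(X * (y / (1.0 + np.exp(m)))[:, None]).sum(axis=0)
    w -= 0.1 * grad
    if t in (100, 19_999):
        margins.append(norm_margin(w))

print(margins)    # normalized margin early vs. late in training
```

The late-training margin exceeds the early one, even though nothing in the update explicitly maximizes the margin; that is the implicit bias the theorem describes.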
can do an approximation and in certain cases the cross entropy [00:18:24] and in certain cases the cross entropy laws is a approximation [00:18:27] laws is a approximation um [00:18:28] um of the max margin so the main intuition [00:18:33] if you recall that lecture I I was just [00:18:36] if you recall that lecture I I was just very very briefly on summarize this so [00:18:39] very very briefly on summarize this so the main question is that if you do a [00:18:41] the main question is that if you do a bunch of kind of heuristical calculation [00:18:42] bunch of kind of heuristical calculation you can find out that the log of the [00:18:45] you can find out that the log of the laws is approximately [00:18:48] laws is approximately equals to [00:18:51] um [00:18:53] let's see [00:18:55] let's see so log of laws is approximately equals [00:18:58] so log of laws is approximately equals to [00:19:03] um [00:19:04] um minus [00:19:06] minus times the norm of w [00:19:08] times the norm of w [Music] [00:19:09] [Music] times the gamma W so basically [00:19:12] times the gamma W so basically minimizing a loss is kind of like either [00:19:14] minimizing a loss is kind of like either you want to make the norm of w bigger or [00:19:16] you want to make the norm of w bigger or you want to make the margin bigger [00:19:19] you want to make the margin bigger so we did this very heuristic heuristic [00:19:22] so we did this very heuristic heuristic uh simplification [00:19:25] uh simplification give you this [00:19:26] give you this so that's why if you want to minimize [00:19:27] so that's why if you want to minimize the loss in some sense you are either [00:19:29] the loss in some sense you are either trying trying to make the norm bigger or [00:19:31] trying trying to make the norm bigger or you're trying to make the margin bigger [00:19:33] you're trying to make the margin bigger and it turns out that you can actually [00:19:34] and it turns out that you can actually control 
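The heuristic -log L̂(W) ≈ ‖W‖ · γ(W) is easy to check numerically. Below is a minimal sketch, assuming made-up random separable data (none of it from the lecture): it scales a separating direction by a factor c and watches -log L̂(W)/‖W‖ approach the normalized margin as c grows, because the sum over examples is dominated by the worst-margin term.

```python
import numpy as np

# Sketch only: random toy data; the scaling factors c are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w = rng.normal(size=5)
y = np.sign(X @ w)                      # labels chosen so w separates the data

def gamma(w):
    """Normalized margin: min_i y_i w^T x_i / ||w||."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

ratios = []
for c in [1.0, 10.0, 100.0]:
    W = c * w
    neg_log_loss = -np.log(np.sum(np.exp(-y * (X @ W))))   # -log L(W)
    ratios.append(neg_log_loss / np.linalg.norm(W))

# -log L(W)/||W|| is always <= gamma(W), and the gap shrinks as c grows:
# this is exactly the log-sum-exp ~ max phenomenon at large scale.
print(ratios, gamma(w))
```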
[00:19:33] And it turns out you can actually control both of these two forces, these two tendencies: it's actually true that the norm is growing and the margin is also growing, both of them trying to be big, and you can show that the norm grows to infinity while the margin grows to the max margin. So that's the thing we're going to prove in this theorem. [00:19:58] Any questions so far? [00:20:03] And one of the key things we discussed at that point, one of the key techniques, is that log-sum-exp is basically the same as max if its input has a large scale. [00:20:20] So today I'm going to provide a formal proof for this theorem, which in my opinion is actually very elegant and simple. [00:20:28] And we only prove it for the case where the loss function is the exponential loss, ℓ(t) = exp(-t). Recall that in that lecture we also discussed that the logistic loss, even though it's called the logistic loss, is actually very close to the exponential loss. [00:20:51] So we only deal with the exponential loss, which is almost the same as the logistic loss. The main feature is that as t goes to infinity the loss goes to zero; here t is supposed to be the margin, so when the margin is very, very big, your loss is very small. [00:21:09] And the idea is that we can consider the smooth margin. [00:21:17] I think in the last lecture we defined the smooth margin; let me find it. [00:21:40] Okay, so the smooth margin is defined to be the following: minus the log of the empirical loss, divided by the norm
of W:

    γ̃(W) = -log L̂(W) / ‖W‖.

[00:22:05] So recall that we have established this approximation, -log L̂(W) ≈ ‖W‖ · γ(W), last time during the intuition, and that actually motivates the use of the smooth margin: you can see that the smooth margin is basically supposed to be approximately equal to the margin γ(W), if that approximation is true. But it's not exactly equal, just because the relation is only approximate; that's why we work with this smoother version, which is in some sense almost the same as the margin, but closer to the loss function we actually have. [00:22:42] And if you work with the smooth margin, you can show (I guess we proved this in the last lecture) that the margin is actually bigger than the smooth margin. [00:23:00] So guys, maybe let's just write out exactly what this is:

    γ̃(W) = -log ( Σ_i exp(-y_i Wᵀx_i) ) / ‖W‖.

[00:23:15] And you can show that the margin is larger than the smooth margin,

    γ(W) ≥ γ̃(W).

That's because you can lower-bound the sum by its largest term, the one attaining the minimum of y_i Wᵀx_i: since y_i Wᵀx_i ≥ γ(W) ‖W‖ for every i, we have Σ_i exp(-y_i Wᵀx_i) ≥ exp(-γ(W) ‖W‖), and taking minus the log flips the inequality. [00:23:42] So the smooth margin is supposed to be something close to the margin, but smaller. That's why it suffices to show that the smooth margin γ̃(W_t) converges to γ̄ as t goes to infinity. [00:24:04] This is because you have this sandwich thing: γ̃(W) ≤ γ(W) ≤ γ̄ always, since there's no way for γ(W) to go beyond the max margin γ̄. So the margin is sandwiched in between, and if the smooth margin converges to γ̄, then γ(W_t) has to converge to γ̄ as well.
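The inequality γ̃(W) ≤ γ(W) is a one-liner to verify numerically. A minimal sketch, assuming toy random data and my own variable names, computing the log-sum-exp stably with the usual max trick:

```python
import numpy as np

# Sketch only: random separable toy data, not from the lecture.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
w_true = rng.normal(size=4)
y = np.sign(X @ w_true)

def gamma(w):
    """Normalized margin min_i y_i w^T x_i / ||w||."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

def gamma_smooth(w):
    """Smooth margin -log(sum_i exp(-y_i w^T x_i)) / ||w||, via the max trick."""
    m = -(y * (X @ w))
    a = np.max(m)
    return -(a + np.log(np.sum(np.exp(m - a)))) / np.linalg.norm(w)

for _ in range(100):
    w = rng.normal(size=4)
    # the sum is at least its largest term, so the smooth margin is smaller
    assert gamma_smooth(w) <= gamma(w) + 1e-9
print("gamma_smooth <= gamma held for 100 random w")
```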
[00:24:30] Okay, so now basically this is what we're going to do: we're going to prove that even the smaller value, the smooth margin, converges to γ̄, and then the larger value will also converge to γ̄. [00:24:42] And the proof is actually also pretty simple. We basically show that gradient flow increases this quantity, -log L̂(W_t), intuitively because it decreases L̂(W_t). [00:25:16] So let's do this formally. The statement itself, that gradient flow increases minus the log loss, is almost obvious, because the loss itself is going to decrease; but how much it increases requires some mathematical derivation. [00:25:34] So concretely, recall that the change in W is

    Ẇ_t = -∇L̂(W_t),

which is the definition of gradient flow. [00:25:50] Then what you have is that the derivative with respect to t of minus the log loss is computed by the chain rule: you first look at how the loss depends on W, and then at how W changes. [00:26:26] Using the chain rule for the log, and plugging in that Ẇ_t is really minus the gradient of the loss function (so the signs cancel), you get

    d/dt [ -log L̂(W_t) ] = -⟨∇L̂(W_t), Ẇ_t⟩ / L̂(W_t) = ‖∇L̂(W_t)‖² / L̂(W_t) ≥ 0.
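This chain-rule identity can be sanity-checked numerically, with one tiny Euler step standing in for the infinitesimal gradient-flow step. A minimal sketch, assuming toy data and an illustrative step size h (both my own choices):

```python
import numpy as np

# Toy data and the Euler step size h are illustrative assumptions.
rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
y = np.sign(X @ rng.normal(size=3))

def loss(w):
    # L(w) = sum_i exp(-y_i w^T x_i), the exponential loss
    return np.sum(np.exp(-y * (X @ w)))

def grad(w):
    # gradient of the exponential loss
    e = np.exp(-y * (X @ w))
    return -(y * e) @ X

w = 0.3 * rng.normal(size=3)
h = 1e-7                                  # one tiny Euler step of gradient flow
w_next = w - h * grad(w)
lhs = (-np.log(loss(w_next)) + np.log(loss(w))) / h   # finite difference of -log L
rhs = np.linalg.norm(grad(w)) ** 2 / loss(w)          # ||grad L||^2 / L
print(lhs, rhs)                           # the two should nearly coincide
```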
[00:27:08] So this shows that minus the log loss is going to increase as t goes to infinity. But the important thing is how fast it increases; that quantity is something we're going to use. That the whole thing is increasing is not surprising, because the loss is decreasing, but we also want to know how fast it is increasing. [00:27:29] And by the way, because we're going to compare it with the norm, it's useful to also write this as

    d/dt [ -log L̂(W_t) ] = ‖Ẇ_t‖² / L̂(W_t),

just because the norm of ∇L̂(W_t) is equal to the norm of Ẇ_t. [00:27:45] Okay, so now with this, what we can do is control what eventually happens to the log loss after time t: what you get is that

    -log L̂(W_t) = -log L̂(W_0) + ∫₀ᵗ d/ds [ -log L̂(W_s) ] ds,

and using the equation above,
you get that

    -log L̂(W_t) = -log L̂(W_0) + ∫₀ᵗ ‖Ẇ_s‖² / L̂(W_s) ds.

[00:28:45] Okay, so we basically now know how large the log loss is. Recall that what we care about is this quantity, the smooth margin -log L̂(W_t) / ‖W_t‖, and how it goes to γ̄ as t goes to infinity. [00:29:07] So we have already dealt with the numerator; we know how it changes. The other thing is that we have to try to understand the denominator: you have to normalize by the norm of W. [00:29:33] So basically the next thing is that we're going to deal with this term and compare it with the norm of W as the normalizer.
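The integral identity above can itself be checked by discretizing gradient flow with small Euler steps and accumulating the integrand along the trajectory. A minimal sketch, assuming made-up data and an illustrative step size and horizon:

```python
import numpy as np

# Toy data; the Euler step h and the horizon are illustrative assumptions.
rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
y = np.sign(X @ rng.normal(size=3))

def loss(w):
    return np.sum(np.exp(-y * (X @ w)))   # L(w) = sum_i exp(-y_i w^T x_i)

def grad(w):
    e = np.exp(-y * (X @ w))
    return -(y * e) @ X

w = 0.1 * rng.normal(size=3)
h = 1e-4
start = -np.log(loss(w))
integral = 0.0
for _ in range(10000):                    # integrate from t = 0 to t = 1
    g = grad(w)
    integral += h * np.linalg.norm(g) ** 2 / loss(w)   # accumulate ||Wdot||^2 / L dt
    w -= h * g                            # Euler step of dW/dt = -grad L(W)
end = -np.log(loss(w))
print(end - start, integral)              # the two sides of the identity
```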
[00:29:39] So what you do is that you look at ‖Ẇ_t‖. This is at least the inner product with W*:

    ‖Ẇ_t‖ ≥ ⟨Ẇ_t, W*⟩,

where, recall, W* is the direction of the max-margin solution. [00:30:05] This is just by Cauchy-Schwarz: the inner product of two vectors is at most the norm of one vector times the norm of the other, and the norm of W* is assumed to be one. [00:30:16] So then we plug in the definition of Ẇ_t, which is -∇L̂(W_t), and then we plug in the derivation of ∇L̂ for the exponential loss, so this equals

    ⟨-∇L̂(W_t), W*⟩ = Σ_i exp(-y_i W_tᵀx_i) · y_i · W*ᵀx_i

(I guess there's no minus here, because it cancels with the minus in front of the gradient; each summand is a vector x_i times two scalars, so you can just take the inner product with W* and multiply by the scalars). [00:31:38] And we can see that y_i W*ᵀx_i is the margin of the data point under the max-margin solution: because W* is the max-margin direction, every data point has margin at least the max margin,

    y_i W*ᵀx_i ≥ γ̄   for every i,

just because γ̄ is the margin of W*. [00:32:28] So this whole thing is larger than γ̄ times the sum, which is equal to γ̄ times the loss:

    ⟨-∇L̂(W_t), W*⟩ ≥ γ̄ · Σ_i exp(-y_i W_tᵀx_i) = γ̄ · L̂(W_t).        (1)

[00:32:34] So with this, we can proceed by using this term to further lower-bound how fast things grow.
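Inequality (1) is the crux, and it can be checked numerically. Computing the true max-margin direction W* would need an SVM solver, so as a stand-in the sketch below (entirely my own toy construction) uses an arbitrary separating unit direction u together with its own margin γ_u; the identical one-line argument gives ⟨-∇L̂(W), u⟩ ≥ γ_u · L̂(W).

```python
import numpy as np

# Entirely a toy construction: u is an arbitrary separating unit direction
# standing in for W*, and gamma_u is its margin (playing the role of gamma-bar).
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
u = rng.normal(size=4)
u /= np.linalg.norm(u)
y = np.sign(X @ u)                        # u separates the data by construction
gamma_u = np.min(y * (X @ u))             # margin of the direction u (> 0)

def loss_and_grad(w):
    e = np.exp(-y * (X @ w))              # per-example exponential losses
    return np.sum(e), -(y * e) @ X

for _ in range(50):
    w = rng.normal(size=4) * rng.uniform(0.1, 5.0)
    L, g = loss_and_grad(w)
    # every example has margin >= gamma_u against u, and the weights e_i are >= 0,
    # so <-grad L(w), u> = sum_i e_i * y_i u^T x_i >= gamma_u * sum_i e_i
    assert -g @ u >= gamma_u * L * (1 - 1e-9)
print("inequality (1) held for 50 random w")
```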
[00:32:50] So with this you get that -log L̂(W_t) is at least -log L̂(W_0) plus a term we can now lower-bound. [00:33:05] But maybe just one moment before I use this: let me try to interpret what it is really doing. [00:33:28] So, as a remark, what this is really saying is that Ẇ_t is correlated with W*. We are showing that ⟨Ẇ_t, W*⟩ is at least a non-negative quantity, and the whole thing depends on γ̄ and the loss. [00:34:07] And because Ẇ_t correlates with W*, it means that Ẇ_t itself cannot be too small. So this is another thing we got: ‖Ẇ_t‖ is not too small, at least compared to the loss. [00:34:31] So what this is saying is that if the loss is not too small, then you have to make some changes in your W; and if you have to make some changes in your W, then you have to make some changes in -log L̂(W_t). Basically, if the loss is not small, then minus the log of the loss needs to increase. [00:34:53] It's a little counterintuitive in some sense. [00:34:57] So what we do next is control this additional term, the one circled here:

    ∫₀ᵗ ‖Ẇ_s‖² / L̂(W_s) ds.

[00:35:12] Using the equation we got: there is a power of two here, so you can apply equation (1) to one of the two occurrences of ‖Ẇ_s‖; you're left with the other one, the L̂ from (1) cancels with the denominator, and γ̄ is pulled out in front. So we get

    ∫₀ᵗ ‖Ẇ_s‖² / L̂(W_s) ds ≥ γ̄ ∫₀ᵗ ‖Ẇ_s‖ ds.

[00:35:56] And then you can use a triangle inequality, replacing the integral of the norm by the norm of the integral:

    ∫₀ᵗ ‖Ẇ_s‖ ds ≥ ‖∫₀ᵗ Ẇ_s ds‖ = ‖W_t - W_0‖.

So, up to the initial point, you get γ̄ times the norm of W_t:

    -log L̂(W_t) ≥ -log L̂(W_0) + γ̄ ‖W_t - W_0‖.

[00:36:21] So guys, next you're going to see why we kept all of this. We care about this because now you can control how fast -log L̂(W_t), the log loss, is improving, compared to how fast the norm of W_t is growing. [00:36:47] And that's what we care about, because fundamentally we care about the ratio between them; it's the definition of the smooth margin. [00:36:57] So this means that the ratio

    γ̃(W_t) = -log L̂(W_t) / ‖W_t‖ ≥ ( -log L̂(W_0) - γ̄ ‖W_0‖ ) / ‖W_t‖ + γ̄

is getting closer to γ̄: the numerator of the first term is a constant, and the whole first term is something that becomes closer to zero as t
goes to infinity. [00:37:16] So ‖W_t‖ goes to infinity as t goes to infinity, and that's why this term, the constant divided by ‖W_t‖, converges to zero as t goes to infinity. [00:37:36] So if you take the limit as t goes to infinity, recall that this ratio is the smooth margin: it's converging to γ̄. In other words,

    lim_{t→∞} γ̃(W_t) = γ̄.

[00:38:02] Strictly speaking, from the lower bound you only get one direction, liminf γ̃(W_t) ≥ γ̄, and then you use the other direction: we also know that γ̄ is at least the margin γ(W) of any W, because γ̄ is the max margin, and the margin is in turn at least the smooth margin γ̃(W). Putting these together, the limit is actually exactly equal to γ̄. [00:38:30] So we're done. [00:38:36] Any questions? [00:38:47] Okay, so I guess with this we have basically concluded our section about implicit regularization.
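The whole theorem can be watched in a small end-to-end simulation: plain gradient descent on the exponential loss for separable toy data, tracking both the smooth margin and the normalized margin. Everything below (data, planted direction, step size, iteration count) is an illustrative assumption, not from the lecture; the sandwich γ̃(W_t) ≤ γ(W_t) should hold at every checkpoint and both quantities should drift upward toward the max margin.

```python
import numpy as np

# Sketch only: separable 2D toy data with a planted direction.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = np.sign(X @ np.array([1.0, 0.5]))    # labels from the planted direction

def margins(w):
    return y * (X @ w)

def gamma(w):                             # normalized margin
    return np.min(margins(w)) / np.linalg.norm(w)

def gamma_smooth(w):                      # smooth margin, stable log-sum-exp
    m = -margins(w)
    a = np.max(m)
    return -(a + np.log(np.sum(np.exp(m - a)))) / np.linalg.norm(w)

w = 0.1 * rng.normal(size=2)
lr = 0.01                                 # illustrative step size
history = []
for t in range(20000):
    e = np.exp(-margins(w))
    w -= lr * (-(y * e) @ X)              # gradient step on sum_i exp(-y_i w^T x_i)
    if t % 5000 == 4999:
        history.append((gamma_smooth(w), gamma(w)))
for gs, g in history:
    print(f"smooth margin {gs:.4f} <= margin {g:.4f}")
```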
[00:38:57] I guess just to very quickly wrap up: so this is the end of the section about implicit regularization, and we have talked about a bunch of things. We talked about initialization: small initialization prefers certain kinds of solutions, typically small-norm solutions. [00:39:28] And in one of the cases we also showed that you can have an interpolation between small and large initialization, and in that case you can show the implicit bias for any initialization. [00:39:40] We also talked about the classification problem, where you get the max-margin solution, [00:39:52] and we talked a lot about the stochasticity, the noise. [00:39:57] So in all these cases you have something about your optimizer that is only designed for optimizing faster in some sense, but somehow, as a side effect, you get an implicit regularization effect. [00:40:51] Okay, so if there are no questions, let me move on to the final part of this course, which is more about unsupervised learning and so on and so forth. [00:41:05] So in this lecture and the next two, basically the next 2.5 lectures, we're going to talk about unsupervised learning. There are not that many theoretical works about unsupervised learning; of course there are a lot of very amazing empirical works these days, but not many theoretical ones. So what I'm going to do is start with a somewhat classical approach: for this lecture and maybe a good portion of the next lecture, I'm going to talk about the classical approach, I mean the classical theoretical approach.
theoretical approach so there are many many approaches you [00:41:52] so there are many many approaches you know before for example the most [00:41:54] know before for example the most empirically probably [00:41:57] empirically probably um before deep learning the best [00:41:58] um before deep learning the best empirical approach would be probably [00:42:00] empirical approach would be probably like you do later variable models with [00:42:03] like you do later variable models with um em expectation maximization but I'm [00:42:07] um em expectation maximization but I'm going to for those kind of eon [00:42:09] going to for those kind of eon algorithms there are very little [00:42:10] algorithms there are very little theoretical analysis and even there is [00:42:12] theoretical analysis and even there is analysis is kind of like a special case [00:42:13] analysis is kind of like a special case and it's not clear whether they can be [00:42:15] and it's not clear whether they can be extended to [00:42:17] extended to um [00:42:17] um or a complex case so what I'm going to [00:42:19] or a complex case so what I'm going to talk about is a different line of [00:42:20] talk about is a different line of research which used the so-called moment [00:42:22] research which used the so-called moment method [00:42:24] method so so this kind of methods you know [00:42:26] so so this kind of methods you know don't necessarily work very well for [00:42:29] don't necessarily work very well for uh uh empirically but they have very [00:42:33] uh uh empirically but they have very good um uh you can analyze them in a [00:42:37] good um uh you can analyze them in a very clean way and then this kind of [00:42:39] very clean way and then this kind of mathematical techniques are also useful [00:42:41] mathematical techniques are also useful for many other cases so I think it's [00:42:42] for many other cases so I think it's worth spending one lecture to talk about [00:42:45] worth 
spending one lecture to talk about this uh this approach and it used to be [00:42:47] this uh this approach and it used to be the case that actually around probably [00:42:49] the case that actually around probably 2012 2013 at that point I think the [00:42:53] 2012 2013 at that point I think the community the theoretic Community [00:42:54] community the theoretic Community thought that this might be the new thing [00:42:56] thought that this might be the new thing it could be the new thing that you can [00:42:59] it could be the new thing that you can both analyze and empirically uh work [00:43:02] both analyze and empirically uh work um it turns out that the analysis part [00:43:05] um it turns out that the analysis part you know developed got developed very [00:43:07] you know developed got developed very well but the empirical part you know [00:43:09] well but the empirical part you know it's doing okay but not kind of like [00:43:11] it's doing okay but not kind of like good enough to replace the EM algorithms [00:43:13] good enough to replace the EM algorithms at least not enough to replace them [00:43:15] at least not enough to replace them completely [00:43:17] completely so [00:43:18] so um so so and then I'm going to talk [00:43:20] um so so and then I'm going to talk about some of the more modern [00:43:22] about some of the more modern work [00:43:24] work uh with with deep learning [00:43:27] uh with with deep learning with deep learning like for example [00:43:28] with deep learning like for example self-tuning [00:43:31] self-tuning or contrastive learning [00:43:34] or contrastive learning so these are basically analysis in the [00:43:36] so these are basically analysis in the last [00:43:38] last wow two years about some of the new [00:43:40] wow two years about some of the new algorithms in deep learning [00:43:43] algorithms in deep learning um [00:43:44] um um so I'm going to spend probably the [00:43:46] um so I'm going to spend probably the 
last lecture and the last 1.5 lectures [00:43:48] last lecture and the last 1.5 lectures on this [00:43:49] on this okay so that's the plan for the next 2.5 [00:43:51] okay so that's the plan for the next 2.5 lecture and [00:43:54] lecture and so today I'm going to talk about a [00:43:57] so today I'm going to talk about a classical approach right so [00:44:00] classical approach right so um [00:44:01] um and by the way another kind of General [00:44:03] and by the way another kind of General comment is that in my opinion this [00:44:05] comment is that in my opinion this answering seems to be the core for many [00:44:07] answering seems to be the core for many things right so this also relates to [00:44:11] things right so this also relates to for example semi surprise learning where [00:44:13] for example semi surprise learning where you have some unlabeled data together [00:44:15] you have some unlabeled data together with label data [00:44:17] with label data and this also relates to unsurprised [00:44:19] and this also relates to unsurprised domain adaptation [00:44:21] domain adaptation and my personal opinion is that all of [00:44:24] and my personal opinion is that all of these questions like what really you [00:44:26] these questions like what really you care about is really [00:44:28] care about is really like in both of these questions what you [00:44:30] like in both of these questions what you really care about is how do you leverage [00:44:32] really care about is how do you leverage unlabeled data so in some sense they all [00:44:34] unlabeled data so in some sense they all reduces to under an expression in my in [00:44:37] reduces to under an expression in my in my opinion [00:44:39] my opinion um [00:44:40] um okay so now let's get into uh something [00:44:43] okay so now let's get into uh something more concrete [00:44:44] more concrete so [00:44:46] so um [00:44:46] um so let's say let's have some setup so [00:44:50] so let's say let's have some 
So here is the setup: latent variable models. We are interested in latent variable models, especially in the classical approach. The formulation is that you have a distribution P_θ, a parametric family with parameter θ. There are many different ways to parameterize, and I'm going to introduce a few of them, but every parameter θ determines a distribution P_θ. Then you are given unlabeled examples, there are no labels anywhere: you're given examples x_1, ..., x_n, sampled i.i.d. from this distribution P_θ, and your goal is to recover θ from the data. That's the formulation. And P_θ can be described, or typically is described, by a latent variable model, so every θ describes a generative model in some sense.

Let me give some examples; I assume you roughly know what a latent variable model is from CS229. For example, mixture of Gaussians, probably one of the most studied distributions in machine learning. In the most general form, the parameters θ describe a bunch of things. Let me write it down: you have a bunch of vectors μ_1, ..., μ_k, where each μ_i is in dimension d and is the mean of the i-th component, and a bunch of probability numbers p_1, ..., p_k, which form a probability vector in the simplex, call it Δ_k: the set of vectors in dimension k that are non-negative and whose entries sum to one. So (p_1, ..., p_k) is a probability vector over k items.

Given these parameters, what's the model, how do you generate data? It's a mixture of Gaussians, so intuitively you want to model the case where you have several clusters of data. (In the data you don't see the colors; you just see the raw inputs. The colors are just to indicate which Gaussian each point comes from.) Mathematically, you sample x from P_θ by first sampling a cluster ID i from the categorical distribution defined by p, so i takes values from 1 to k, and then, given the cluster ID, you sample from a Gaussian with mean μ_i and covariance, let's say, identity. Actually the covariance can also be a parameter you want to learn, but here for simplicity I just assume all the Gaussians have the same identity covariance, just to make everything easier. So this is a latent variable model where i is the latent variable: it's something you don't observe in your data, you only observe x, but given the latent variable you can generate the data. So basically there are two separate parts: you first generate the latent variable, and then you generate the data. There are many other models of this type, which I'm mostly going to define when I use them: the hidden Markov model, which you've probably seen if you've taken some NLP class, or ICA, independent component analysis, which I think is also covered in CS229, and many, many other kinds of latent variable models. So this is the family of questions we're going to study.

Now let's talk about the approach. Maybe before that, any questions? Okay, so the approach we're going to study is the so-called moment method, which is actually pretty powerful. There are some drawbacks which make it less appealing empirically, but setting those aside, the approach itself is actually pretty powerful. About the history: the very original proposal of the moment method probably dates back to the 19th century, by statisticians, and then some economists even got the Nobel Prize by generalizing these moment methods, to understand data from economics, into something like what we are discussing right now. So the original source is definitely not machine learning, but people use this for machine learning these days.

So let's see how this works. I'm going to walk you through this kind of method by showing examples. Let's do the first example: a mixture of two Gaussians, so k is two.
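Before specializing to two components, the two-step generative process just described (sample a cluster ID from the categorical distribution p, then sample from the corresponding Gaussian) can be sketched in a few lines. This is my own illustration, not code from the lecture; the function name `sample_gmm` is made up.

```python
import numpy as np

def sample_gmm(mus, p, n, seed=None):
    """Draw n samples from a mixture of Gaussians with means mus[i],
    identity covariance, and mixing probabilities p (the lecture's model)."""
    rng = np.random.default_rng(seed)
    k, d = mus.shape
    # Step 1: sample the latent cluster IDs i ~ Categorical(p).
    ids = rng.choice(k, size=n, p=p)
    # Step 2: given the ID, sample x ~ N(mu_i, I).
    x = mus[ids] + rng.standard_normal((n, d))
    return x, ids

# Example: three well-separated clusters in 2D.
mus = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 5.0]])
x, ids = sample_gmm(mus, p=[0.5, 0.3, 0.2], n=1000, seed=0)
```

Note that a learner only ever sees `x`; the latent `ids` are returned here purely so we can inspect the generative process.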
Let's also assume p_1 and p_2 are both one half, so these two Gaussians have the same mixing probability, the same marginal weight. And without loss of generality we can assume the average of the two means is zero, so basically they are symmetric around the origin. This is in some sense truly without loss of generality, because which point you choose as the origin doesn't really matter. So we can write μ_1 = μ and μ_2 = -μ, so basically we want to learn one parameter vector, μ. The data comes from this mixture of two Gaussians, where one Gaussian has mean μ and identity covariance, and the other Gaussian has mean -μ and identity covariance.

Now, the moment method. The general approach for the moment method is the following: first, you estimate moments of x using empirical samples (I'm going to define what exactly a moment means, depending on what background you have), and then you recover the parameters from the moment estimates. By moments we really mean something like this: the first moment, call it M1, is the expectation of x, the average of the data. Okay, so let's try to do this for our particular example. The first moment is E[x], and there are two cases: one case is that the latent variable is 1, and the other is that the latent variable is 2, so you can look at the expectation of x under both of the two Gaussians. With half the chance x comes from the first Gaussian (that's the case when i is 1), whose mean is μ, so by definition you get (1/2)·μ; with half the chance it comes from the second Gaussian, whose mean is -μ, so you get (1/2)·(-μ); and the sum is zero. So this means there is no information about μ in the first moment. Not so good: our plan is to recover μ from the moments, but from the first moment we cannot really get anything.

So then what you do is you go to the second moment. The second moment, call it M2, is defined to be the expectation of the outer product of x with itself: M2 = E[x x^T]. Why is this called the second moment? It is really a matrix: entrywise, (M2)_{ij} = E[x_i x_j], the expectation of the product of two coordinates of the data, and you organize all of these into a matrix and call it M2.

And if you compute the second moment, you can actually see μ in it. How do we compute the second moment? The same way as before: with half the chance your x comes from the first Gaussian, and with half the chance your x comes from the second Gaussian. So what's the second moment of x under the first Gaussian? This requires a little bit of calculation.
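As a quick numerical check of the story so far, here is a small sketch (my own, not from the lecture) that estimates the empirical first and second moments of the symmetric two-Gaussian mixture. The first moment cancels to roughly zero, as just argued, while the second moment still depends on μ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000
mu = np.array([2.0, -1.0, 0.5])   # hypothetical ground-truth mean

# Draw from the symmetric mixture: x ~ N(+mu, I) or N(-mu, I), each with prob 1/2.
signs = rng.choice([1.0, -1.0], size=(n, 1))
x = signs * mu + rng.standard_normal((n, d))

M1_hat = x.mean(axis=0)      # empirical first moment: cancels to ~0, no info about mu
M2_hat = (x.T @ x) / n       # empirical second moment, estimate of E[x x^T]
```

The calculation that follows in the lecture shows why `M2_hat` is useful: its population version works out to μμ^T + I.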
[00:57:01] bit of calculation so let's do that here so suppose [00:57:06] X come from a gaussian which mean mu and [00:57:09] X come from a gaussian which mean mu and covariance acidity [00:57:11] covariance acidity and what is the second moment [00:57:15] maybe let's have a different letter for [00:57:17] maybe let's have a different letter for this so that we don't call it X let's [00:57:18] this so that we don't call it X let's call it Z [00:57:20] call it Z foreign [00:57:28] so how do you compute this [00:57:30] so how do you compute this um this so there are several ways where [00:57:32] um this so there are several ways where one way is that you just literally look [00:57:33] one way is that you just literally look at each of the coordinates and try to [00:57:36] at each of the coordinates and try to complete expectations that's perfectly [00:57:37] complete expectations that's perfectly fine so here I'm going to be a little [00:57:39] fine so here I'm going to be a little lazy I'm going to write that this is [00:57:41] lazy I'm going to write that this is equals to expectation of Z [00:57:44] equals to expectation of Z times expect from Z transpose plus the [00:57:46] times expect from Z transpose plus the covariance [00:57:48] covariance of C [00:57:49] of C right because covariance of Z is equals [00:57:51] right because covariance of Z is equals to [00:57:52] to the second moment minus the auto product [00:57:55] the second moment minus the auto product of the mean [00:57:57] of the mean and and the mean is Mu so the MU mu [00:58:01] and and the mean is Mu so the MU mu transpose and the covariance is identity [00:58:03] transpose and the covariance is identity so that's where we got mu mu transpose [00:58:05] so that's where we got mu mu transpose plus energy [00:58:07] plus energy and and for the second so basically you [00:58:10] and and for the second so basically you get a half times mu mu transpose [00:58:13] get a half times mu mu transpose 
personality [00:58:14] personality and then for the second condition [00:58:16] and then for the second condition actually the the movement is the same [00:58:18] actually the the movement is the same just because mu and minus mu [00:58:21] just because mu and minus mu is the same if you square it [00:58:24] is the same if you square it to get a half times new mule transpose [00:58:28] to get a half times new mule transpose So So eventually you get new mule [00:58:30] So So eventually you get new mule transpose plus identity [00:58:36] okay so now it looks good because at [00:58:38] okay so now it looks good because at least mu seems to come from you know [00:58:40] least mu seems to come from you know meal is uh [00:58:42] meal is uh uh can be in something very dark from [00:58:45] uh can be in something very dark from the moment right so if you guide the [00:58:47] the moment right so if you guide the second moment you subtract I you can [00:58:49] second moment you subtract I you can recover mu [00:58:50] recover mu right so basically what you do is you [00:58:52] right so basically what you do is you say you first you estimate I'm but you [00:58:55] say you first you estimate I'm but you don't even necessarily know I'm two [00:58:56] don't even necessarily know I'm two exactly right so you estimate M2 [00:58:59] exactly right so you estimate M2 by the empirical samples [00:59:06] so what's the empirical samples so you [00:59:08] so what's the empirical samples so you define this empirical moment [00:59:10] define this empirical moment as the empirical [00:59:13] as the empirical second moment [00:59:17] and and then you recover [00:59:21] a meal from M to hat [00:59:25] a meal from M to hat you know by pretending M2 is the same as [00:59:29] you know by pretending M2 is the same as uh same as um M2 hat is the same as M2 [00:59:32] uh same as um M2 hat is the same as M2 right so for example you can recover mu [00:59:35] right so for example you can recover mu by 
how to do this one way to do it is [00:59:38] by how to do this one way to do it is you can subtract I from M to hat and [00:59:41] you can subtract I from M to hat and then try to take the square square root [00:59:43] then try to take the square square root of it [00:59:44] of it and and here I'm going to do uh one so [00:59:48] and and here I'm going to do uh one so so so so basically [00:59:49] so so so basically um the key thing is that [00:59:53] um the key thing is that um so how to recover right so so let's [00:59:56] um so how to recover right so so let's do a warm-up so there are I guess um [00:59:59] do a warm-up so there are I guess um in some sense to recover it from M2 hat [01:00:02] in some sense to recover it from M2 hat the first thing you want to make sure is [01:00:03] the first thing you want to make sure is that you can recover it from M2 [01:00:05] that you can recover it from M2 right so this is kind of like a premises [01:00:10] right so this is kind of like a premises can recover [01:00:12] can recover new [01:00:14] new from M2 [01:00:16] from M2 right and and we have argued that this [01:00:18] right and and we have argued that this is actually true because you can just [01:00:19] is actually true because you can just subtract I from M2 and then take the [01:00:22] subtract I from M2 and then take the square root there's another way to do it [01:00:24] square root there's another way to do it which is uh so another way [01:00:29] which is the spectral method [01:00:31] which is the spectral method I'm going to introduce this here because [01:00:34] I'm going to introduce this here because [Music] [01:00:34] [Music] um [01:00:36] um um it's going to be useful for uh for [01:00:40] um it's going to be useful for uh for the future cases so how to recover [01:00:44] the future cases so how to recover transpose personality what you do is you [01:00:46] transpose personality what you do is you take the top economactor [01:00:52] of M2 
[01:00:55] of M2 is actually equals to Mu over the norm [01:00:59] is actually equals to Mu over the norm of mu [01:01:01] of mu let's call this mu bar [01:01:03] let's call this mu bar so the top eigen Vector of M2 is [01:01:05] so the top eigen Vector of M2 is actually exactly in the direction of our [01:01:08] actually exactly in the direction of our new bar [01:01:10] new bar so and and this is something like and [01:01:14] so and and this is something like and also the ideal value [01:01:17] is the top eigenvalue is Mu two Norm [01:01:21] is the top eigenvalue is Mu two Norm square plus identity [01:01:24] square plus identity now this is something like you know you [01:01:26] now this is something like you know you can verify relatively easily right so [01:01:28] can verify relatively easily right so because eigenvector [01:01:32] of mu mu transpose is Mu bar [01:01:36] of mu mu transpose is Mu bar um and then eigenvector [01:01:40] of mu mule transpose personality is the [01:01:43] of mu mule transpose personality is the same [01:01:46] this is just because if you add [01:01:49] this is just because if you add identity to animatrix you don't change [01:01:51] identity to animatrix you don't change the against on the eigen system [01:01:53] the against on the eigen system you don't change diagram back first you [01:01:56] you don't change diagram back first you only change the eigenvalue the [01:01:57] only change the eigenvalue the eigenvalue [01:01:59] eigenvalue the eigenvalues [01:02:01] the eigenvalues got [01:02:04] got increment by one [01:02:10] that's that's what happens when you add [01:02:12] that's that's what happens when you add identity to a 20 Matrix [01:02:15] identity to a 20 Matrix so [01:02:16] so so so so so so [01:02:18] so so so so so you can see that from M2 you can recover [01:02:21] you can see that from M2 you can recover mu either using a simple subtraction on [01:02:25] mu either using a simple subtraction on square root or you 
can do this eigendecomposition.

[01:02:29] And this actually corresponds to the infinite-data case, because when you have infinite data you can literally compute M2 — the empirical average will be exactly equal to the population moment. So now the question becomes: what if you don't have infinite data? You don't have M2; you only have M2-hat. So basically you need to recover μ from M2-hat, using the same algorithm — the same subtract-the-identity operation — applied to M2-hat. And you need this algorithm to be robust to errors, in the sense that if you have two similar matrices, M2 and M2-hat, then applying the algorithm gives you similar answers. If that's the case, you get similar answers as if you had computed on M2, so you get an approximate estimate of μ.
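To make the two recovery routes concrete, here is a minimal numerical sketch in the noiseless case (population M2 = μμᵀ + I); the particular μ is made up for illustration, and both routes recover μ only up to a global sign:

```python
import numpy as np

# Hypothetical mean vector, for illustration only.
mu = np.array([3.0, -1.0, 2.0])
d = mu.shape[0]

# Population second moment of N(mu, I): M2 = mu mu^T + I.
M2 = np.outer(mu, mu) + np.eye(d)

# Route 1: subtract I, then take a "square root" of the rank-1 matrix
# mu mu^T: dividing a column by the square root of its diagonal entry
# recovers mu up to sign.
B = M2 - np.eye(d)
j = int(np.argmax(np.diag(B)))
mu_sub = B[:, j] / np.sqrt(B[j, j])

# Route 2: the spectral method. The top eigenvector of M2 is mu/||mu||,
# and the top eigenvalue is ||mu||^2 + 1, since adding I shifts every
# eigenvalue up by one without changing the eigenvectors.
eigvals, eigvecs = np.linalg.eigh(M2)   # eigenvalues in ascending order
lam, v = eigvals[-1], eigvecs[:, -1]
mu_spec = np.sqrt(lam - 1.0) * v

# Fix the sign ambiguity for comparison with the true mu.
mu_sub = mu_sub if mu_sub @ mu > 0 else -mu_sub
mu_spec = mu_spec if mu_spec @ mu > 0 else -mu_spec
```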
[01:03:45] And it turns out that this robustness is often okay, at least in a qualitative sense: most of the algorithms we are going to discuss are robust to some errors. So actually the most important thing is this: we are going to focus mostly on the infinite-data case, because most of the algorithms are robust to errors. The error-analysis part is important if you really want to publish a paper, but for the core idea you don't have to do the error analysis, because most of the algorithms are reasonably robust.

[01:04:53] Any questions so far? Okay. So we have basically completed the discussion of the mixture of two Gaussians, and now
let's deal with a mixture of more than two Gaussians. And you will see that the point is that you cannot just use the first and second moments — you actually have to go to the third moment — and that will make things a little more complicated.

[01:05:27] So the general approach is that you compute M1, which is the expectation of X; M2, which is the expectation of XXᵀ; and M3 — what is M3, the third moment? M3 is the expectation of X ⊗ X ⊗ X. If you're not familiar with this notation: X ⊗ X ⊗ X is a third-order tensor of dimension d × d × d. Let's call it T; then T is a third-order tensor, and the (i, j, k) entry of this tensor is equal to x_i · x_j · x_k. In some sense, X ⊗ X is basically just a rewriting of XXᵀ, and X ⊗ X ⊗ X is defined like this.

[01:06:33] You can also have a ⊗ b ⊗ c: suppose T′ = a ⊗ b ⊗ c; then by definition T′_{ijk} = a_i · b_j · c_k. That's why, if you look at the (i, j, k) entry of M3, it is the expectation of x_i · x_j · x_k — basically, every entry of this third-order tensor M3 is the expectation of the product of three coordinates of the data. You can do this even for M4 or M5, and so forth.

[01:07:36] And then you design an algorithm — maybe let's call it A — that takes in moments and outputs θ: you want to recover the parameter θ from the moments. If you can do this, the last step is to show that A is robust to errors, and then apply A to the empirical moments.
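The tensor notation can be made concrete in a few lines of NumPy (the vectors here are made up just to exercise the definition):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
d = x.shape[0]

# T = x ⊗ x ⊗ x is a d x d x d tensor with T[i, j, k] = x[i] * x[j] * x[k].
T = np.einsum('i,j,k->ijk', x, x, x)

# More generally, (a ⊗ b ⊗ c)[i, j, k] = a[i] * b[j] * c[k].
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([3.0, 4.0])
T_abc = np.einsum('i,j,k->ijk', a, b, c)
```

Note that the second-order version of the same construction, `np.einsum('i,j->ij', x, x)`, is exactly the outer product x xᵀ, matching the remark that X ⊗ X is a rewriting of XXᵀ.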
[01:08:28] And what is the order of the moments? So this — applying A to the empirical moments — is the final algorithm; all the previous steps are kind of the process of designing the algorithm. So what order of moments do you have to use — do you need the third moment, or all the moments? That depends on how many moments you need to recover the parameter θ. If you can recover it from the first and second moments, then sure, two moments are fine; if you need three moments to recover, then you need M3; otherwise you may even need M4 — and in some cases we indeed need M4. I think even in a case we're going to discuss, we will use M4, and tensors, for the first time. Questions?

[01:09:29] Okay, so I think I have only
about 15 minutes left. So let's talk about a mixture of k Gaussians, and I'm going to show you that you actually need at least the third moment when the number of components is not just two. And this is very typical: in most cases you need at least the third moment. Actually, it's not very easy to find a case where the second moment suffices — I had to think about which case that would be, and found this two-component mixture of Gaussians; in almost all other cases you need the third moment.

[01:10:22] So let's again make it simpler: assume this is a mixture of Gaussians with a uniform mixture, so all the components show up with equal probability. Basically, you sample i uniformly from {1, …, K}, and then you generate X from a Gaussian with mean μ_i and covariance identity. This is the generative model for the data. Alternatively, you can write X as sampled from the average of these K distributions.

[01:11:05] And in all the follow-up we are only going to do steps (a) and (b) — we won't do the robustness step — for all examples in this course, including the examples in the next lecture. The robustness analysis can be done, but it requires too much mathematical machinery which is not really needed for this course; steps (a) and (b) are really the core thing that enables this.

[01:11:41] So now let's try to compute the moments and see which moment is enough for us to recover the means. Again, let's compute the first moment. There are K possible cases, and each case arises with probability 1/K — each cluster
shows up with probability 1/K. Conditioned on cluster i, your mean is μ_i, so taking the sum over them gives M1 = (1/K) Σᵢ μ_i.

[01:12:18] So clearly, from the first moment you only know the average of the means; you probably wouldn't be able to recover each individual mean. That sounds reasonable. So now let's look at the second moment.

[01:12:43] For the second moment, I guess we still do this kind of total expectation: you condition on the hidden variable i — the latent variable — and look at the second moment of that Gaussian. We have shown that for every such Gaussian the second moment is μ_i μ_iᵀ + I, and we take the sum over i from 1 to K with weight 1/K: M2 = (1/K) Σᵢ μ_i μ_iᵀ + I. Basically, it's the average of the outer products μ_i μ_iᵀ, plus the identity.
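As a sanity check on these two formulas, a short simulation (with made-up means, and a loose tolerance since it is Monte Carlo) compares the empirical moments of the uniform mixture against M1 = (1/K) Σ μ_i and M2 = (1/K) Σ μ_i μ_iᵀ + I:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical means for a uniform mixture of K Gaussians in R^d.
mus = np.array([[4.0, 0.0], [-2.0, 3.0], [0.0, -3.0]])  # shape (K, d)
K, d = mus.shape

# Generative model: sample cluster i uniformly, then X ~ N(mu_i, I).
n = 200_000
idx = rng.integers(K, size=n)
X = mus[idx] + rng.standard_normal((n, d))

# Population moments derived on the board.
M1 = mus.mean(axis=0)               # (1/K) Σ mu_i
M2 = (mus.T @ mus) / K + np.eye(d)  # (1/K) Σ mu_i mu_i^T + I

# Empirical moments converge to the population ones.
M1_hat = X.mean(axis=0)
M2_hat = (X.T @ X) / n
```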
[01:13:21] So the question becomes: suppose you just want to use the first moment and the second moment. The question becomes: can we recover the μ_i's from M1 and M2 — or maybe more specifically, from the average of the μ_i and the average of the μ_i μ_iᵀ?

[01:13:56] And the claim is that this is not possible, at least when K is at least three. There are two kinds of arguments. One argument is the following: the reason why this is not possible is that these quantities are just not enough information for you to recover the means — in some sense you are still missing a rotation; the rotation is the missing information. What does that really mean? Let me specify. Just to make the discussion easier, let's define U to be the collection of the means, U = [μ_1, …, μ_K], a matrix of dimension d × K. So this is the matrix you want to recover.

[01:15:00] I'm claiming that there exist two sets of means that have the same averages — the same quantities here, both M1 and M2 — even though the means are different. I'm going to construct such a situation. How do I do that? I'm going to take a rotation matrix R of dimension K × K, and I'm going to rotate on the right: basically, I'm going to consider U versus U·R. If you rotate on the right-hand side, you get a different set of means. I'm going to claim that U and U·R have the same statistics — the same two quantities. So first, if you look at the average of the outer products μ_i μ_iᵀ, this is (1/K)·UUᵀ in our simplified notation, and this is equal to (1/K)·(UR)
(UR)ᵀ. This is because RRᵀ = I — that's the definition of a rotation matrix. So that means U and UR are not distinguishable from this quantity, the average of the outer products μ_i μ_iᵀ.

[01:16:35] Now let's look at the first moment. To make the first moment also indistinguishable, I additionally have to take R such that R·𝟙 = 𝟙, where 𝟙 is the all-ones vector. So you want a rotation that does not rotate the direction of the all-ones vector. That's easy — you have so many rotations. It's kind of like a globe: there is one direction you don't change, but you can still rotate in the other directions. Because the dimension is K here, you still have a lot of degrees of freedom to choose many different R's satisfying this, as long as K is somewhat big — say at least three.

[01:17:28] So suppose R satisfies this. Then, writing (1/K) Σᵢ μ_i as (1/K)·U·𝟙, I'm claiming that this is equal to (1/K)·U·R·𝟙, just because I designed R like this. That's why, from this average-of-columns statistic, you still cannot distinguish U from UR. So U and UR are not distinguishable, because both exactly match the first moment and the second moment. That's why we need to go to M3 to uniquely identify the columns of U.
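The construction can be checked numerically: build a K × K rotation R that fixes the all-ones vector by rotating only within the subspace orthogonal to it. The means U and the rotation angle below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 5, 4
U = rng.standard_normal((d, K))   # columns are hypothetical means mu_i
ones = np.ones(K)

# Orthonormal basis Q whose first column spans the all-ones direction.
Q, _ = np.linalg.qr(np.column_stack([ones, rng.standard_normal((K, K - 1))]))

# Rotate only in the remaining K-1 coordinates (a Givens rotation here),
# then change basis back: R fixes the all-ones direction but is not
# the identity.
theta = 0.7
G = np.eye(K)
G[1, 1] = G[2, 2] = np.cos(theta)
G[1, 2], G[2, 1] = -np.sin(theta), np.sin(theta)
R = Q @ G @ Q.T

V = U @ R   # a genuinely different set of means with identical M1, M2
```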
[01:18:40] Okay, I think we are five minutes early, but the next thing would probably take much more than five minutes, so I guess I will just stop here to see whether there are any questions. The next lecture will continue with solving this question using M3. Any questions?

[01:19:09] Yeah — so the question is how you infer the number of Gaussians. First of all, you are indeed right that in the current formulation I'm assuming I know exactly the number of Gaussians; I'm even assuming that I know all the mixing probabilities for each component — p_1 up to p_K are exactly equal to 1/K. So the question is how you infer the number of Gaussians, and maybe also how you infer p_1 to p_K. There are ways — of course, various ways depending on what assumptions you make — but it is definitely possible.

[01:19:49] For example, one way that would work in certain cases is to infer the number of Gaussians by looking at the rank of this matrix (1/K) Σᵢ μ_i μ_iᵀ. Suppose you believe that the μ_i's are not degenerate — that they are all in general position. Then the rank of this matrix will be K, at least when K is less than d. So you can infer the number of Gaussians K by looking at the rank of this matrix. But I'm not saying that's actually a really great method, because in practice you may run into other issues — maybe your conditions are not exactly met, and so on — so there are many other ways. Empirically, the most typical way to estimate the number of Gaussians is to use nonparametric Bayesian methods, which I guess is not something we will cover here.
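A quick sketch of the rank heuristic just mentioned: with randomly drawn (hence, almost surely general-position) means and K < d, the rank of (1/K) Σ μ_i μ_iᵀ recovers K. The sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 10, 4   # hypothetical sizes with K < d

# Randomly drawn means are in general position with probability 1.
mus = rng.standard_normal((K, d))

# S = (1/K) Σ mu_i mu_i^T has rank exactly K when the mu_i are
# linearly independent.
S = (mus.T @ mus) / K
K_inferred = int(np.linalg.matrix_rank(S))
print(K_inferred)
```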
So for the theoretical setup we are mostly interested in the cleanest case, I think, where you know everything — and even with the knowledge of the number of components, it is still a nontrivial question to recover the μ_i's.

[01:21:03] [inaudible question]

[01:21:28] Typically they are independent. But one of them could lie approximately in the subspace of the others, and then it becomes tricky — whether you are robust to that, and so on. So, loosely speaking, it's reasonable, but if you really look at the details it's not that great; that's why you sometimes need other methods.

[01:21:57] Okay — I guess if there are no other questions, I will see you next Monday. ================================================================================ LECTURE 018 ================================================================================ Stanford CS229M - Lecture 19: Mixture of Gaussians, spectral clustering Source: https://www.youtube.com/watch?v=E6rZeGIKdRY
--- Transcript

[00:00:05] Okay, I guess let's get started. Let's see — is this working? Yes. So last time we talked about the method of moments, and today we're going to continue with unsupervised learning. First we're going to continue with the moment method, and here we're going to talk about higher-order moments. Then we're going to talk about something called clustering — spectral clustering, to be more technical. These are different types of unsupervised-learning algorithms.

[00:00:57] So, just to continue with what we had last time: last time we ended with this mixture of Gaussians. The setup was that you have some X which is sampled from a mixture of K Gaussians with means μ_i and covariance identity.

[00:01:19] Last time, at the beginning, we talked about K = 2 — a mixture of two Gaussians — and in that special case you can just take the second moment to recover the μ_i's. Then we moved on to K bigger than two, and in that case we argued that if you take the second moment, it is something like (1/K) Σᵢ μ_i μ_iᵀ (plus identity), and this is not enough to recover the μ_i's: given this second moment, you still cannot identify the μ_i's precisely, because there are multiple sets of means that produce exactly the same second moment.

[00:02:07] So that motivates us to consider the third moment. The third moment, as we discussed, is the expectation of X ⊗ X ⊗ X — a third-order tensor of dimension d × d × d. Let's compute the third moment, with the hope that it will tell us enough about the μ_i's that we can recover them
[00:02:36] And that's indeed the case. So what we do here is the following: we compute the third moment. The initial step is always the same, because you have a mixture of K clusters: you write the moment as (1/K) times the sum of the moment conditioned on each cluster i, where i is the cluster ID. Now the question becomes: if you have an X drawn from a Gaussian, what is the third moment — what is this expectation of X ⊗ X ⊗ X conditioned on i? So let's do a simplification — this is an abstraction, in some sense — just to make the notation simpler. Suppose Z is drawn from a Gaussian with mean a (it's called a to distinguish it from the μ_i's) and
identity covariance. [00:03:43] Then the question is: what is the expectation of Z ⊗ Z ⊗ Z? We have this lemma: if Z ~ N(a, I), then E[Z ⊗ Z ⊗ Z] is pretty much equal to a ⊗ a ⊗ a, but with some caveats — there are some other terms, which look like this:

E[Z ⊗ Z ⊗ Z] = a ⊗ a ⊗ a + Σ_{l=1}^{d} ( E[Z] ⊗ e_l ⊗ e_l + e_l ⊗ E[Z] ⊗ e_l + e_l ⊗ e_l ⊗ E[Z] ).

(Sorry — this is not X, this is Z; there is no X in this lemma, we already changed notation to Z.) Note that the expectation of Z is literally a, so this formula already expresses the third moment of Z as a function of a. That makes sense, because a decides everything, so everything at the end should be a function of a.
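As a numerical sanity check on this lemma (my own sketch, with an arbitrarily chosen mean a), one can compare the empirical third moment of Z ~ N(a, I) against the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.3, -0.5, 0.8])  # an arbitrary mean vector, d = 3
d = len(a)

# Empirical third moment E[Z (x) Z (x) Z] for Z ~ N(a, I).
Z = a + rng.normal(size=(1_000_000, d))
emp = np.einsum('ni,nj,nk->ijk', Z, Z, Z) / len(Z)

# The lemma: a(x)a(x)a + sum_l (E[Z](x)e_l(x)e_l + e_l(x)E[Z](x)e_l + e_l(x)e_l(x)E[Z]),
# where E[Z] = a.
T = np.einsum('i,j,k->ijk', a, a, a)
for e in np.eye(d):
    T += (np.einsum('i,j,k->ijk', a, e, e)
          + np.einsum('i,j,k->ijk', e, a, e)
          + np.einsum('i,j,k->ijk', e, e, a))

print(np.abs(emp - T).max())  # small, up to Monte Carlo error
```

The maximum entrywise gap shrinks like 1/√n, so with a million samples it is well below 0.05.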
[00:05:23] The reason why we still write E[Z] in this formula is that we want to implicitly say that this term is about the first moment. Maybe the more important thing is that this means we can compute a ⊗ a ⊗ a from linear combinations of the third moment and the first moment. Why is it useful to get a ⊗ a ⊗ a? I think it will become clear — we'll see it in a moment. But this lemma tells you that if you know the first moment and the third moment, you can get a ⊗ a ⊗ a. (Sorry, I keep messing up with the light right here.) Okay — any questions so far? I guess it's not exactly clear yet why this lemma is useful;
[00:06:49] at the current point, the main point is that you can compute a^⊗3 out of the third moment when Z is just a Gaussian. I'm going to show the proof. The proof is nothing super interesting, but it tells you how to do this kind of derivation for moments, and once you have seen it once, all the others become kind of trivial. So how do you compute the third moment? What you do is compute it entry by entry: you look at the (i, j, k) entry, which is just E[z_i z_j z_k], where z_i denotes the i-th coordinate. (Sorry for the note-takers — I am changing my notation to v here, just to be consistent: the mean is just a generic vector, and somehow later I use v, so let's change a to v.)
[00:08:00] So what we do to compute this moment is, in some sense, brute force. What is E[z_i z_j z_k]? You can write z_i = v_i + ζ_i, z_j = v_j + ζ_j, and z_k = v_k + ζ_k. Here we are using the fact that Z as a vector is equal to v + ζ, where ζ is drawn from a spherical Gaussian — that's the definition of ζ: ζ = Z − v, which has a spherical Gaussian distribution. And v_i, by the way, is the i-th coordinate of the mean, just to be clear. Then we can expand this product; there are eight terms. One of the terms is v_i v_j v_k — that's easy: v is deterministic while ζ is random. Some of the terms will look like E[v_i v_j ζ_k]; another term is
[00:09:14] E[v_i v_k ζ_j], plus E[v_j v_k ζ_i]. These terms will be equal to zero, because the expectation of the Gaussian ζ is zero and v is a deterministic quantity — that's why they are going to be zero. Then we have three other terms, which look like E[v_i ζ_j ζ_k] + E[v_j ζ_i ζ_k] + E[v_k ζ_i ζ_j]. These terms are a little bit different; let me deal with them in a moment. And the last type of term is the product of the three Gaussian coordinates, E[ζ_i ζ_j ζ_k]. So how do we deal with the remaining four terms? The thing is, if you look at E[ζ_i ζ_k], this is equal to zero if i is not equal to k, because ζ_i and ζ_k are two independent random variables and you can
[00:10:27] factorize it to get E[ζ_i] E[ζ_k], which are both zero, so you get zero. And it is one if i is equal to k — maybe let's have one more step: in that case it equals E[ζ_i²], which equals 1. So in summary, E[ζ_i ζ_k] equals the indicator that i equals k. You can also deal with E[ζ_i ζ_j ζ_k]; here you can try to do the same thing, dividing into different cases: whether i, j, k are all the same, or two of them are the same and the third is different. There are a few cases, and if you go through all of those cases, it turns out the expectation is always zero, regardless of the choice of i, j, k — but for different reasons. For example, when i, j, k
are [00:11:33] all the same, this is the third power of ζ_i, so it equals E[ζ_i³], and that's zero because the third moment of a standard Gaussian is zero. When i equals j and neither equals k, you do another, different calculation, but generally you can do all of the calculations and they are all equal to zero. I think the fundamental reason is that as long as you have an odd-degree monomial of these ζ_i's, it doesn't matter which one — the expectation is always going to be zero. So these are all somewhat elementary calculations, and if you use them, then you can continue: you get E[z_i z_j z_k] = v_i v_j v_k + v_i · 1{j = k} + v_j · 1{i = k} + v_k · 1{i = j}.
[00:12:32] And this pretty much completes the proof — you just have to rewrite this in tensor form. If you verify the target equation entry by entry, you see that it is exactly the same. So, here is our target equation, and here is what we got for every entry; let's just verify that they are the same thing — it's just a reorganization. How do you verify it? You take the (i, j, k) coordinates. So the question is what the (i, j, k) coordinate of v ⊗ e_l ⊗ e_l is. It always has a v_i in it, because v_i is always there; but the (j, k) coordinate of e_l ⊗ e_l depends on l — basically you
[00:14:05] have to check when it is non-zero. I guess one way to write it is that, if you really do it, the (i, j, k) entry is v_i times the j-th coordinate of e_l times the k-th coordinate of e_l. In what case are the j-th coordinate of e_l and the k-th coordinate of e_l both non-zero? The only case is when l equals j and l equals k — that's the only case in which this can be non-zero. So this is equal to v_i exactly when l equals j and l equals k, and the only way that can happen is when j equals k; otherwise it's going to be zero. So that's how you verify it. I don't expect you to verify it completely on the fly, and in some sense the exact formula doesn't matter that much.
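This entry-by-entry verification can also be done mechanically; here is a small numpy check (my own, not from the lecture) that the coordinate formula from the proof matches the tensor form of the lemma for an arbitrary deterministic mean v:

```python
import numpy as np

d = 4
v = np.arange(1.0, d + 1)  # any deterministic mean vector
delta = np.eye(d)          # delta[i, j] = 1{i = j}

# Entrywise formula from the proof:
# E[z_i z_j z_k] = v_i v_j v_k + v_i 1{j=k} + v_j 1{i=k} + v_k 1{i=j}
entrywise = (np.einsum('i,j,k->ijk', v, v, v)
             + np.einsum('i,jk->ijk', v, delta)
             + np.einsum('j,ik->ijk', v, delta)
             + np.einsum('k,ij->ijk', v, delta))

# Tensor form: v(x)v(x)v + sum_l (v(x)e_l(x)e_l + e_l(x)v(x)e_l + e_l(x)e_l(x)v)
tensor_form = np.einsum('i,j,k->ijk', v, v, v)
for e in delta:
    tensor_form += (np.einsum('i,j,k->ijk', v, e, e)
                    + np.einsum('i,j,k->ijk', e, v, e)
                    + np.einsum('i,j,k->ijk', e, e, v))

print(np.allclose(entrywise, tensor_form))  # True
```

The key step is exactly the one argued above: Σ_l (v ⊗ e_l ⊗ e_l) has (i, j, k) entry v_i · 1{j = k}.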
[00:15:15] Either way, you only need to have some formula that relates a^⊗3 to the third moment of Z and lower moments — basically, you just need a formula. Okay, so any questions so far? So now let's see how we use it. How we use it is the following — and you can kind of see what exactly we need. Now you look at X: X is a mixture of Gaussians, while Z was only a single Gaussian, and you use this single-Gaussian lemma as a building block to compute the moments of the mixture of Gaussians. What you do is: conditioned on cluster i, X becomes a Gaussian, so you apply the lemma and get (1/K) times the sum over i from 1 to K of μ_i ⊗ μ_i ⊗ μ_i — because μ_i is taking the place of v — and then you have the additional three terms: μ_i ⊗ e_l ⊗ e_l, plus e_l
⊗ μ_i ⊗ e_l, plus e_l ⊗ e_l ⊗ μ_i — all inside the parentheses, summed over l from 1 to d. [00:16:54] Okay, so basically the third moment of X is a function of the μ_i's. It's still a little bit messy, so what you do is say: I'm going to get rid of all of these extra terms by using the first moment — that's the first-moment trick. So what you do is first reorganize a little bit: you get this somewhat clean-looking term, (1/K) Σ_i μ_i ⊗ μ_i ⊗ μ_i, and then for the remaining three terms you switch the (1/K) Σ_i with the sum over l: you get the sum over l from 1 to d of [(1/K) Σ_{i=1}^{K} μ_i] ⊗ e_l ⊗ e_l, plus the two other terms, which you can imagine what they look like — they are just permutations of this term, changing the order. And now this one,
[00:17:55] (1/K) Σ_i μ_i, becomes the first moment of X. So you get E[X] ⊗ e_l ⊗ e_l, and the same for the other two positions — something that depends only on the first moment of X. So what does this mean? It means you can move those three terms to the left-hand side. So basically this means that we can compute the tensor (1/K) Σ_i μ_i ⊗ μ_i ⊗ μ_i from the third moment and the first moment. That's basically our interface: once you have this tensor, then there is a next step — we'll go from there. (Let me also introduce a notation I should have introduced earlier: a^⊗3 is just shorthand for a ⊗ a ⊗ a; that's just notation, with no deeper purpose.) So what this whole computation is saying is that you can now compute the sum of the third tensor powers of the μ_i's.
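This "interface" can be sketched in numpy (my own illustration; the population moments here are written down analytically via the lemma, whereas in practice they would be estimated from samples): subtracting the first-moment correction terms from the third moment leaves exactly (1/K) Σ_i μ_i^⊗3.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 3
mus = rng.normal(size=(K, d))  # the unknown means (uniform weights, identity covariance)

def outer3(x, y, z):
    return np.einsum('i,j,k->ijk', x, y, z)

def correction(m1, d):
    # sum_l (m1 (x) e_l (x) e_l + e_l (x) m1 (x) e_l + e_l (x) e_l (x) m1)
    c = np.zeros((d, d, d))
    for e in np.eye(d):
        c += outer3(m1, e, e) + outer3(e, m1, e) + outer3(e, e, m1)
    return c

# Population first and third moments of the mixture, written down via the lemma.
M1 = mus.mean(axis=0)                              # (1/K) sum_i mu_i
target = sum(outer3(m, m, m) for m in mus) / K     # (1/K) sum_i mu_i^{(x)3}
M3 = target + correction(M1, d)

# Moving the first-moment terms to the left-hand side recovers the target
# tensor using only quantities estimable from data (M3 and M1).
recovered = M3 - correction(M1, d)
print(np.allclose(recovered, target))  # True
```

The point is that both M3 and M1 are observable, so the correction requires no knowledge of the individual μ_i's.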
[00:19:44] From this, you then need to design an algorithm to compute the μ_i's. If you can do what's inside this question mark, then you're done — the whole thing is solved, because you can first use the moments to compute this third-order tensor of the μ_i's, and then you can run that algorithm. There are actually some cleaner ways to deal with this: we don't have to deal with these additional terms — there are other ways to get this exact third-order tensor directly, in a cleaner way, but that requires a little bit of other machinery. That's why I'm only using this relatively brute-force way to get the third-order tensor. But the point is that you can always get something like this, so now the problem becomes this so-called tensor decomposition problem. Abstractly
speaking, this tensor decomposition problem is something like the following. You have a sequence of vectors a_1 up to a_K — these are all in dimension d, in our case — and these are unknown. What you are given is a tensor that looks like Σ_{i=1}^{K} a_i^⊗3, and your goal is to reconstruct the a_i's. You can also ask the same question for different orders of tensors — for example, for some order possibly bigger than three. And it turns out you can also get the fourth-order tensor power from this moment method: if you take the fourth moment of the data, you can get the sum of a_i^⊗4, with some similar rearrangement. So basically this is the kind of interface:
you [00:21:55] basically reduce the moment problem — the original learning problem — to this so-called tensor decomposition problem. This tensor decomposition problem also has some standard notation, so let me give you some notions — some basic notions. First, the rank of a tensor. Let's say a ⊗ b ⊗ c is a rank-one tensor — this is the definition of a rank-one tensor. Then the rank of a tensor T is the minimum K such that T can be written as a sum of K rank-one tensors. In some sense, the reason for this decomposition is that you observe a sum of rank-one tensors and you want to decompose it into its components, each of which is rank one.
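These definitions can be made concrete with a short numpy sketch (my own): build a rank-one tensor as an outer product, and a tensor of CP rank at most K as a sum of K rank-one tensors. One simple footprint of low CP rank is that every matrix slice of the tensor has matrix rank at most K.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 3, 2

# A rank-one tensor is an outer product a (x) b (x) c.
a, b, c = rng.normal(size=(3, d))
rank_one = np.einsum('i,j,k->ijk', a, b, c)

# A tensor of CP rank at most K: a sum of K rank-one tensors.
A, B, C = rng.normal(size=(3, K, d))
T = sum(np.einsum('i,j,k->ijk', A[r], B[r], C[r]) for r in range(K))

# Each slice T[:, :, k] = sum_r C[r, k] * A[r] B[r]^T is a sum of K
# rank-one matrices, so its matrix rank is at most K.
print(all(np.linalg.matrix_rank(T[:, :, k]) <= K for k in range(d)))  # True
```

Recovering the individual components A[r], B[r], C[r] from T is the decomposition problem itself, which is the algorithmic question discussed next.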
[00:23:21] This question is sometimes called CP decomposition, because there are some other decompositions of tensors that can also be meaningful in other cases. But actually it's also fine to just call it tensor decomposition, because this is the most popular decomposition for tensors. Okay, so now it becomes a very modularized, algorithmic question: given a low-rank tensor, how do you figure out the low-rank components? What I'm going to do is basically list some of the existing results without really talking about details, because this is actually what happened in this area — I think this area became very popular around 2012-2013.
[00:24:21] In the very beginning, I think a few papers laid out the framework for this kind of pipeline: how do you compute the moments, and how do you convert them into a tensor decomposition problem. And those papers either solved some easy tensor decomposition problems or invoked some of the existing tensor decomposition algorithms; those were the early papers. [00:24:42] And then, somewhere along the way, this question became two parts, right? One part is about the moments (how do you turn the moments into a tensor), and the second part is how to decompose the tensor. [00:24:54] So there are a lot of papers, involving some of my works as well; there are actually a lot of works that try to understand how you can decompose all different kinds of
tensors, like under what conditions you can decompose them. [00:25:11] So maybe what I'm going to do is list a few conditions under which you can decompose these tensors computationally efficiently, and those conditions will then turn into conditions for the upstream problem; for example, in the mixture-of-Gaussians problem you are going to get some conditions. [00:25:31] So, just to set up the basics. [00:25:43] Let me see where I wrote this. [00:25:55] So maybe number zero is that in the most general case, or in the worst case (in the more TCS language you would call it the worst case), this problem is not solvable: [00:26:14] finding the a_i's is computationally hard. [00:26:23] Actually there are several layers here as well, if you want to discuss the details. In the very worst case,
the a_i's are not even unique: you don't have a unique decomposition. [00:26:36] And even when the decomposition is unique, there are also cases where the decomposition is unique but you cannot find it in a computationally efficient way. [00:26:48] [Student question, inaudible.] [00:26:56] So, if you take this 3 and replace the 3 by 2? Then it's pretty much like the symmetric case; this here is symmetric, but you can also make it asymmetric. But yes, you are right, it's basically linear algebra stuff, like SVD. [00:27:17] And this is a very good question. I think, in some sense, as you'll see in some of the discussion below, in some aspects tensor decomposition is quite close to matrix decomposition. [00:27:30] But there is one fundamental difference, and that fundamental difference is what makes this kind of tool powerful but also
challenging. It's powerful in the sense that it's fundamentally powerful [00:27:46] because here there is no rotational invariance. [00:27:58] I guess this "no rotational invariance" also has to be interpreted in a careful way. What I mean is that the sum of a_i^⊗3 is not the same as the sum of (R a_i)^⊗3. [00:28:18] However, this is true for matrices: if you have the sum of a_i a_i^T, this is the same as the sum of (R a_i)(R a_i)^T, where R is a rotation matrix. [00:28:35] Well, I guess it depends on how you rotate it; how do I say this? I probably shouldn't say this on the fly without thinking about the best way to state it. [00:28:49] I guess, technically, you rotate on the right. [00:28:57] So maybe let me not make it precise, but I think one thing to realize is that if you have matrices, you have A times A^T, something like this, which
is kind of like the sum of a_i a_i^T if you put all the a_i's as the columns of a capital A. [00:29:10] So then this equals (A R)(A R)^T = A R R^T A^T if R is a rotation matrix, and you just cannot do this for tensors, typically. [00:29:24] But what does happen here is that if you permute, if you have the a_i's and you permute their indices to a_i', where the a_i' are just a permutation of the a_i's, then the resulting sum, the third-order tensor, is still the same. [00:29:45] So you only have permutation symmetry, but no rotation symmetry, and this actually makes it somewhat powerful, because in many cases this is exactly the situation: for a mixture of Gaussians you can permute all the centers and it's still the same Gaussian mixture, but you cannot rotate the
coordinate system to make it the same; at least, you cannot take linear combinations of the centers and still maintain the same mixture of Gaussians. [00:30:17] And I think this also applies to neural networks: for neural networks you have the permutation symmetry, where you can permute the neurons in the intermediate layers, and also the associated edges, and still keep the functionality of the neural network exactly the same. [00:30:33] But you cannot do arbitrary rotations on it, because you have the nonlinearity from the activations. [00:30:41] Yeah, I guess this part is meant to be somewhat abstract; once you see a lot of this kind of math, you'll probably understand it a little better. [00:30:51] But anyway, there are some fundamental differences between this and linear algebra, and that's why tensor decomposition becomes difficult.
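The symmetry contrast above can be checked numerically. Below is a small numpy sketch (sizes chosen arbitrarily) of the "rotate on the right" version of the matrix statement: A A^T is unchanged when the components are mixed by an orthogonal matrix, while the order-3 symmetric tensor is unchanged only under permutations of the components.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 3
A = rng.standard_normal((d, K))                    # columns are components a_i
R, _ = np.linalg.qr(rng.standard_normal((K, K)))   # random orthogonal K x K

# Matrix case: A A^T = (A R)(A R)^T, so mixing the components "on the
# right" by R is invisible in the second moment.
assert np.allclose(A @ A.T, (A @ R) @ (A @ R).T)

def sym3(A):
    """Sum over columns a_i of the rank-one tensor a_i (x) a_i (x) a_i."""
    return sum(np.einsum('i,j,k->ijk', a, a, a) for a in A.T)

# Tensor case: mixing the components by R changes the tensor ...
assert not np.allclose(sym3(A), sym3(A @ R))
# ... but permuting the components does not.
perm = rng.permutation(K)
assert np.allclose(sym3(A), sym3(A[:, perm]))
```

This is the permutation-but-no-rotation symmetry that makes tensor components identifiable in a way matrix factors are not.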
[00:31:00] Especially in the worst case. Okay, going back to the list of questions. As I said, the starting point is that in the general case you cannot hope to do anything; but there are many cases where you can do something. [00:31:15] The easiest case is the orthogonal case. [00:31:20] The orthogonal case means that a_1 up to a_K are orthogonal. [00:31:30] And in this case, this is actually the closest to the eigenvector case, right? Here you can say that each a_i is a global minimizer; [00:31:55] there are multiple global minimizers, so each of them is a global minimizer (maximizer, actually) [00:32:03] of this objective function, where you maximize the tensor paired with a rank-one tensor, subject to the L2-norm constraint ‖x‖₂ = 1. [00:32:15] If you're not familiar with the notation, what this really means is that you take the sum over i, j, k of T_ijk times
x_i x_j x_k. [00:32:28] So this is the extension of the quadratic form for matrices, right? If you have a matrix, this is the quadratic form; for a tensor, this is the tensor form. [00:32:38] So eigenvectors can be defined in this way: if you change the tensor to a matrix, the eigenvector is what maximizes the quadratic form of the matrix. So in this sense, the components are some kind of eigenvectors. [00:32:55] And then you can find them. So this is an interesting property: it's saying that the a_i's are kind of like eigenvectors [00:33:06] of T. [00:33:10] And also, we can find them; it's not trivial to find them, but you can find the a_i's in polynomial time. [00:33:18] And actually, one way to find them is to try to solve this optimization problem. [00:33:27] So that's one way; that's one case.
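For the orthogonal case, one standard way to attack this maximization (my choice of method; the lecture does not name one here) is the tensor power iteration x ← T(I, x, x) / ‖T(I, x, x)‖. A minimal numpy sketch with orthonormal components:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 5, 3
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))   # orthonormal columns a_i
T = sum(np.einsum('i,j,k->ijk', a, a, a) for a in Q.T)

def tensor_power_iteration(T, iters=50, seed=0):
    """Iterate x <- T(I, x, x) / ||T(I, x, x)||; converges to some a_i."""
    r = np.random.default_rng(seed)
    x = r.standard_normal(T.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = np.einsum('ijk,j,k->i', T, x, x)   # the "tensor form" T(I, x, x)
        x /= np.linalg.norm(x)
    return x

x = tensor_power_iteration(T)
# x should align with one of the components (a tensor "eigenvector").
assert np.isclose(np.max(np.abs(Q.T @ x)), 1.0, atol=1e-6)
```

Deflating (subtracting the recovered rank-one piece) and re-running recovers the remaining components.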
[00:33:29] And another, more general case is the linearly independent case. [00:33:34] So it turns out that if a_1 up to a_K are linearly independent, [00:33:45] then this is also a good case: you can find the decomposition in polynomial time. I think the algorithm is Jennrich's algorithm. [00:33:53] I'm not going to describe all of these algorithms, just because it would take too much time; [00:34:00] and in some sense these are things that, as long as you have some basic knowledge, you can search for in the literature, and there are many papers about them. [00:34:11] So cases one and two are both about the so-called undercomplete case. [00:34:25] The undercomplete case really means that K, the number of components, is less than d. You can see that number one and number two can only happen in the case K less than d, because if K is bigger than d, there is no way that a_1 up to a_K are linearly independent.
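For the linearly independent case, the classical polynomial-time method is simultaneous diagonalization, usually attributed to Jennrich (my attribution; the audio is unclear here). A numpy sketch, taking K = d for simplicity so that the contracted slices are invertible:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
K = d                                   # K = d keeps the slice matrices invertible
A = rng.standard_normal((d, K))         # generic, so columns are linearly independent
T = sum(np.einsum('i,j,k->ijk', a, a, a) for a in A.T)

# Contract the third mode with two random vectors:
# T(I, I, x) = sum_i (a_i . x) a_i a_i^T.
x, y = rng.standard_normal((2, d))
Mx = np.einsum('ijk,k->ij', T, x)
My = np.einsum('ijk,k->ij', T, y)

# Mx My^{-1} = A diag((a_i.x)/(a_i.y)) A^{-1}: its eigenvectors are the
# a_i up to sign and scale, with generically distinct eigenvalues.
_, vecs = np.linalg.eig(Mx @ np.linalg.inv(My))
vecs = np.real(vecs)                    # eigenvalues are real here

# Every recovered eigenvector should align with some true component.
An = A / np.linalg.norm(A, axis=0)
cos = np.abs(An.T @ (vecs / np.linalg.norm(vecs, axis=0)))
assert np.all(cos.max(axis=0) > 0.99)
```

For K < d one can work with pseudo-inverses instead; robustness to noise is the delicate part, as the lecture notes at the end.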
Your number of components exceeds the dimension, so they cannot be linearly independent. [00:34:50] But actually you can also do this for the overcomplete case; the overcomplete case is still possible, [00:34:59] in certain cases. [00:35:02] So there are several different ways to deal with the overcomplete case, which means K is bigger than d. [00:35:07] The first one is that you can look at higher-order tensors. [00:35:15] So you can say: suppose a_1^⊗2 up to a_K^⊗2 are linearly independent. [00:35:29] This is a much more relaxed condition than a_1 up to a_K being linearly independent, because now you are in a higher dimension; this only requires K to be less than d², right, for this to be possible. [00:35:45] And suppose this is true; then you can just replace a_i by a_i^⊗2, so you can recover the a_i's from
[00:36:01] the sixth-order tensor. You recover the a_i^⊗2 from the sum over i = 1 to K of (a_i^⊗2)^⊗3, which is just the same as the sixth-order tensor, the sum of a_i^⊗6. [00:36:11] And how do you do it? You just invoke the third-order result on the a_i^⊗2, and after you get a_i^⊗2 you can get a_i by just taking a square root. [00:36:22] So this relaxes the restriction on K, but with the cost of estimating sixth moments, because how do you get this? This is a thing in R^(d^6), so you have to somehow do something with the sixth moment, and it will be less sample-efficient. [00:36:43] And another, slightly cleverer way to do this is that you can use the fourth-order tensor, [00:36:52] with the same condition. So you say: suppose you have a generic tensor. [00:37:00] And what does generic really mean? It means that you exclude an algebraic set of measure zero.
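The lifting trick above (replace a_i by a_i^⊗2 and view the sixth-order tensor as an order-3 tensor over R^(d²)) can be sketched in numpy with toy sizes. One caveat I'm adding: since vec(a a^T) lies in the d(d+1)/2-dimensional symmetric subspace, linear independence of the lifted components actually needs K at most d(d+1)/2, slightly tighter than the d² count.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K = 3, 5                      # overcomplete: K > d, yet K <= d(d+1)/2
A = rng.standard_normal((d, K))

# Lifted components b_i = vec(a_i a_i^T) = a_i^{(x)2}, living in R^{d^2}.
B = np.stack([np.outer(a, a).ravel() for a in A.T], axis=1)   # d^2 x K

# Generically the lifted b_i are linearly independent even though K > d.
assert np.linalg.matrix_rank(B) == K

# The order-3 tensor over R^{d^2} with components b_i is exactly the
# sixth-order tensor sum_i a_i^{(x)6}, reshaped to (d^2, d^2, d^2).
T6 = sum(np.einsum('i,j,k->ijk', b, b, b) for b in B.T)
assert T6.shape == (d * d, d * d, d * d)

# The "square root" step: given b_i, reshape it to the rank-one matrix
# a_i a_i^T and read off a_i as its top eigenvector (up to sign).
M = B[:, 0].reshape(d, d)
w, V = np.linalg.eigh(M)
a_rec = V[:, np.argmax(np.abs(w))] * np.sqrt(np.max(np.abs(w)))
assert np.allclose(np.outer(a_rec, a_rec), np.outer(A[:, 0], A[:, 0]))
```

The sign ambiguity in `a_rec` is harmless here: a_i and -a_i give the same symmetric even-order moments.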
You exclude a small, measure-zero set of tensors, and except for those tensors you can do this. And this is saying that when K is less than d², you can recover the a_i's from the fourth-order tensor. Right, so before, with the earlier reduction, you needed to use the sixth-order tensor; but now you only have to use the fourth-order tensor, and this algorithm is called FOOBI. [00:38:01] And you can also have a robust version of this; this algorithm by itself is not robust, but you can have robust versions of it. [00:38:10] I guess let me not write down the references here; I'll add the references later. These are references where you can get robust versions of these algorithms. [00:38:25] And if you want to be more ambitious, you want to say
that you want to deal even with the third-order tensor; then what you can do is you can say you have random tensors. [00:38:40] And by random, it means that you assume the a_i's are randomly generated, [00:38:54] say, as unit vectors. [00:38:58] I guess whether they are unit vectors is not that important, but for concreteness let's say they are all unit vectors, randomly distributed on the unit sphere. [00:39:06] And then, even for the third-order tensor, [00:39:18] K can be as large as [00:39:24] d^1.5. [00:39:26] So you can be somewhat overcomplete even with the third-order tensor. [00:39:32] And there are some references here, which I guess [00:39:36] I'll add to the notes eventually. [00:39:41] Okay, cool. So this is just a very quick (probably a little boring) list of references, but I guess you see the rough idea, right? For various conditions on the components a_i, [00:39:54] you can have various kinds of algorithms and different results. [00:39:58] So technically, if you have more
[00:40:01] restrictions on the a_i's, you get stronger results, right? So the strongest one would be: you assume they are random, and then you can even decompose overcomplete tensors [00:40:12] when the order is only three. [00:40:16] But if you don't have such strong assumptions, you have to go with the fourth-order tensor, or even the sixth-order tensor if you don't use the right algorithm. [00:40:24] So this is basically what's going on in this area, and you can see there are many, many papers that deal with different kinds of setups. [00:40:32] So I'll put some references in the lecture notes, but generally this is something you can search for on the internet. [00:40:41] And just before we conclude this part: there are other latent-variable models that can be handled [00:40:55] by the method of moments, using the same strategy, where you first
compute the moments and turn them into a tensor decomposition problem. [00:41:05] So you can do the so-called ICA, independent component analysis; and hidden Markov models; [00:41:12] and also topic models. [00:41:15] I think there are even more than this, and I'm just listing a few that are the most prominent. These are all latent-variable models for unsupervised learning, [00:41:24] and for each of these you can try to compute certain kinds of moments, [00:41:28] rearrange your moments so that you get a tensor, and then decompose the tensor to recover the true parameters. [00:41:39] Any questions? [00:41:47] [Student question, mostly inaudible.] [00:42:22] Um, I think, let me try to answer, and then you can clarify whether I'm answering the question. [00:42:30] So I guess the flow is that you first start with the data, and you compute some tensor, maybe this one, or maybe, let's say, the fourth-order one here; and of
course you cannot compute this exactly; you compute it approximately. [00:42:45] You have some error in estimating this fourth moment, and you know that if you didn't have any error, then this would be something like the sum of a_i^⊗4 from i = 1 to K; [00:42:56] and then you decompose, and you get the a_i's. [00:43:01] And I guess, where does the dependency come in? One thing is whether it's overcomplete or undercomplete, right? [00:43:08] Why does that matter? It matters because of what this K is in the mixture of Gaussians: K is the number of mixture components. [00:43:17] So if you can handle overcomplete tensor decomposition, that means that for the original problem you can handle more mixture components, right? The number of mixture components you can handle is more than the dimension. [00:43:30] And if you can only do undercomplete tensors, then your number of mixture components has to be less than the dimension. That's why people
care about overcomplete tensors. [00:43:51] [Student question.] The K here is something fixed; [00:43:55] I guess there is another thing, which is that K is, you know, the number of mixture components, something fixed and given. [00:44:06] I guess maybe what you're asking is: is this the empirical version? [00:44:16] Right, so the real thing is that you work on this, and then you say this is approximately equal to the sum of a_i^⊗4, and you decompose that approximate version. [00:44:27] So you also need your decomposition algorithm to be robust to some errors, because you don't know this low-rank tensor exactly; [00:44:39] you only know an approximate version of it. [00:44:44] And if I'm not answering the question, [00:44:48] go ahead; you know, maybe I'm not answering the right question.
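The point that you only ever have the moment tensor approximately can be illustrated with a toy sketch: a degenerate "mixture" whose samples are just the centers themselves (a made-up setup for illustration, not the lecture's actual Gaussian model). The empirical third-moment tensor converges to the exact low-rank tensor as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K = 4, 3
A = rng.standard_normal((d, K))                     # columns: centers a_i
# Population moment E[s^{(x)3}] for s drawn uniformly from {a_1, ..., a_K}.
T_exact = sum(np.einsum('i,j,k->ijk', a, a, a) for a in A.T) / K

def empirical_T(n):
    """Average s^{(x)3} over n i.i.d. samples: an unbiased estimate of T_exact."""
    z = rng.integers(K, size=n)                     # which center each sample hits
    S = A[:, z]                                     # d x n sample matrix
    return np.einsum('in,jn,kn->ijk', S, S, S) / n

# The estimation error shrinks (at the usual 1/sqrt(n) rate) as n grows.
err_small = np.linalg.norm(empirical_T(100) - T_exact)
err_large = np.linalg.norm(empirical_T(100_000) - T_exact)
assert err_large < err_small
```

This is why the decomposition algorithms need robust versions: the input is always this approximate tensor, never the exact sum of rank-one terms.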
think of tensor decomposition as a low-rank approximation [00:45:04] for the tensor, yes. [00:45:17] So all of these theorems I listed so far — they all [00:45:22] work for the approximate version, even though I didn't really talk about the [00:45:25] approximate version yet. So sometimes the first-order question is: even if you don't have [00:45:35] any approximation — you get exactly a low-rank tensor — [00:45:38] you have to be able to decompose it, and even that is non-trivial. For matrices it's trivial because you just [00:45:45] take the SVD, but for tensors it's not trivial. So the first step [00:45:49] is to say: given an exactly low-rank tensor, I can decompose it. And then the second [00:45:53] question is the so-called robustness, which means: you get an approximately [00:45:56] low-rank tensor — how do you decompose it?
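To make the matrix-versus-tensor contrast concrete, here is a minimal sketch of exact decomposition in the easiest tensor case: an orthogonally decomposable symmetric order-3 tensor, recovered by tensor power iteration with deflation. The orthogonality assumption, the order (3 rather than the fourth-moment tensors discussed above), and all names are my illustrative choices, not the specific algorithms from the lecture.

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Find one component of an orthogonally decomposable symmetric
    3-tensor via the power map  u <- T(I, u, u) / ||T(I, u, u)||."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, u, u)   # the map u -> T(I, u, u)
        u = v / np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # component weight T(u, u, u)
    return lam, u

def tensor_decompose(T, k):
    """Recover k components by power iteration plus deflation."""
    comps = []
    for _ in range(k):
        lam, u = tensor_power_iteration(T)
        comps.append((lam, u))
        T = T - lam * np.einsum('i,j,k->ijk', u, u, u)  # subtract found component
    return comps

# Build T = sum_i lam_i * a_i^{(x)3} with orthonormal a_i, then recover them.
rng = np.random.default_rng(1)
d, k = 8, 3
A = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal columns a_1..a_k
lams = np.array([3.0, 2.0, 1.0])
T = sum(l * np.einsum('i,j,k->ijk', a, a, a) for l, a in zip(lams, A.T))
recovered = tensor_decompose(T, k)
```

Unlike SVD for matrices, this only works under the orthogonality assumption; the robust/overcomplete regimes the lecture mentions need more sophisticated algorithms.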
[00:45:59] I think all of these algorithms are robust — there are robust [00:46:04] versions of them. And typically, if you don't care about the optimal sample efficiency, then [00:46:11] they're all robust just for trivial [00:46:14] reasons. But if you really care about exactly [00:46:17] how many samples you need and how robust they are, [00:46:19] it becomes a little tricky, because you have to talk about the sample efficiency — [00:46:23] how the concentration works, and so forth. [00:46:38] [Student question] Yeah, you can roughly think of the a_i's as [00:46:40] the largest eigenvectors, yes. [00:46:52] OK, cool — sounds good. So then I'm going to move on to [00:46:58] the last [00:47:00] subtopic of this course. It's still about unsupervised [00:47:04] learning, but it's a slightly [00:47:05] different type of unsupervised learning, [00:47:07] which is more like clustering. And [00:47:10] you can see that we are still doing [00:47:12] spectral methods, we're still doing
[00:47:14] some kind of spectral decomposition, but [00:47:16] it's decomposing in a [00:47:19] slightly different way — I guess you will see [00:47:24] once I formulate the problem. You can see that before, when you did the [00:47:29] tensor method, you were building pairwise information — [00:47:33] or three-wise information — between the [00:47:37] coordinates of the data. From now on I'm going to talk about a [00:47:42] different type of approach, where you [00:47:43] build pairwise information between the [00:47:46] data points, [00:47:47] and then you do something on top of that. [00:47:51] So, I'll specify this more [00:47:54] clearly: this is spectral [00:47:59] clustering. [00:48:02] I'm going to discuss a bunch [00:48:04] of different [00:48:05] algorithms and setups under [00:48:07] this broad framework. This whole [00:48:10] spectral clustering framework [00:48:12] I think was proposed by [00:48:14] Shi and Malik
around 2000 I think also [00:48:23] are out around 2000 I think also uh wait [00:48:25] uh wait Andrew in Mike Jordan [00:48:29] Andrew in Mike Jordan and wife [00:48:32] in 2001. [00:48:33] in 2001. maybe maybe this is 2016 I I will have [00:48:36] maybe maybe this is 2016 I I will have the references in the election notes so [00:48:40] the references in the election notes so it has been like 20 years old [00:48:42] it has been like 20 years old um so so I'm going to kind of discuss [00:48:44] um so so I'm going to kind of discuss you know a bunch of like a classical [00:48:46] you know a bunch of like a classical things like about this and also next [00:48:48] things like about this and also next lecture I'm going to talk about one of [00:48:50] lecture I'm going to talk about one of my own work which kind of is built on [00:48:53] my own work which kind of is built on top of this to get into a deep learning [00:48:55] top of this to get into a deep learning case so to extend it to the people in [00:48:58] case so to extend it to the people in case [00:48:59] case um [00:49:00] um so the general idea is that suppose you [00:49:02] so the general idea is that suppose you have [00:49:02] have um [00:49:03] um so we are given n data points [00:49:09] um let's call them X1 up to xn [00:49:13] um let's call them X1 up to xn and let's say we are given for the [00:49:15] and let's say we are given for the moment let's say we are given a [00:49:16] moment let's say we are given a similarity Matrix [00:49:18] similarity Matrix and don't ask me how to get this just [00:49:21] and don't ask me how to get this just let's just assume that we have a [00:49:22] let's just assume that we have a similarity Matrix G [00:49:24] similarity Matrix G which is [00:49:26] which is of Dimension n by n you know actually [00:49:28] of Dimension n by n you know actually it's going to be a problem you know to [00:49:29] it's going to be a problem you know to construct the similarity Matrix 
to some [00:49:33] extent, but for the moment let's say we [00:49:34] have it — in some cases we do have [00:49:40] such a similarity matrix. Each entry of this matrix [00:49:45] captures the similarity [00:49:48] between two data points x_i and x_j. [00:49:59] You can interpret this as a similarity, or [00:50:03] just generally as some matrix that captures [00:50:05] some relationship between data points — [00:50:09] I think it's reasonable to think of [00:50:11] the entries as similarities, [00:50:14] and the larger, the more similar. [00:50:15] But this is not that important. [00:50:21] So you can see that this is what [00:50:23] I call pairwise information between [00:50:25] data points, not pairwise information [00:50:27] between the coordinates. [00:50:31] Actually, in certain cases they are kind [00:50:34] of the same, but in some other cases they're not [00:50:38] the same.
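As a concrete illustration of this contrast (my numerical sketch, not from the lecture): for a data matrix X with n rows (data points) and d columns (coordinates), pairwise information between data points lives in the n × n Gram matrix X Xᵀ, while pairwise information between coordinates lives in the d × d matrix Xᵀ X.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))   # n data points, each with d coordinates

G_points = X @ X.T   # n x n: inner products between pairs of *data points*
C_coords = X.T @ X   # d x d: statistics between pairs of *coordinates*

# The two matrices share their nonzero eigenvalues, which is one sense
# in which the two views are "kind of the same" in certain cases.
ev_points = np.sort(np.linalg.eigvalsh(G_points))[-d:]
ev_coords = np.sort(np.linalg.eigvalsh(C_coords))
```

Of course, a similarity matrix need not be an inner-product Gram matrix — the friendship-graph example below is a case where it is not.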
[00:50:44] So one example could be that [00:50:50] the x_i are images, and ρ(x_i, x_j) [00:50:55] measures [00:51:00] the semantic similarity of the two [00:51:02] images. [00:51:05] How you get this is a [00:51:08] little bit tricky, because [00:51:10] typically you cannot just take the [00:51:12] L2 norm to measure semantic [00:51:14] similarity — there [00:51:15] could be two images that look pretty [00:51:17] different pixel-wise but are semantically [00:51:19] similar. [00:51:20] But for the moment, let's assume [00:51:22] we're given such a [00:51:25] similarity matrix. Example two — which is probably the more [00:51:31] classical usage of this kind of model: [00:51:33] think of the x_i as users [00:51:37] of a social network, [00:51:42] and ρ(x_i, x_j) equals one [00:51:48] if they are friends, [00:51:53] say on Facebook. [00:51:56] So when they are friends, it [00:52:00] means they share some kind of [00:52:02] similarity — maybe similarity in jobs or [00:52:04] interests or some other things. So you can
[00:52:06] think of this as a similarity measure [00:52:09] between two users. [00:52:11] And eventually, in [00:52:13] this case, you want to [00:52:14] classify the users into groups. You want to say: [00:52:18] I can detect hidden [00:52:20] communities of users from this [00:52:23] unlabeled graph. [00:52:27] So basically the goal [00:52:31] is to do some kind of clustering — [00:52:46] clustering [00:52:49] the data points. [00:52:52] I guess in [00:52:54] the social network example, maybe you [00:52:56] have all of these users — let's say [00:52:59] this many users — and there's some [00:53:01] friendship relation between them, [00:53:03] something like this, maybe. [00:53:05] And then what you want to do is [00:53:07] detect some so-called hidden [00:53:09] communities. So for example, you can say [00:53:11] this is one cluster, this is another [00:53:12] cluster, and maybe this cluster [00:53:14] corresponds to people at Stanford and
[00:53:16] that cluster corresponds to people at [00:53:17] Berkeley. Of course, between [00:53:21] Stanford students you have more [00:53:23] connections, between Berkeley [00:53:25] students you have more connections, and [00:53:27] there are some connections across the [00:53:28] groups, and so forth. [00:53:31] And in this case — [00:53:35] for example, for [00:53:37] this example two — you can think of this [00:53:39] G also as a graph. [00:53:43] I think even in the general case you can [00:53:45] view G as a weighted graph, but here, in the social [00:53:51] network case, G is binary — [00:53:53] each G_ij is binary — so you can view G [00:53:56] as a graph where G_ij indicates an edge. [00:54:03] And your goal is to [00:54:05] partition. There are many ways to say [00:54:08] what your goal is: you can say [00:54:10] you are clustering the data [00:54:11] points, or you can say you are [00:54:12] partitioning the graph into different [00:54:15] parts, so that within each
part you have more connections than across [00:54:23] different parts. [00:54:25] So sometimes [00:54:28] you can view it as partitioning [00:54:31] the graph [00:54:34] into kind of like [00:54:37] components [00:54:42] that are separated [00:54:46] from each other to some extent. There's [00:54:51] no way you can decompose it into completely [00:54:54] disjoint parts, but you can somewhat [00:54:58] decompose it — [00:55:01] partition the graph into more or less disjoint parts. [00:55:04] So this is the general [00:55:06] type of setup. [00:55:07] I'm going to discuss [00:55:09] probably one or two instantiations of [00:55:12] this. [00:55:16] I guess [00:55:20] the general theme is [00:55:27] the following — and I feel [00:55:30] like this is a pretty deep [00:55:32] observation in math. [00:55:34] The general way to [00:55:36] say it is that the eigendecomposition [00:55:41] of this graph G [00:55:44] really relates a lot to the graph
[00:55:48] partitioning problem. So again: the eigendecomposition of [00:55:50] this adjacency matrix G — [00:55:52] here by G I really mean the adjacency [00:55:55] matrix — relates [00:55:58] very well to the graph [00:56:05] partitioning problem. So you'll see that in all of the [00:56:07] examples I'm going to give, the [00:56:08] main approach is to do some [00:56:10] eigendecomposition — [00:56:13] and actually sometimes it's not [00:56:14] an eigendecomposition of G itself but of some [00:56:18] transformation of G. [00:56:19] But the key point is that eigendecomposition [00:56:21] seems to relate so much to [00:56:24] partitioning and clustering, and it's not [00:56:26] that obvious, because eigendecomposition [00:56:27] is a very linear-algebra [00:56:29] thing and graph partitioning is a very [00:56:31] combinatorial thing. And this is why [00:56:33] it's kind of useful: when [00:56:35] you deal with combinatorial stuff — [00:56:40] I'm not really a combinatorics person, but [00:56:43] my way to think about it is that [00:56:45] for many combinatorial problems, once
[00:56:46] you can relate them to algebraic — linear-[00:56:49]algebraic — objects, or other kinds of [00:56:51] polynomials, then [00:56:52] you get exposed to a [00:56:55] different type of tools and you can do a [00:56:57] lot more things — sometimes a lot more [00:56:58] than you expected. [00:57:02] So this is the general theme, and we're [00:57:04] actually going to see probably two [00:57:05] or three examples of [00:57:07] why this is the case. [00:57:10] So now I'm going to do [00:57:12] something more concrete. This is [00:57:14] called the stochastic block model — [00:57:17] a very concrete setup where you [00:57:19] can do the math and [00:57:21] instantiate what I mean [00:57:25] clearly. So, the stochastic block [00:57:29] model — usually abbreviated [00:57:32] SBM. [00:57:34] So G [00:57:37] is assumed [00:57:40] to be generated [00:57:44] randomly [00:57:47] from two — [00:57:51] sometimes it could be more, but I'm [00:57:52] doing only two — hidden [00:57:57] communities, [00:57:59] or groups. [00:58:03] So the
[00:58:05] setting is something like this: you have n vertices — n users — and [00:58:10] you assume that there are two hidden [00:58:13] groups, [00:58:15] S and S-bar, [00:58:18] and this is a partition, [00:58:21] meaning S and S-bar are disjoint [00:58:24] and together cover all the vertices. And then you assume that if [00:58:28] two users are from the same hidden [00:58:30] community, then they are more likely to [00:58:32] be connected by an edge. [00:58:35] So if i and j are both from S, [00:58:40] or i and j are both from S-bar, [00:58:45] then G_ij is one [00:58:49] with probability p [00:58:52] and zero with probability 1 − p. [00:58:56] And otherwise — if i and j [00:59:03] are from [00:59:06] different communities — [00:59:07] then G_ij [00:59:09] is one with probability q and zero with [00:59:12] probability 1 − q. [00:59:15] And here, importantly, p is much larger [00:59:17] than q — [00:59:20] well, maybe not much larger for the [00:59:22] moment; how much larger will be quantified [00:59:24] in a moment, but you need p to be larger [00:59:27] than q. [00:59:28] Maybe I'll just write
larger, [00:59:32] not much larger. [00:59:33] So basically: [00:59:35] from the same hidden group you have [00:59:37] a higher chance to be connected by an [00:59:40] edge than from a different [00:59:42] hidden group. [00:59:46] If you draw this — I [00:59:50] don't know how to draw a random graph, but you can think [00:59:52] of it as: there is an S and there is an S-bar, with [00:59:55] some edges, and if p is something [00:59:58] like close to one, then you're [01:00:01] going to have something like this: [01:00:03] within a group you have a high [01:00:05] probability of connecting to each other, and [01:00:07] across the groups you have some sparse [01:00:13] edges. OK, so now the goal: [01:00:20] the goal is to recover [01:00:24] S [01:00:27] and S-bar — if you recover S, you can [01:00:31] recover S-bar — from the graph G. [01:00:33] So this is a well-defined data-generation [01:00:36] model, and you [01:00:38] basically want to discover the hidden [01:00:40] groups — you want to do the clustering.
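This generative model is easy to sample directly; here is a quick sketch (the values p = 0.7 and q = 0.1, and putting S on the first half of the vertices, are my illustrative choices).

```python
import numpy as np

def sample_sbm(n, p, q, seed=0):
    """Sample a symmetric 0/1 adjacency matrix G from the two-community
    stochastic block model; the first n//2 vertices are S, the rest S-bar."""
    rng = np.random.default_rng(seed)
    z = np.ones(n)
    z[n // 2:] = -1                                # labels: +1 on S, -1 on S-bar
    probs = np.where(np.equal.outer(z, z), p, q)   # p within, q across
    U = rng.random((n, n))
    G = (np.triu(U, 1) < np.triu(probs, 1)).astype(int)  # upper triangle only
    return G + G.T, z                              # symmetric, zero diagonal

G, z = sample_sbm(200, p=0.7, q=0.1)
```

Each unordered pair is sampled once (in the upper triangle) and then mirrored, so G is a valid undirected graph.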
[01:00:42] And our approach is going to be [01:00:45] eigendecomposition. [01:00:58] So maybe before talking about [01:01:00] eigendecomposition: for some extreme cases [01:01:02] you don't have to do an eigendecomposition at all. [01:01:06] Let's just do a more trivial warm-up. [01:01:09] Suppose p is [01:01:11] 0.5 and q is zero. Then you don't have to [01:01:15] do anything clever — [01:01:18] almost nothing at all — because you're [01:01:20] going to see two disconnected parts. [01:01:22] So if p is 0.5 and q is 0, you basically [01:01:25] have some S and some S-bar, [01:01:27] and you have some edges — not complete [01:01:30] connections, just some edges — [01:01:32] and then there are clearly two [01:01:35] subgraphs. So you can just, for example, say: I start [01:01:38] from this vertex, I look at all my neighbors, and [01:01:41] I put them all in S — because if [01:01:44] you see an edge, you know they are from [01:01:45] the same group, because if they're [01:01:47] not from the same
group, you have zero [01:01:50] chance of seeing an edge. So you basically just find all the [01:01:54] points you can reach from a single point, and you declare that [01:01:58] to be S. [01:02:01] And you can do the same thing for the [01:02:03] other side. [01:02:07] Does that make sense? I saw some confusion — [01:02:09] basically, the [01:02:10] algorithm is the [01:02:12] following: I start with a node, and [01:02:15] I see which nodes this node can reach; [01:02:16] I put them into my stack, and then I [01:02:18] repeat, to see what other [01:02:20] nodes I can reach, [01:02:24] until at some point I reach a closure — I cannot [01:02:26] reach any new nodes — and then I declare this set to [01:02:31] be S. And then the rest has to be S-bar.
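The reachability procedure just described can be sketched as a plain BFS closure (my sketch; the queue detail and the toy two-triangle graph are illustrative):

```python
import numpy as np
from collections import deque

def reachable_set(G, start=0):
    """All vertices reachable from `start` in the graph with adjacency
    matrix G. When q = 0 there are no cross-community edges, so this set
    can only contain vertices from start's own community."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(G[u]):     # neighbors of u
            if int(v) not in seen:
                seen.add(int(v))
                queue.append(int(v))
    return seen

# Toy check: two disjoint triangles, i.e. S = {0,1,2}, S-bar = {3,4,5}.
G = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    G[a, b] = G[b, a] = 1

S_hat = reachable_set(G, start=0)   # recovers {0, 1, 2}
```

The "no false positives" half of the argument is exactly the comment in `reachable_set`; the "finds everything" half additionally needs each community's subgraph to be connected, which is the probabilistic part.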
[01:02:38] That would work reasonably well for p = 0.5 [01:02:42] and q = 0, and that's because, [01:02:43] first of all, you don't have any false [01:02:44] positives: [01:02:48] all the nodes you discover must [01:02:49] belong to the same group. [01:02:52] And second, I think you can also [01:02:54] try to show that you find all [01:02:56] the nodes, because if somebody is in your [01:02:58] group, they should connect to someone — [01:03:03] this is the so-called small-world [01:03:05] kind of phenomenon: if [01:03:08] another user is from the same [01:03:09] group, they should be connected to you [01:03:14] by some short path. [01:03:16] Anyway — you can see that even to convince you that this [01:03:18] algorithm works for p = 0.5 [01:03:21] and q = 0 is not that trivial; you have to [01:03:24] do something, and this is a [01:03:27] combinatorial algorithm. What we are [01:03:29] going to do instead is a more
[01:03:34] linear-algebraic type of algorithm, [01:03:36] and you'll see everything becomes [01:03:38] even cleaner. This is a more [01:03:40] powerful algorithm, [01:03:41] and you don't need those combinatorial [01:03:43] reasonings. [01:03:44] So what do we do? We basically just do [01:03:49] an eigendecomposition. And as a warm-up, what we're [01:03:51] going to do is an [01:04:07] eigendecomposition [01:04:09] of [01:04:14] G-bar, which is the expectation of G. [01:04:16] So what is G-bar? G-bar is [01:04:19] the entrywise expectation of G: [01:04:26] you just take the expectation of each [01:04:27] G_ij. [01:04:30] Clearly you don't have this G-bar, [01:04:31] but just for starters, let's [01:04:34] look at this expected version. [01:04:36] So what is the expectation — [01:04:38] what is this G-bar_ij? It's going to
[01:04:41] be equal to p if i and j are from the same class, from the same community, and equal to q otherwise. So that means that if you look at this G bar, suppose these are the indices for S and these are the indices for S bar: when you have both i and j from S you get p, and here you get q. So this is G bar. And my claim is that, in this case, supposing you have access to G bar, the top eigenvector of G bar is the all-ones vector, and second, the second eigenvector of G bar is interesting: it is this vector u where you have ones on the coordinates in S and minus ones on the coordinates in S bar. So basically if you get the second eigenvector of G bar, you solve the problem, because you can read off the community membership from this eigenvector. That's the trick.
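As a quick sanity check, this claim is easy to verify numerically. The following is my own sketch, not lecture code, with illustrative parameters n = 6, p = 0.8, q = 0.2:

```python
import numpy as np

# Hedged sketch (not from the lecture): build the expected adjacency matrix
# G_bar for a planted two-community model and check the claimed eigenvectors.
n, p, q = 6, 0.8, 0.2
u = np.array([1, 1, 1, -1, -1, -1], dtype=float)  # +1 on S, -1 on S bar
G_bar = np.where(np.outer(u, u) > 0, p, q)        # p within, q across

vals, vecs = np.linalg.eigh(G_bar)                # eigenvalues ascending
top, second = vecs[:, -1], vecs[:, -2]
ones = np.ones(n) / np.sqrt(n)

# Top eigenvector is the all-ones direction; the second is u (up to sign).
print(abs(top @ ones))                  # 1.0 up to rounding
print(abs(second @ (u / np.sqrt(n))))   # 1.0 up to rounding
```

The top eigenvalue here is (p + q)/2 times n and the second is (p - q)/2 times n, matching the derivation that follows.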
[01:06:29] Okay, so it sounds a bit interesting, right? What's the intuition here? I guess the intuition probably comes from the proof, so let's first do number one. Number one is almost always true, for many cases; it doesn't even have to be such a special G bar. So what you do is you just compute G bar times the all-ones vector. And what is G bar times the all-ones vector? Basically you multiply G bar with the all-ones vector, and you are just looking at the sum of the entries in each of the rows. So what's the sum of the entries of each of those rows? The sum of the first row is p times n over 2 plus q times n over 2, because there are n over 2 entries with value p and n over 2 entries with value q. And every row has the same sum. So this is
[01:07:29] equal to, simplifying, (p + q)/2 times n, times the all-ones vector. So you can kind of see why the all-ones vector is the top eigenvector. Actually, this holds even more generally: for any matrix with fixed row sum, or for any graph with so-called uniform degree. Right, the degree of a vertex is literally the row sum of the adjacency matrix: how many edges you have connecting to each of the vertices, that's basically the row sum. So if the degree is the same over all the vertices, that means the row sums of the adjacency matrix are constant, and that means the all-ones vector is the top eigenvector. So this is a very interesting fact: the top eigenvector doesn't really tell you much. You have to go to the second eigenvector to see the interesting structure.
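The row-sum fact is easy to check numerically. This is my own sketch with a small arbitrary matrix, not lecture code:

```python
import numpy as np

# If every row of a square matrix sums to the same constant c, then the
# all-ones vector is an eigenvector with eigenvalue c.
A = np.array([[2., 1., 3.],
              [3., 2., 1.],
              [1., 3., 2.]])            # every row sums to 6
ones = np.ones(3)
print(np.allclose(A @ ones, 6 * ones))  # True: A @ 1 = 6 * 1
```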
[01:08:52] So now let's look at the second eigenvector. So, there are many ways to verify that this vector, let's call it u, is an eigenvector. You can directly multiply it out and see what the eigenvalue is. I think probably the most intuitive way to think about it is the following: let's look at G bar, and let's subtract from G bar a background term. Oh, sorry, I didn't assume that this is an equal partition; I should assume that. So let's also assume that the size of S is n over 2 and the size of S bar is n over 2. If they are not equal, I think you have to do a little bit of other work to deal with it, not super important, but if S and S bar are not exactly the same size, I think the all-ones vector
[01:10:14] is not an eigenvector anymore, so you have to massage this matrix a little bit to make it still true. But we'll get to that in a moment, in the next lecture I guess. So far, okay, let's assume S and S bar are balanced. And now, how do we see that the second eigenvector is this vector u that we're looking for? My way to think about it is that you subtract from G bar this background matrix, one one-transpose times q. Every entry of this matrix one one-transpose times q is really just q, so it's a matrix with all entries being q. And then what's left is this matrix: let's say r is equal to p minus q, you get r on the two diagonal blocks, something like this. So this is S and this is S bar, and here we have
[01:11:32] zeros off the diagonal blocks. Okay, so now you can see that this matrix becomes nice, because it's a block diagonal matrix. So for this matrix, maybe let's call it G prime, we can verify that G prime times u is equal to a multiple of u. And how do we verify this? This is just because you can do the computation for the two blocks separately, right? So this is really just something like an r-block times the all-ones part, and an r-block times the minus-ones part. So when you do these two things separately, you basically get r times n over 2 for the first n over 2 coordinates, and you get minus r times n over 2 for the second set of coordinates. So this is r times n over 2 times u itself. And also, u is orthogonal to the all-ones vector, just because half of its entries are positive and half are negative, so when you take the inner product it becomes zero.
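This block computation can be checked directly. The following is my own numpy sketch with illustrative parameters n = 8, p = 0.7, q = 0.3:

```python
import numpy as np

# Subtract the background q*11^T from G_bar; the remainder G' is block
# diagonal with entries r = p - q on the blocks, G' u = (r*n/2) * u,
# and u is orthogonal to the all-ones vector.
n, p, q = 8, 0.7, 0.3
r = p - q
u = np.array([1] * 4 + [-1] * 4, dtype=float)
G_bar = np.where(np.outer(u, u) > 0, p, q)

G_prime = G_bar - q * np.ones((n, n))             # background removed
print(np.allclose(G_prime @ u, (r * n / 2) * u))  # True
print(abs(u @ np.ones(n)))                        # 0.0: u orthogonal to 1
# Since the background is orthogonal to u, G_bar u = G' u as well:
print(np.allclose(G_bar @ u, (p - q) / 2 * n * u))  # True
```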
[01:12:55] So that's why, if you look at G bar times u, this is equal to G prime times u, because the background you subtracted off is orthogonal to u. So that's why this is equal to r times n over 2 times u, which is (p - q)/2 times n times u. So I think the main point is that after you subtract off all of this background, this G prime is block diagonal, and this means that the eigenvectors align with the blocks. I think this is the fundamental thing that we are looking for. And maybe, just to generalize this, to make it look a little more convincing: suppose you have a matrix A which looks like this. Suppose you have some entries, ones, here in this block, and you have a lot of ones here, and a lot of ones here.
[01:14:26] Suppose you have three blocks now, not two blocks. Then, because you have block diagonal structure, you know that for every block you can do your own thing, right? So then, if you look at the eigenvectors, you can see that each of these three indicator vectors is an eigenvector, because you can handle each of the blocks in a separate way. So basically you can say: I'm going to choose the all-ones vector for the first block, and then zero in all the other places; that's still an eigenvector. Okay, so when you have these three eigenvectors, then look at the rows. If you look at every row here, right, so this entry is zero, this is zero, this is one. So each row gives you the cluster ID of the vertex.
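The three-block picture can be written out concretely. This is my own illustration with block sizes 3, 2, 4, not lecture code:

```python
import numpy as np

# A is block diagonal with all-ones blocks; each block's indicator vector
# is an eigenvector (eigenvalue = block size), and stacking the indicators
# as columns, row i is one-hot on the cluster of vertex i.
sizes = [3, 2, 4]
n = sum(sizes)
A = np.zeros((n, n))
indicators = []
start = 0
for s in sizes:
    A[start:start + s, start:start + s] = 1.0   # all-ones block
    v = np.zeros(n)
    v[start:start + s] = 1.0                    # indicator of this block
    indicators.append(v)
    start += s

for v, s in zip(indicators, sizes):
    print(np.allclose(A @ v, s * v))            # True: v is an eigenvector

V = np.stack(indicators, axis=1)                # columns are eigenvectors
print(V[4])                                     # row of vertex 4: its cluster
```

Vertex 4 sits in the second block, so its row is one-hot on the second column.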
[01:16:09] So I think this is the fundamental intuition about why eigenvectors are useful for capturing the cluster structure in the graph. It's just because, in the extreme case, when you have extreme clustering, where every block, every subset, like these three subsets, has really strong inner connections and no cross-group connections, in that case the eigenvectors strongly align with the block structure. And here, what makes things a little bit more complicated is that you have some background, right? You have some more things here and here, some random entries, small entries in other places. Then it will change
[01:17:07] the matrix: it will elevate the entire matrix a little bit, but it won't change the eigenspace fundamentally. That's pretty much the intuition. And then, questions? [student question] Exactly why this is working, right? So the question is: what if you permute this? If you permute it, it's kind of like your eigenvectors will permute accordingly. So for example, suppose you decide this part and this part will be the first block, and then this part and this part will be the second block. I think your eigenvectors, like the coordinates of the eigenvectors, will permute accordingly, and that's why the alignment is maintained and you can still discover the hidden structure.
[01:18:53] Okay, sounds good. So I guess maybe another thing; I'm not sure whether this is a confusion for you, it could be or not. So here, the eigenvectors have no negative values in this construction, right? But the reason I didn't have negative values is just because it makes things simpler. For example, even this vector, the sum of two of these, is also an eigenvector, because it's the sum of two eigenvectors, and all of these eigenvectors have the same eigenvalue, so any linear combination of eigenvectors is still an eigenvector. That's how you get the negative values. And there's something special about the all-ones vector, okay? Because here, in this A example, there is nothing special about the all-ones vector, because there's no
[01:19:49] background noise. But what happens is that when you add a kind of background noise to it, then the all-ones direction stands out. So here you have three eigenvectors that have equal eigenvalues, and the all-ones vector is in the subspace of these three eigenvectors, right? The all-ones vector is indeed a linear combination of these three things. And when you add the background noise, the all-ones direction stands out, and then you are left with two other directions, which still have the same eigenvalue, and those two other directions will tell you the block structure. So maybe another way to think about this: suppose you have two blocks. If you don't have any background noise, then the eigenvectors will be (1, 1, 0, 0) and (0, 0, 1, 1). This is a two-dimensional eigenspace, but when you add the background, you can represent this eigenspace
[01:20:46] in two different ways, right? This is one eigensystem, and you could also write it like this, just because you have different ways to represent a two-dimensional eigenspace of a matrix. But when you add the background noise, then this one will stand out, so you can only use this system to see it, not the other one. I'm not sure whether that makes sense? No? So basically, without adding background noise, you have this direction, which is (1, 1, 0, 0), and there's this direction, (0, 0, 1, 1). And you can also have this direction, which is the all-ones vector, and this direction, which is (1, 1, -1, -1). So you have these two different eigenbases.
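The degeneracy-breaking being described here can be seen numerically. This is my own sketch with illustrative parameters, not lecture code:

```python
import numpy as np

# With two equal blocks and no background, the top eigenvalue is repeated,
# so the block indicators and {all-ones, u} span the same 2-D top
# eigenspace. Adding the background q*11^T breaks the tie: the all-ones
# direction becomes the unique top eigenvector, and u the second.
n, p, q = 6, 0.8, 0.2
u = np.array([1, 1, 1, -1, -1, -1], dtype=float)

A = np.where(np.outer(u, u) > 0, p - q, 0.0)   # two blocks, no background
vals = np.linalg.eigvalsh(A)
print(np.isclose(vals[-1], vals[-2]))          # True: repeated top eigenvalue

B = A + q * np.ones((n, n))                    # background added (B = G_bar)
vals_b, vecs_b = np.linalg.eigh(B)
print(np.isclose(vals_b[-1], vals_b[-2]))      # False: degeneracy broken
ones = np.ones(n) / np.sqrt(n)
print(abs(vecs_b[:, -1] @ ones))               # 1.0: top is all-ones direction
```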
[01:21:47] And when you add background noise, you're going to elevate, or increase, the strength in this direction, but the subspace doesn't really change. This direction becomes the top eigenvector and this becomes the second eigenvector, but fundamentally nothing really changed that much. I hope this clarifies it. Okay, so I guess I'm running out of time. Let's see. So I'll take two minutes to wrap up and give a quick overview of what we do next. So basically, you can actually, if you really want, verify that G bar is equal to (p + q)/2 times one one-transpose, plus (p - q)/2 times u u-transpose. This is the eigendecomposition of this matrix. Okay, so next: what happens is that, in reality, we only have access to G.
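The eigendecomposition of G bar stated above can be sanity-checked entry by entry. This is my own numpy sketch with illustrative parameters:

```python
import numpy as np

# Check G_bar = (p+q)/2 * 1 1^T + (p-q)/2 * u u^T entry by entry:
# same-community entries give (p+q)/2 + (p-q)/2 = p, cross entries give q.
n, p, q = 10, 0.6, 0.1
u = np.array([1] * 5 + [-1] * 5, dtype=float)
ones = np.ones(n)
G_bar = np.where(np.outer(u, u) > 0, p, q)

reconstruction = ((p + q) / 2 * np.outer(ones, ones)
                  + (p - q) / 2 * np.outer(u, u))
print(np.allclose(G_bar, reconstruction))   # True
```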
[01:23:20] So what do we do? What we do is we just say: the intuition is that G is approximately equal to the expectation of G, in certain respects. It's not true that every entry of G is close to the corresponding entry of the expected G, right? Because if you take one entry, G is binary, and the expectation of G is p or q; there's no way they are close. But this holds in terms of the spectrum. So essentially you want to show, even though we need a little trick to make this work nicely, essentially you want to show that the operator norm of the difference between these two is small. Then, if you use G to do the decomposition, this means that decomposing G is similar to decomposing the expectation. That's pretty much it, and now you can see that the concentration inequalities that we discussed in the earlier lectures of this course become useful. So concretely,
[01:24:30] what you do is the following. You write your G as G minus expectation of G, plus expectation of G; subtracting and adding the expectation of G, which is just G bar. So this is G minus expectation of G, plus (p + q)/2 times one one-transpose, plus (p - q)/2 times, sorry, u u-transpose. Right, so you want to say that this first part doesn't matter too much; it doesn't really change your eigenspectrum. To make it cleaner, what you can do is subtract this one one-transpose part, because u is something you want to discover, while the top eigenvector is something you already know. So we probably shouldn't aim at the top eigenvector; you should just directly look for the second eigenvector. What you do is you move this to the left-hand side, so you get this matrix; you look at this matrix, and this is something you know, because G is something you know. And then this matrix is
[01:25:44] equal to this noise term plus (p - q)/2 times u u-transpose. So you can view this as a perturbation, right, and this part is what you are really looking for. So basically you start from the left-hand side: you take an eigendecomposition of G minus (p + q)/2 times one one-transpose, and you hope that the top eigenvector of this matrix is close to u. That's basically our goal. And how do you make sure this is true? It suffices to show that this G minus E[G], in terms of the spectral norm, is much, much smaller; the noise is much smaller than the signal, in terms of the operator norm. So in some sense you need some robustness of the eigendecomposition.
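The plan just described can be run end to end on a sampled graph. This is my own illustration; the parameters and variable names are not from the lecture:

```python
import numpy as np

# Sample G from the two-community model, subtract the known background
# (p+q)/2 * 11^T, and read the communities off the signs of the top
# eigenvector of what remains.
rng = np.random.default_rng(0)
n, p, q = 400, 0.7, 0.3
u_true = np.array([1] * (n // 2) + [-1] * (n // 2))

probs = np.where(np.outer(u_true, u_true) > 0, p, q)
upper = rng.random((n, n)) < probs
G = np.triu(upper, 1).astype(float)
G = G + G.T                                  # symmetric, zero diagonal

M = G - (p + q) / 2 * np.ones((n, n))        # background removed
vals, vecs = np.linalg.eigh(M)
u_hat = np.sign(vecs[:, -1])                 # candidate community labels

# Accuracy up to the global sign ambiguity of an eigenvector.
acc = max(np.mean(u_hat == u_true), np.mean(u_hat == -u_true))
print(acc > 0.9)                             # near-exact recovery here
```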
eigenvectors of the sum of these two [01:27:00] eigenvectors of the sum of these two matrices is very similar to the [01:27:02] matrices is very similar to the eigenvector of one of these Matrix [01:27:05] eigenvector of one of these Matrix um and this is called Davis here I guess [01:27:08] um and this is called Davis here I guess now I wouldn't have time to talk about [01:27:10] now I wouldn't have time to talk about all of this but but this intuitive [01:27:12] all of this but but this intuitive results makes sense right so if you know [01:27:13] results makes sense right so if you know it is small enough in terms of the [01:27:15] it is small enough in terms of the spectrum [01:27:16] spectrum and the option Norm then you get the [01:27:18] and the option Norm then you get the signal [01:27:19] signal um and so and how do you get this I [01:27:22] um and so and how do you get this I think we're gonna how do you so this is [01:27:24] think we're gonna how do you so this is true I think I'm gonna uh this I'm gonna [01:27:27] true I think I'm gonna uh this I'm gonna discuss that [01:27:28] discuss that in the beginning of the next lecture [01:27:29] in the beginning of the next lecture it's essentially you just have to prove [01:27:31] it's essentially you just have to prove some concentration according to using [01:27:33] some concentration according to using some of the tools we had [01:27:36] some of the tools we had um in lecture three or four of this [01:27:38] um in lecture three or four of this course [01:27:40] okay have any questions [01:28:00] so if you have more clusters the noise [01:28:03] so if you have more clusters the noise will hurt the entire Spectrum right so [01:28:07] will hurt the entire Spectrum right so um and it becomes a little more [01:28:08] um and it becomes a little more complicated so first of all if you have [01:28:09] complicated so first of all if you have no noise then you can still prove that [01:28:11] no noise then you can 
still prove that the eigenvectors are enough for you to recover the blocks. [01:28:15] But this robustness theorem becomes a little bit trickier, because now you have more eigenvectors, your noise has an influence on each of them, and you again have to control some noise-to-signal ratio using slightly more advanced techniques. [01:28:31] Essentially, the mathematical part gets a little more complicated, but fundamentally you are still doing the same thing. [01:28:51] [Music]
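For reference, the lecturer names the Davis–Kahan theorem but does not state it. A standard informal form, specialized to this rank-one-signal setting (my paraphrase of the textbook statement, not from the lecture), reads:

```latex
% Informal Davis--Kahan / sin-theta bound for the SBM setting.
% Signal: (p-q)/2 * u u^T, whose eigengap is (p-q)/2 * n; noise: N = G - E[G].
\[
\hat u \;=\; \text{top unit eigenvector of } \tfrac{p-q}{2}\,uu^\top + N
\quad\Longrightarrow\quad
\min_{s\in\{\pm1\}} \Big\| s\,\hat u - \tfrac{u}{\|u\|_2} \Big\|_2
\;\lesssim\; \frac{\|N\|_{\mathrm{op}}}{\tfrac{p-q}{2}\,n}\,.
\]
```

So once the noise is a vanishing fraction of the eigengap (p − q)n/2, the top eigenvector is a good estimate of u up to sign, which is exactly the "noise much smaller than signal" condition above.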
[01:29:25] [A student asks whether we really need ‖G − E[G]‖ to be small in all directions, or only that it doesn't mess up the direction of u.] That's a great question. I think you do have to say, to some extent, that G − E[G] is small in all directions, because if G − E[G] is very, very big in one direction — say, even a direction completely orthogonal to u — then that direction will become the new top eigenvector; [01:30:04] remember, you are taking the max over directions. Exactly how you measure this, there is still some room to negotiate, but you do have to say something about the noise in all directions. [01:30:19] Okay, thanks — I'll see you on Monday or Wednesday.

================================================================================ LECTURE 019 ================================================================================
Stanford CS229M - Lecture 20: Spectral clustering
Source: https://www.youtube.com/watch?v=UYBRLG64oSQ
---
Transcript

[00:00:05] Okay, I guess let's get started. This is the last lecture of this course. [00:00:11] We're going to continue with the spectral approach to clustering. [00:00:19] I pre-wrote some review of the
last lecture. [00:00:26] So last lecture we did the stochastic block model, and one of the main findings concerned what happens if you do an eigendecomposition. Our goal was to do eigendecomposition on the graph G drawn from the stochastic block model, and we showed that if you do the eigendecomposition on the average graph — the expectation E[G] — it just gives you the hidden communities S and S̄. [00:00:56] Last time we showed that the second eigenvector is a vector called u, which looks like (1, …, 1, −1, …, −1), where the ones are indexed by S and the minus ones by S̄. So if you just take the second eigenvector of the expected graph E[G], you get the hidden community. [00:01:15] And we argued that the key question is to show that the graph G and the expected graph E[G] are close enough in the operator norm. [00:01:26] This is because if you consider this equation — you subtract the first eigencomponent from G — then what you get is that G minus the first eigencomponent equals the perturbation matrix plus the contribution of the second eigencomponent:

G − ((p + q)/2) · 11ᵀ = (G − E[G]) + ((p − q)/2) · uuᵀ

and if you take the eigendecomposition of this matrix — which is something you can compute easily — then you are getting the top eigenvector of the left-hand side of this equation, [00:02:01] and you expect to find something close to u, as long as G − E[G] is small. But how small does it need to be? Essentially you need this perturbation to be much smaller than the signal: the perturbation in operator norm must be much smaller than the signal in operator norm. And you can compute the operator norm of the signal easily: it is something like (p − q)/2 · n, since ‖((p − q)/2) · uuᵀ‖_op = (p − q)/2 · ‖u‖₂² = (p − q)/2 · n.
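To make this review concrete, here is a minimal numerical sketch (my own illustration, not from the lecture; the community sizes, p, q, and the seed are arbitrary choices): it samples a two-community stochastic block model, subtracts (p + q)/2 · 11ᵀ, and checks that the sign pattern of the top eigenvector recovers the planted communities.

```python
import numpy as np

def sbm(n, p, q, rng):
    """Sample the adjacency matrix of a 2-community stochastic block model.
    The first n//2 vertices form S (label +1), the rest form S-bar (label -1)."""
    u = np.concatenate([np.ones(n // 2), -np.ones(n - n // 2)])
    prob = np.where(np.outer(u, u) > 0, p, q)         # p within a community, q across
    upper = (rng.random((n, n)) < prob).astype(float)  # independent Bernoulli draws
    G = np.triu(upper, 1)                              # keep the strict upper triangle
    return G + G.T, u                                  # symmetric, no self-loops

rng = np.random.default_rng(0)
n, p, q = 400, 0.6, 0.2
G, u = sbm(n, p, q, rng)

# The top eigenvector of G - (p+q)/2 * 11^T should align with u up to sign.
M = G - (p + q) / 2 * np.ones((n, n))
vals, vecs = np.linalg.eigh(M)
v = vecs[:, np.argmax(np.abs(vals))]                   # eigenvector of the largest |eigenvalue|
labels = np.sign(v)
agreement = max(np.mean(labels == u), np.mean(labels == -u))
print(agreement)                                       # should be close to 1.0
```

With this separation (p − q = 0.4 and n = 400), the signal (p − q)n/2 dwarfs the noise and the sign pattern recovers essentially every vertex.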
[00:02:31] So basically we are trying to show concentration — this is a concentration inequality — because you are trying to prove that G concentrates around the expectation E[G] in the spectral-norm sense. [00:02:43] I'd like to show you the proof; it is a somewhat technical proof, but it's not very long, and it relates back to what we discussed in lecture three or four, where, as you probably remember, I said that concentration inequalities are probably among the most important things in this course — if you had to pick one technical tool in statistical machine learning, it would probably be concentration inequalities, in my opinion. [00:03:09] So it's probably useful to review why a concentration inequality can help us do something like this. [00:03:15] So, yeah, I'll give
a proof of this. [00:03:20] The proof looks like the following. [00:03:30] Our lemma is that, with high probability,

‖G − E[G]‖_op ≤ O(√(n log n)),

up to a constant factor. [00:03:49] At first sight this is not exactly the type of concentration inequality we have talked about before, because before we were talking about scalars: we said that if you have some random variable and some empirical samples of it, the empirical average concentrates around the population average. [00:04:13] Here it's a little bit different, because G is a matrix and E[G] is also a matrix; we are doing some kind of matrix concentration, and the measure of similarity is not just the absolute value of a difference of scalars, but something like the operator norm of the difference of the matrices. [00:04:41] However, you can turn this into something we are familiar with very easily. What you do is the following: you say that this is still uniform convergence, as you will see in a moment. [00:05:03] And why is this the case? Because you can interpret the operator norm as follows:

‖G − E[G]‖_op = max over ‖v‖₂ = 1 of |vᵀ (G − E[G]) v|.

[00:05:25] This is just because of the definition of the operator norm for a symmetric matrix: if you have a symmetric matrix A, then (with an absolute value here)

‖A‖_op = max over ‖v‖₂ = 1 of |vᵀ A v|,

i.e., the operator norm of A is exactly equal to the maximum quadratic form you can achieve by hitting it with a norm-one vector.
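As a quick sanity check on that identity (my own illustration, not from the lecture): for a symmetric matrix, the largest absolute eigenvalue equals the maximum of |vᵀAv| over unit vectors — random unit vectors never exceed it, and an eigenvector of the largest |eigenvalue| attains it.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                  # make the matrix symmetric

vals, vecs = np.linalg.eigh(A)
op_norm = np.max(np.abs(vals))                     # ||A||_op for symmetric A

# Random unit vectors give quadratic forms that never exceed the operator norm.
best_random = 0.0
for _ in range(2000):
    v = rng.standard_normal(50)
    v /= np.linalg.norm(v)
    best_random = max(best_random, abs(v @ A @ v))

# An eigenvector of the largest |eigenvalue| attains the operator norm exactly.
v_star = vecs[:, np.argmax(np.abs(vals))]
attained = abs(v_star @ A @ v_star)

print(op_norm, best_random, attained)
```

The random search only lower-bounds the max, while the eigenvector attains it up to floating-point error.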
[00:05:57] Once you do this, you see that the quantity becomes a scalar, and you can decompose the max: you get vᵀGv − vᵀE[G]v, and this is a sum. [00:06:20] Let me write it down more explicitly:

max over ‖v‖₂ = 1 of | Σ_{i,j} v_i v_j G_ij − E[ Σ_{i,j} v_i v_j G_ij ] |,

where both i and j range over the vertices. [00:06:51] Now, for a fixed v, this becomes a sum of independent random variables minus the expectation of that sum of independent random variables. [00:07:07] So if you didn't have the max, you could use a concentration inequality — this is exactly what Hoeffding's inequality is for. And how do you deal with the max? That will be the part about uniform convergence. Recall that the whole point of uniform convergence is that if you fix the parameter — suppose you think of v as the parameter — you can use Hoeffding's inequality to prove concentration, i.e., that the empirical quantity is not far away from the population one; [00:07:42] and the challenge of uniform convergence is about how you take the max. Here you still have a max. [00:07:50] So there are multiple ways to deal with this. The easiest way is probably to just invoke some existing theorem — there are such theorems in the literature as well. [00:08:00] But if you want to do it yourself, I guess there are two ways. One way is to use the Rademacher complexity machinery.
We discussed this a while back — probably five or ten weeks ago. [00:08:24] And I think one of the techniques there is symmetrization: so far this is not in a symmetrized form, but you introduce some Rademacher random variables, you symmetrize, and then you can essentially view this as the Rademacher complexity of some function class. You can go that route — I think it's actually pretty clean and nice. I'm going to leave it out; if you're interested you can do it yourself, and I believe it's not very difficult. [00:08:52] What I'm going to show is an even more brute-force method, which actually uses the first technique we introduced in this class: discretization. Recall that before we talked about Rademacher complexity, we said that in many cases you can deal with uniform convergence for a continuous function class with a very simple discretization. [00:09:24] So what we do here is the following. For a fixed v with ‖v‖₂ = 1, we can use Hoeffding's inequality. [00:09:49] What you get is that with probability at most exp(−ε²/2) — I'm not expecting you to check the constant on the fly, but you can basically plug into Hoeffding's inequality without any modification — the sum Σ_{i,j} v_i v_j G_ij deviates from its expectation by more than ε. [00:10:27] So the probability that it deviates from its expectation is at most exp(−ε²/2), and then you can choose ε to be something like Õ(√(n log n)), so that the
failure probability — that is, exp(−ε²/2) — is something like exp(−Ω̃(n log n)). [00:10:47] This is a pretty small failure probability. [00:10:52] And then you take a discretization of the unit ball with granularity something like inverse polynomial in n. [00:11:20] This is what we did — a long time ago, I know; I think this was in lecture three. You take a very, very fine discretization — a very small granularity — but it doesn't really matter, because at the end of the day the dependency on the granularity is only logarithmic, so the size of this cover is exp(O(n log n)). [00:11:51] And then you take a union bound over this exponentially large discretization.
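The step that makes this counting legitimate is the standard ε-net lemma for quadratic forms; in its usual textbook form (my paraphrase, with δ denoting the granularity, not stated in the lecture):

```latex
% Standard epsilon-net lemma for the unit sphere in R^n.
\[
|\mathcal N_\delta| \;\le\; \Big(\tfrac{3}{\delta}\Big)^{n}
\quad\text{for some }\delta\text{-net }\mathcal N_\delta\text{ of the unit sphere }S^{n-1},
\]
\[
\|M\|_{\mathrm{op}} \;\le\; \frac{1}{1-2\delta}\,
\max_{v\in\mathcal N_\delta}\big|v^\top M v\big|
\qquad (M \text{ symmetric}).
\]
```

With δ = 1/poly(n), the net has size exp(O(n log n)) and the 1/(1 − 2δ) loss is only an inverse-polynomial correction, matching the counting in the lecture.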
[00:12:16] And because your granularity is very small — only inverse polynomial — you only lose an inverse-polynomial term, and an inverse-polynomial term is smaller than any of the quantities here. So then, basically, after taking the union bound, you get that with high probability

‖G − E[G]‖_op ≤ O(ε), with ε chosen to be √(n log n).

[00:12:37] I'm skipping a lot of details, because today we don't have a lot of time to cover all the materials, so I'm being brief; but I think you've got the rough idea, and if you work it out, it wouldn't take too much time to fill in the details. [00:12:55] And I kind of like this method, too. If I were to state my preference between these methods: sometimes I like method two, because you can do it very quickly yourself and you know exactly where the dependencies come from. If you use the Rademacher complexity machinery, it will be much cleaner — you get better constants and cleaner proofs — but sometimes it's a little bit less transparent, because you have to go through the whole machinery.
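A minimal numerical check of the lemma (my own experiment, not from the lecture; p, q, the sizes, and the seed are arbitrary choices): sample SBM graphs of growing size and compare ‖G − E[G]‖_op against √(n log n).

```python
import numpy as np

def centered_sbm_norm(n, p, q, rng):
    """Sample a 2-community SBM adjacency G and return ||G - E[G]||_op."""
    u = np.concatenate([np.ones(n // 2), -np.ones(n - n // 2)])
    prob = np.where(np.outer(u, u) > 0, p, q)      # p within, q across communities
    upper = (rng.random((n, n)) < prob).astype(float)
    G = np.triu(upper, 1)
    G = G + G.T                                     # symmetric, no self-loops
    EG = np.triu(prob, 1)
    EG = EG + EG.T                                  # expectation, also zero diagonal
    diff = G - EG
    return np.max(np.abs(np.linalg.eigvalsh(diff)))  # spectral norm of the noise

rng = np.random.default_rng(2)
ratios = []
for n in [100, 200, 400, 800]:
    norm = centered_sbm_norm(n, 0.6, 0.2, rng)
    ratios.append(norm / np.sqrt(n * np.log(n)))
print(ratios)   # the ratios stay bounded as n grows
```

In fact the noise norm is known to be on the order of √n (the √(log n) factor in the lemma is slack from the crude union bound), so these ratios slowly shrink.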
[00:13:21] And why is this useful? This is useful because now we've got this lemma: G and E[G] differ only by Õ(√(n log n)) in operator norm, and you can compare that with the signal. [00:13:38] So compare the noise level, which is Õ(√(n log n)), with the signal level, which is (p − q)/2 · n. [00:13:54] This means that if p − q is much bigger than roughly √(log n)/√n, then we can recover the signal — the vector u — approximately. [00:14:11] So we can see that you only need p and q to have some separation, but not a lot of separation, and the required separation depends on the size of the graph, which also makes some sense: the more nodes you observe, the clearer the structure is.
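Spelling out that comparison in symbols (the same two quantities as on the board, with constants suppressed):

```latex
\[
\underbrace{\|G-\mathbb{E}[G]\|_{\mathrm{op}}}_{\text{noise}\;=\;\widetilde O(\sqrt{n\log n})}
\;\ll\;
\underbrace{\tfrac{p-q}{2}\,n}_{\text{signal}}
\quad\Longleftrightarrow\quad
p-q \;\gg\; \frac{\sqrt{\log n}}{\sqrt n}\,,
\]
```

so the required separation between p and q indeed shrinks as the graph grows.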
[00:14:26] In some sense, suppose you see just two users: everything looks random, and it's hard to tell which user is from which community. But if you see a million users, you can use the many different users to cross-validate, in some sense, and extract the two communities. [00:14:43] All right, so I guess this concludes the stochastic block model part. [00:14:52] There are some other small remarks, which are not super important. You can actually improve this and recover the exact community [00:15:10] by some post-processing. What I showed here is that you can only recover the vector u approximately, but you can post-process to get the exact community under certain conditions — I think under conditions like the one written here, you can do it. [00:15:28] And actually, because this is a very precise mathematical structure, there are a lot of works in the literature on it, and you can even get the exact constants. Here I wrote p − q ≫ √(log n)/√n, which is definitely very loose; you can get the precise dependencies needed for recovery, and then there are precise thresholds: below one threshold you can't recover anything, above it you can recover something, and above another threshold you can recover exactly. [00:16:00] All of this is in the literature, if you're interested. [00:16:07] And you can extend this to multiple blocks and so forth. Okay, so this concludes the stochastic block model, and now I'm going to move on to another — in my opinion pretty important — literature, which is about clustering a worst-case graph. [00:16:23] And the thing is still that if you do
[00:16:26] an eigendecomposition, you are going to [00:16:27] recover some approximate [00:16:30] structures in the graph, [00:16:34] but the analysis will be different, [00:16:36] because here we don't have the stochasticity [00:16:38] coming from the graph. [00:16:40] So, [00:16:42] and because you have a worst-case [00:16:44] graph, you also have to somehow define [00:16:46] what you mean by the hidden community, [00:16:47] right? Because before, in a stochastic [00:16:50] graph, you start with communities, you [00:16:51] generate the graph, and then you try to understand [00:16:53] the graph. Now the graph is just some edges; [00:16:55] you have to say what you're trying to [00:16:57] recover. [00:16:58] So let's start with that: what [00:17:00] what's our goal? This requires [00:17:03] some definitions. So let's be given a [00:17:05] graph [00:17:07] G, whose vertex set is called V [00:17:10] and whose edge set is called [00:17:12] E. [00:17:16] So let's define this so-called [00:17:18] conductance. This is actually a pretty [00:17:21] important notion which shows up
in [00:17:24] many different areas of math, of [00:17:26] course in different forms. So here is [00:17:28] a particular one, just about graphs; [00:17:31] in other cases you can define [00:17:33] conductance in high-dimensional spaces as [00:17:36] well, which is essentially the same [00:17:37] definition, but it could look a [00:17:39] little bit different. [00:17:40] So, the conductance for a graph: [00:17:44] suppose you have a cut, [00:17:45] a subset S of V, right? You cut [00:17:48] the graph into two parts, S and its complement [00:17:50] S̄, and the conductance of S is defined [00:17:53] to be the following. You take [00:17:55] the number of edges between S and S̄, [00:17:59] over the volume of S: φ(S) = E(S, S̄) / vol(S). [00:18:02] Let's define both of these more [00:18:05] clearly. So E(S, S̄) — this is the total [00:18:09] number of edges [00:18:15] from S to S̄ — but this is an [00:18:19] undirected graph, so there is no direction; maybe I [00:18:22] should call it "between S and S̄" [00:18:27] to be precise. [00:18:29] And, you know, mathematically [00:18:32] this is really a sum over i in S and j in S̄ of G_ij — if I [00:18:41]
use G_ij, so this is the adjacency matrix. [00:18:45] I'm overloading the notation a little [00:18:47] bit: G [00:18:49] denotes both the graph and also [00:18:51] the adjacency matrix of the graph. [00:18:53] And the volume of S — [00:18:55] this is the total number of [00:19:00] edges [00:19:07] connecting [00:19:12] to S. Which means that [00:19:15] you look at how many edges satisfy [00:19:18] that one endpoint is in S. [00:19:21] So i needs to be in S, and j can be [00:19:24] anything, [00:19:25] and you have G_ij: vol(S) = Σ over i in S, j in V of G_ij. [00:19:27] So if you draw a graph, something like [00:19:29] this — [00:19:33] suppose you draw a graph, [00:19:36] and you define this cut. Suppose [00:19:39] this is S. Then what is E(S, S̄)? E(S, S̄) [00:19:43] will be counting [00:19:46] these two edges, [00:19:48] because they go from S to S̄. [00:19:51] And the volume of S will be counting [00:19:54] all the edges connected to S, which [00:19:56] means basically all the green edges I've drawn [00:19:58] here [00:19:59] are counted. [00:20:02] And you can see that by the [00:20:05] definition, it's true that the volume of [00:20:07] S is always at least E(S, S̄). [00:20:12]
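The two definitions just given, E(S, S̄) and vol(S), can be sketched in a few lines of numpy; this is my own illustration, and the helper names (`cut_edges`, `volume`, `conductance`) are mine, not the lecture's:

```python
import numpy as np

def cut_edges(G, S):
    """E(S, S-bar): number of edges with one endpoint in S and the other outside."""
    S = np.asarray(sorted(S))
    Sbar = np.setdiff1d(np.arange(G.shape[0]), S)
    return int(G[np.ix_(S, Sbar)].sum())

def volume(G, S):
    """vol(S): sum over i in S, j in V of G_ij, i.e. the sum of degrees over S."""
    S = np.asarray(sorted(S))
    return int(G[S, :].sum())

def conductance(G, S):
    """phi(S) = E(S, S-bar) / vol(S)."""
    return cut_edges(G, S) / volume(G, S)

# Tiny example: a 4-cycle 0-1-2-3-0, cut into S = {0, 1} versus S-bar = {2, 3}.
G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
print(conductance(G, {0, 1}))  # 2 crossing edges / volume 4 = 0.5
```

On the 4-cycle, edges 1–2 and 0–3 cross the cut, and vol({0, 1}) = 4, so φ = 0.5, consistent with the definition above.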
Okay, so what is this definition for? It [00:20:17] is trying to characterize — [00:20:20] I guess, as the word "conductance" [00:20:23] indicates — [00:20:26] how good the cut is, in some sense: [00:20:30] how separated S and S̄ are. The smaller it is, the more separated S and S̄ [00:20:33] are. But you do have to normalize [00:20:36] by the volume, right? So in some [00:20:37] sense the number of edges [00:20:39] between S and S̄ is already [00:20:42] capturing how separated S and S̄ are, but you [00:20:45] want to normalize by the volume to [00:20:47] make it more meaningful — I guess that's [00:20:48] what I will argue next. But I guess [00:20:51] before that, let me just state some [00:20:54] basic observations. The volume of S is [00:20:58] at least [00:21:00] the [00:21:03] number of edges between S and S̄ — that's trivial. So this means that the [00:21:05] conductance is always at most one, and [00:21:07] you are trying to make the conductance as [00:21:09] small as possible. [00:21:10] Another thing is that the volume of [00:21:12]
S plus the volume of S̄ [00:21:15] equals the volume of V: vol(S) + vol(S̄) = vol(V). [00:21:20] This is twice the total number of edges, right? [00:21:23] So [00:21:24] — and that means that [00:21:31] if [00:21:35] the volume of S [00:21:38] is less than the volume of V over 2, then [00:21:43] the volume of S is also less than the [00:21:46] volume of [00:21:47] S̄, [00:21:49] and this means that the conductance of S [00:21:52] is bigger than the conductance of S̄. [00:21:55] So there's something off: you should [00:21:57] have a definition that somehow doesn't [00:21:59] depend on [00:22:00] how you name S and S̄. The cut (S, S̄) is [00:22:02] symmetric, but here the conductances of S [00:22:05] and of S̄ are different, [00:22:07] right? So that's why, to kind of [00:22:09] remove this confusion about [00:22:12] the symmetry, you just insist that you're [00:22:14] always talking about [00:22:15] — so the convention is [00:22:20] that we only talk about [00:22:27] S such that [00:22:31] the — sorry, the [00:22:33] volume of S is less than [00:22:38] the volume of V over 2.
So you only take the smaller part [00:22:43] and use that to define the conductance [00:22:44] of the cut. [00:22:46] [Student question: why define the conductance by normalizing with vol(S), rather than with vol(V)?] [00:22:52] Yes — so if you [00:22:55] normalize by the volume of V, for [00:22:57] example, the problem is that it [00:22:59] doesn't really normalize, because vol(V) is a [00:23:00] constant. You have to normalize against [00:23:03] something — I'm going to tell you [00:23:05] why you have to normalize at all — but if you [00:23:06] want to normalize, you have to normalize [00:23:08] by something that changes as S [00:23:09] changes, right? So here I'm only trying [00:23:13] to deal with the symmetry so far: [00:23:14] you only [00:23:17] define conductance on the smaller set. [00:23:19] This is really because you [00:23:21] don't want to cheat by saying, "I [00:23:23] have a very, very large set S, I [00:23:25] only have one point in S̄, [00:23:28] and so my conductance is very [00:23:30] small" — but actually, so you should
[00:23:32] measure the other side as well. [00:23:35] Okay, but now — [00:23:40] maybe before proceeding, let's answer the [00:23:42] question of why we have to normalize. So we [00:23:44] have to define φ(G). This is the [00:23:47] conductance of the [00:23:50] so-called sparsest [00:23:53] cut [00:23:57] of G. [00:23:59] The sparsest-cut value is defined to be [00:24:02] the minimum possible conductance, but [00:24:06] again you require that S is [00:24:09] the smaller side of the [00:24:12] two parts: φ(G) = min over S with vol(S) ≤ vol(V)/2 of φ(S). So you minimize over [00:24:16] the conductance of S, [00:24:19] subject to the constraint that the volume [00:24:22] of S is at most the volume of V over 2. [00:24:25] So, you know, basically you just want [00:24:28] to find a cut that has the smallest [00:24:30] conductance. [00:24:31] And now let's talk about normalization — [00:24:33] why we have to normalize. [00:24:37] Any questions? [00:24:41] I think the reason is pretty much [00:24:43] just [00:24:45]
because, you know, if you don't [00:24:47] normalize — [00:24:53] if you just minimize, right, if [00:24:56] you just look at [00:24:59] E(S, S̄) and minimize it, it's [00:25:02] typically minimized when S is small. [00:25:15] So suppose you draw a graph, for example. [00:25:19] If you don't normalize, basically you prefer [00:25:25] to pick a set S that is itself [00:25:28] very small, so that it doesn't connect much to [00:25:29] the other parts. [00:25:30] So, for example, let's see: [00:25:34] suppose you have a graph like this. [00:25:40] Okay, I guess what I'm doing [00:25:42] here is — [00:25:44] suppose you have two completely connected [00:25:46] subgraphs, [00:25:48] right? So you have n/2 nodes and [00:25:50] n/2 nodes, and within each of [00:25:53] the subgraphs you have complete [00:25:54] connection with each other. And then you [00:25:57] have some connections — [00:25:59] some very small number of connections — [00:26:01] between them; maybe every node is [00:26:03] connected to two of them, [00:26:04] something like this. [00:26:06] Okay, so it seems pretty
clear that [00:26:09] really the best cut you should [00:26:11] get is this one in the middle, [00:26:14] because within each cluster you have full [00:26:16] connection, and across the two [00:26:17] clusters you only have some small number — [00:26:20] let's say two edges per node. So [00:26:23] it seems pretty clear we should do [00:26:25] this. But if you look at — [00:26:27] if you use the metric E(S, S̄), [00:26:29] then you'll see that some other [00:26:32] cuts will have a smaller number of edges [00:26:34] across, because you can just [00:26:36] take [00:26:37] this to be S1, [00:26:40] where S1 contains a single node. [00:26:42] So then E(S1, S̄1) [00:26:51] is basically how many edges [00:26:53] go from S1 to S̄1 — basically the [00:26:54] number of edges connected to S1. This is [00:26:57] about n/2. [00:26:57] But the good cut — let's say [00:27:00] the good cut is S2, one of the two clusters. [00:27:02] So E(S2, S̄2) is definitely [00:27:05] something bigger than n/2, because [00:27:06] you have n/2 nodes times [00:27:08] the number of [00:27:11]
blue edges per node — I'm [00:27:15] drawing [00:27:16] something like two here — [00:27:21] basically two edges per node, so about n in total. [00:27:26] Right, so basically, even though it seems like you should cut S2, if [00:27:29] you use the [00:27:30] unnormalized version, then you would be guided [00:27:32] to S1. [00:27:34] However, if you normalize, then it's a [00:27:36] different game, right? If you normalize — [00:27:38] if you look at the conductance of S1, [00:27:41] then this is E(S1, S̄1) over the [00:27:46] volume of S1: [00:27:48] this is about (n/2) over (n/2) — I [00:27:51] think the volume is n/2 plus two or so — [00:27:57] so this is about one. And if you look [00:28:00] at [00:28:02] φ of [00:28:05] S2, [00:28:06] then this is — [00:28:14] the numerator [00:28:17] is [00:28:20] about n/2 times [00:28:22] two, something like that. And then you [00:28:24] have the total number of edges [00:28:25] connected to S2 — that's actually a big [00:28:27] number: there's something like [00:28:30] n/2 times (n/2 − 1), which [00:28:32] counts the edge endpoints within S2, plus there are some edges
between S2 and S̄2, [00:28:34] something like this. And this [00:28:36] would be something like, roughly, [00:28:39] I think on the order of 1/n. [00:28:41] So the conductance of S2 is [00:28:44] much smaller than the conductance of S1 [00:28:46] once you normalize. [00:28:51] Questions so far? [00:29:07] [So, the goal] is to find [00:29:10] an approximate [00:29:14] sparsest cut, [00:29:16] meaning [00:29:19] that you want [00:29:21] an Ŝ that satisfies that φ(Ŝ) [00:29:25] is close to the sparsest possible [00:29:29] cut value φ(G). [00:29:48] So — [00:29:55] how do I do this? [00:29:58] Um — [00:30:02] okay, I need [00:30:05] some preparation just for us to even say [00:30:08] exactly what we mean [00:30:10] and to state the result. [00:30:12] So first of all, let's define [00:30:15] d_i [00:30:17] to be the volume [00:30:19] of the [00:30:21] node i, right? You take a single node, you [00:30:24] take its volume; this is d_i, and this is [00:30:26] really just the degree [00:30:30] of node i. [00:30:34] Right, so the volume of a single node is [00:30:37] really the degree of the node. And let D [00:30:40] be [00:30:41] the diagonal matrix that contains [00:30:45] the d_i's on its diagonal. [00:30:50]
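The two-clique example above can be checked numerically. This is my own sketch (n = 20, two completely connected halves, two cross edges per node roughly as in the picture on the board):

```python
import numpy as np

n = 20
G = np.zeros((n, n), dtype=int)
A, B = range(n // 2), range(n // 2, n)
for half in (A, B):                  # complete connection inside each half
    for i in half:
        for j in half:
            if i != j:
                G[i, j] = 1
for i in range(n // 2):              # two cross edges per node
    G[i, n // 2 + i] = G[n // 2 + i, i] = 1
    j = n // 2 + (i + 1) % (n // 2)
    G[i, j] = G[j, i] = 1

def phi(S):
    """phi(S) = E(S, S-bar) / vol(S)."""
    S = list(S)
    Sbar = [v for v in range(n) if v not in S]
    return G[np.ix_(S, Sbar)].sum() / G[S, :].sum()

print(phi({0}))       # single-node S1: conductance = 1.0
print(phi(set(A)))    # one clique S2: conductance ~ 4/n, i.e. order 1/n
```

Unnormalized, E(S1, S̄1) = 11 beats E(S2, S̄2) = 20, so minimizing raw cut edges picks the single node; with the volume normalization, φ(S2) ≈ 0.18 ≪ φ(S1) = 1, matching the lecture's point.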
And let's define the [00:30:54] so-called normalized adjacency matrix, [00:31:05] called Ā, which is D to the minus one [00:31:10] half, [00:31:11] times G, times D to the minus one half: Ā = D^(−1/2) G D^(−1/2), [00:31:14] where G is the adjacency matrix — recall [00:31:17] this is our notation, [00:31:19] which is a bit of an overload of notation. [00:31:21] So what does this mean? What does this [00:31:23] really mean? This really just means that [00:31:25] — I guess you have diag(1/√d_1, [00:31:28] …, 1/√d_n) [00:31:30] times G times diag(1/√d_1, [00:31:34] …, 1/√d_n). [00:31:36] And a diagonal matrix multiplied on the [00:31:38] left means that you [00:31:40] scale [00:31:42] all of the rows, and the other matrix on [00:31:45] the right-hand side — a right-hand-side [00:31:47] multiplication — means you scale all the [00:31:49] columns. So basically you scale the [00:31:50] columns and rows simultaneously with [00:31:53] these numbers. But you can do the exact [00:31:56] computation; what it means is that [00:31:58] the (i, j)-th entry of the normalized [00:32:00] adjacency matrix is really just the [00:32:03] adjacency matrix entry G_ij over
√d_i [00:32:05] times √d_j: Ā_ij = G_ij / (√d_i · √d_j). [00:32:10] So this sounds a little bit complicated, [00:32:12] but in most of the cases — you know, I'm [00:32:14] stating this mostly for formality, [00:32:17] because, you know, [00:32:19] um, [00:32:22] the key thing can be [00:32:25] seen by assuming the graph is regular. So [00:32:29] in most cases it suffices to [00:32:32] think of [00:32:43] G as a regular graph. [00:32:47] A regular graph means that all the [00:32:50] degrees [00:32:54] are the same. [00:32:56] So let's say, suppose G is a [00:33:00] κ-regular graph, [00:33:04] meaning d_i is equal to κ for [00:33:07] every i. Then [00:33:10] the normalized adjacency matrix is really just one [00:33:13] over κ times the adjacency [00:33:15] matrix: Ā = (1/κ) G. [00:33:18] So in some sense you really didn't do [00:33:20] much except just changing the [00:33:22] scaling. But this scaling is [00:33:25] kind of important in a [00:33:27] formal sense, because it's going to make [00:33:29] the formulas very clean — but it's not [00:33:31] fundamentally [00:33:33] super important. [00:33:35]
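A quick sketch of Ā = D^(−1/2) G D^(−1/2), including the κ-regular sanity check just mentioned (the function name is mine):

```python
import numpy as np

def normalized_adjacency(G):
    """A_bar = D^(-1/2) G D^(-1/2), where D = diag(degrees)."""
    d = G.sum(axis=1)                    # d_i = degree of node i
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ G @ D_inv_sqrt   # entrywise: G_ij / sqrt(d_i * d_j)

# For a kappa-regular graph, A_bar is just (1/kappa) * G.
# Example: a 4-cycle is 2-regular.
G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
A_bar = normalized_adjacency(G)
print(np.allclose(A_bar, G / 2))  # True
```

The left diagonal factor rescales rows, the right one rescales columns, which is exactly the entrywise formula in the comment.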
So this is pretty much it. If you [00:33:40] like — if you don't want to think [00:33:43] about the d_i's and d_j's, you pretty much can [00:33:45] think of this simple case where you have [00:33:46] a regular graph. [00:33:49] Okay. So once you define the [00:33:50] normalized adjacency matrix, you [00:33:54] can also define the so-called Laplacian [00:33:56] matrix, [00:33:59] which is I minus the normalized adjacency matrix: L = I − Ā. [00:34:06] I think you can probably see that one of [00:34:09] the reasons why we have to normalize is [00:34:10] that if you don't normalize, it doesn't [00:34:12] make sense to take the [00:34:14] difference between it and the identity, [00:34:17] because the unnormalized matrix [00:34:18] doesn't have the right scale. So you have to [00:34:19] normalize it so that you can [00:34:21] take the difference with the identity. [00:34:24] And this Laplacian matrix is [00:34:26] really not doing that much: [00:34:29] it's not that different from the normalized [00:34:32] adjacency matrix anyway, because they [00:34:33] pretty much — everything corresponds [00:34:35] to each other, right? So the eigenvectors
[00:34:37] of L [00:34:40] are the same as [00:34:43] the eigenvectors [00:34:46] of Ā, and the spectra are just [00:34:49] flipped relative to each other. So let's say, [00:34:51] suppose [00:34:54] L has [00:34:58] eigenvalues [00:35:00] λ_1 up to λ_n — and [00:35:03] let's sort them; I think in this [00:35:05] literature you always want to [00:35:14] order them, [00:35:15] λ_1 ≤ λ_2 ≤ … ≤ λ_n — with, [00:35:18] say, [00:35:22] eigenvectors [00:35:24] u_1 up to u_n. [00:35:24] Then this is equivalent [00:35:27] to: Ā [00:35:29] has eigenvalues [00:35:32] 1 − λ_1 [00:35:34] up to 1 − λ_n — now I'm [00:35:37] sorting in [00:35:38] decreasing order — and with [00:35:41] the same eigenvectors [00:35:46] u_1 up to u_n. [00:35:49] Right. So — [00:35:51] but as you'll see, so [00:35:54] far you don't even have to think [00:35:56] about the Laplacian, right? The Laplacian [00:35:58] will come into play at some later [00:36:01] phase, but so far you can just think of [00:36:04] the Laplacian as a flipped version of [00:36:06] the normalized adjacency matrix.
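The flip between the spectra of L and Ā can be verified numerically on any small graph; here is a sketch of mine on a 4-node path:

```python
import numpy as np

# A small graph: the path 0-1-2-3.
G = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = G.sum(axis=1)
A_bar = np.diag(d ** -0.5) @ G @ np.diag(d ** -0.5)  # normalized adjacency
L = np.eye(4) - A_bar                                # normalized Laplacian

lam_L = np.sort(np.linalg.eigvalsh(L))               # ascending: lambda_1 <= ... <= lambda_n
lam_A = np.sort(np.linalg.eigvalsh(A_bar))[::-1]     # descending: 1 - lambda_1 >= ...
print(np.allclose(lam_A, 1 - lam_L))  # True: the spectra are flips of each other
```

The smallest Laplacian eigenvalue λ_1 comes out as 0 here (its eigenvector is proportional to D^(1/2)·1), which is a standard fact for the normalized Laplacian of a connected graph.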
It's nothing, really. [00:36:11] So, [00:36:13] okay, these are some slightly [00:36:15] abstract preparations, [00:36:17] and now let's see what we can do with [00:36:19] this. So this is, [00:36:22] in my opinion, a pretty [00:36:26] important theorem: Cheeger's inequality, [00:36:30] which actually dates back to 1969, by [00:36:34] Jeff Cheeger. It says the following: [00:36:37] λ_2 — this is the [00:36:41] second eigenvalue [of L] — [00:36:43] over 2 is at most the conductance of G, [00:36:46] which is at most the square root of 2λ_2: [00:36:49] λ_2 / 2 ≤ φ(G) ≤ √(2λ_2). [00:36:50] So why is this a very [00:36:52] important thing? It connects the [00:36:54] conductance — the cuts, the sparsest cut — to [00:36:57] something in linear algebra, to the eigenvalues and [00:37:00] eigenvectors, right? The sparsest cut is [00:37:03] a very combinatorial thing: if you [00:37:05] really want to find the [00:37:06] sparsest cut, you'd probably have to [00:37:07] enumerate all the possible cuts to [00:37:09] find the sparsest one — the definition [00:37:10] is a combinatorial thing. [00:37:14] But this
inequality is saying that [00:37:17] somehow the sparsest-cut [00:37:19] value has a lot to do with the [00:37:22] eigenvalues of the Laplacian, or of the [00:37:25] adjacency matrix; [00:37:27] and in particular it's very [00:37:29] close to the second eigenvalue of the [00:37:31] Laplacian matrix. [00:37:33] And moreover, you can also find [00:37:42] an approximate cut Ŝ [00:37:45] such that [00:37:49] the conductance of this cut Ŝ is less [00:37:53] than the square root of 2λ_2, which is [00:37:55] less than 2 times the square root of φ(G): [00:37:59] φ(Ŝ) ≤ √(2λ_2) ≤ 2√(φ(G)) — [00:38:01] and this is computationally efficient. [00:38:07] And not only computationally efficient, but [00:38:09] actually pretty simple: what you [00:38:11] can do is the following, [00:38:13] by [00:38:14] rounding the eigenvectors. I guess [00:38:17] rounding really means the following — [00:38:19] "rounding" is the term from [00:38:22] the approximation-algorithms literature; I guess [00:38:24] if you don't know where the [00:38:25] term comes from, it doesn't matter. So [00:38:27] here's the procedure to find such a [00:38:29] set Ŝ. So suppose you take u_2. [00:38:33]
[00:38:35] Suppose u₂ = (v₁, …, vₙ) is the second eigenvector. You can take a threshold τ and consider the set Ŝ_τ of all the coordinates i that satisfy vᵢ ≤ τ. So you take a threshold, but you don't have to consider all possible thresholds; that's not necessary, because the threshold can be chosen from among the coordinate values themselves. You look at all the coordinates that are smaller than this threshold, and that is your set Ŝ. So you basically have a family of these sets, Ŝ₁, Ŝ₂, Ŝ₃, and so forth, and one of these Ŝᵢ satisfies Φ(Ŝᵢ) ≤ 2√(Φ(G)); that Ŝᵢ will be a good cut.
[00:40:06] I guess I'm stating this in a formal way that may seem a little confusing, so here is what you are really doing, in plain language. First you sort the coordinates, getting β₁ ≤ β₂ ≤ ⋯. Then the claim is that if you take the first i coordinates in sorted order as the set Ŝᵢ, one of these sets will be a good cut. So you can try one cut, which is {β₁}; you can try another cut, which is {β₁, β₂}; and another, which is {β₁, β₂, …, βᵢ}; and one of these cuts will be a good cut for the graph, with small conductance. And of course, at the end you have to map the coordinates back to the original indexing, because you have sorted them.
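The sweep procedure just described can be sketched in code. This is an illustrative implementation, my own and not from the lecture, assuming a symmetric 0/1 adjacency matrix with no isolated vertices; it forms the normalized Laplacian, sorts the second eigenvector, and tries every prefix cut.

```python
import numpy as np

def sweep_cut(A):
    """Sweep rounding from Cheeger's inequality: sort the entries of the
    second eigenvector of the normalized Laplacian and try every prefix cut.
    A: symmetric 0/1 adjacency matrix with no isolated vertices.
    Returns (vertex set, its conductance)."""
    n = A.shape[0]
    d = A.sum(axis=1)                                  # degrees
    d_is = 1.0 / np.sqrt(d)
    L = np.eye(n) - d_is[:, None] * A * d_is[None, :]  # L = I - D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    order = np.argsort(eigvecs[:, 1])                  # sort coords of u2: beta_1 <= beta_2 <= ...
    vol_total = d.sum()
    best_phi, best_set = np.inf, None
    for i in range(1, n):                              # prefix cuts S_1, ..., S_{n-1}
        in_S = np.zeros(n, dtype=bool)
        in_S[order[:i]] = True
        cut = A[np.ix_(in_S, ~in_S)].sum()             # edges crossing (S, S-bar)
        vol_S = d[in_S].sum()
        phi = cut / min(vol_S, vol_total - vol_S)      # conductance of this cut
        if phi < best_phi:
            best_phi, best_set = phi, set(order[:i].tolist())
    return best_set, best_phi
```

For instance, on a graph made of two triangles joined by a single edge, the sweep recovers one triangle as the low-conductance side.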
[00:41:31] Any questions? [Student question, partly inaudible.] Right, here you don't know where the exact threshold should be, so you try all the thresholds, β₁ up to βₙ, all of them. Okay, cool. So this is a pretty magical theorem, in my opinion. I'm not going to prove it; if you are interested, there are a lot of lecture notes that prove it. You can enumerate all of the candidate cuts and see which one satisfies the bound. So I'm going to skip the proof. I see some questions here. The question online is: the set Ŝ found this way isn't the best possible cut, right?
[00:43:01] Yes. You are not guaranteed to find the best possible cut; you are only guaranteed to find a cut Ŝ whose value Φ(Ŝ) satisfies Φ(Ŝ) ≤ 2√(Φ(G)). If you could magically change this bound to Φ(G), that would mean you find the best cut, because Φ(G) is the value of the best cut; of course, we believe there can be multiple best cuts, but you would definitely find one of them. However, we don't have that strong a statement; we only show the bound 2√(Φ(G)). So you lose something: √(Φ(G)) is bigger than Φ(G), because Φ(G) is less than one. You lose some factor relative to the best possible conductance. I hope that answers the question. [Student:] Why do you have to lose a little bit at all?
[00:44:12] To some extent because, and I guess this is somewhat post hoc, but if you think about it in retrospect: one of these quantities is very combinatorial (the sparsest cut), and the other is very linear-algebraic. It seems unlikely that they could be exactly the same. So in my opinion it's already fortunate that they are somewhat related at all. I'll also mostly discuss some of the intuitions, some of the more basic reasons why such a connection can be possible. [Student:] I wanted to go back to the statement of the theorem: we can find a cut Ŝ such that Φ(Ŝ) ≤ √(2λ₂), which then is less than 2√(Φ(G)). Is the √(2λ₂) bound what we're actually finding, with the comparison to 2√(Φ(G)) just following from the theorem?
[00:45:20] [Student continues:] You just care about the comparison with Φ(G); the √(2λ₂) part is just the intermediate step, isn't it? [Instructor:] Sure. Okay, so first of all, yes, you are right: the second inequality is obtained just by using the left-hand side of Cheeger's inequality. And second, yes, probably the first-order thing you care about is comparing with Φ(G), and the λ₂ bound is just the intermediate step. However, that's only the first-order answer; I think if you look at the proof, the λ₂ argument does have to show up somewhere. Also, the intermediate bound √(2λ₂) is actually pretty good on its own, because 2λ₂ can be relatively small, but then you lose more in the other inequality. Can the loss be avoided? I think it's possible in principle, but we don't really know; it's kind of
very hard. [00:46:28] I think there are hard instances in both cases: Φ(G) can be very close to λ₂/2, or it can be very close to the √(2λ₂) side. Yeah. Cool, so I'll focus on some intuitions, and the first thing I'm going to discuss is, again, about the scaling, to some extent. So first of all: why do you take the second eigenvector? That always seemed magical to me at first sight, when I learned it. After I spent some time with these quantities, I realized, as I said last time, that the top eigenvector is kind of a background: either the smallest eigenvector of L or, equivalently, the top eigenvector of Ā is kind of not that interesting.
[00:47:37] Why is it not interesting? Because it is pretty much only capturing, in some sense, what I call the background: a kind of background density of the graph. What I really mean by this is the following. Let's say G is κ-regular. Then we know, and I think I said this in the previous lecture, that the all-ones vector is the top eigenvector of A, the adjacency matrix of G, and thus also the top eigenvector of Ā, which is just (1/κ)A. So when G is regular, the top eigenvector is really just the all-ones vector, and in the more general case it just involves a scaling based on the density. For a general G, what happens is that the top eigenvector is really just u₁ = (√d₁, …, √dₙ). The scaling doesn't matter here, because
[00:49:09] a scalar multiple of an eigenvector is also an eigenvector, so I don't care about the scaling. So this is the top eigenvector of Ā, which means it is also the smallest eigenvector of the Laplacian L. Why is this the case? You can verify it very easily. Take the matrix-vector product Ā·u₁ and look at its i-th coordinate: it equals Σⱼ Āᵢⱼ uⱼ. Now Āᵢⱼ is a scaled version of the graph, Gᵢⱼ/(√dᵢ √dⱼ), and uⱼ is √dⱼ. Summing over j, the √dⱼ factors cancel and you get a 1/√dᵢ in front: (Ā u₁)ᵢ = (1/√dᵢ)·Σⱼ Gᵢⱼ. And recall that Σⱼ Gᵢⱼ is precisely the definition of the degree, the total number of edges connected to vertex i, so you get (1/√dᵢ)·dᵢ, which is √dᵢ. That verifies that u₁ is an eigenvector: Ā u₁ = u₁.
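This verification is easy to reproduce numerically. Here is a small sanity check (an added sketch, not part of the lecture) on an arbitrary graph: form Ā = D^(−1/2) A D^(−1/2) and check that u₁ = (√d₁, …, √dₙ) is fixed by Ā.

```python
import numpy as np

# An 8-cycle plus two chords, so every vertex has degree >= 2.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
A[0, 4] = A[4, 0] = 1
A[2, 6] = A[6, 2] = 1

d = A.sum(axis=1)                     # degrees d_i
A_bar = A / np.sqrt(np.outer(d, d))   # A_bar = D^{-1/2} A D^{-1/2}
u1 = np.sqrt(d)                       # the claimed top eigenvector
print(np.allclose(A_bar @ u1, u1))    # True: eigenvalue exactly 1
```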
[00:50:33] So basically, as before, the top eigenvector is not doing much: it is really just capturing the density, the degrees of the graph. The second eigenvector is what starts to talk about the interconnections; it says more about the relationships between vertices and the kind of hidden communities. And now let's look at some intuitions about why the Laplacian, why these eigenvectors, are related to the cut. Here is another way to think about it. Look at the quadratic form of the Laplacian: vᵀLv. What is this? It is vᵀ(I − Ā)v = vᵀv − vᵀĀv. Written out, this is Σᵢ vᵢ², for i from 1 to n, minus Σᵢ,ⱼ vᵢ vⱼ Āᵢⱼ.
[00:52:06] And Āᵢⱼ is Gᵢⱼ/(√dᵢ √dⱼ), where Gᵢⱼ is one exactly when (i, j) is an edge. So what we get is Σᵢ vᵢ² minus 2·Σ_{(i,j)∈E} (vᵢ/√dᵢ)(vⱼ/√dⱼ); you get the 2 here because both (i, j) and (j, i) appear in the double sum. Now I claim that this equals Σ_{(i,j)∈E} (vᵢ/√dᵢ − vⱼ/√dⱼ)². Why is this true? You can expand this expression into terms, and you can see that the cross terms match; the only thing left is to see that the squared terms match the Σᵢ vᵢ². We can verify that by looking at Σ_{(i,j)∈E} vᵢ²/dᵢ, which is Σᵢ (vᵢ²/dᵢ) times the number of j such that (i, j) ∈ E.
[00:53:45] I guess this is probably obvious and I'm making it a little too complicated: if you sum over j first and then over i, the number of edges connected to i is exactly dᵢ, so you get Σᵢ (vᵢ²/dᵢ)·dᵢ, which is Σᵢ vᵢ². Okay, sounds good. So, if G is regular, say κ-regular, then you can ignore the dᵢ's and just say vᵀLv = (1/κ)·Σ_{(i,j)∈E} (vᵢ − vⱼ)². Okay. So why do I care; why so much work to get this equation?
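The identity can also be checked numerically. The following sketch (my addition) evaluates both sides of vᵀLv = Σ_{(i,j)∈E} (vᵢ/√dᵢ − vⱼ/√dⱼ)² for a random real vector on a small graph, confirming there is no missing constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7
A = np.zeros((n, n))
for i in range(n):                     # a 7-cycle plus two chords
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
A[0, 3] = A[3, 0] = 1
A[2, 5] = A[5, 2] = 1

d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
v = rng.standard_normal(n)                    # an arbitrary real vector

lhs = v @ L @ v                               # the quadratic form
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j]]
rhs = sum((v[i] / np.sqrt(d[i]) - v[j] / np.sqrt(d[j])) ** 2 for i, j in edges)
print(np.isclose(lhs, rhs))                   # True: the identity is exact
```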
[00:55:19] This equation is very important because it is how these algebraic quantities link to the conductance. This is still a linear-algebraic quantity; it's a quadratic form. However, suppose now you restrict v to be binary: suppose v ∈ {0, 1}ⁿ is a binary vector, and you take S to be the support of v, that is, the indices where v is one. Then you can see from this formula that vᵀLv = (1/κ)·Σ_{(i,j)∈E} (vᵢ − vⱼ)², and the term (vᵢ − vⱼ)² is one exactly when i and j are in different parts: when i ∈ S and j ∈ S̄, or i ∈ S̄ and j ∈ S. So basically the sum is the number of edges between S and S̄, because only when i and j are
[00:56:47] eyes across the groups this this vmx VJ square is because one [00:56:50] this this vmx VJ square is because one otherwise is equal to zero [00:56:52] otherwise is equal to zero so this is why it's one over Kappa times [00:56:55] so this is why it's one over Kappa times the number of edges across [00:56:58] the number of edges across so the quadratic form connects to [00:57:01] so the quadratic form connects to the [00:57:03] the the number of projects across the two [00:57:05] the number of projects across the two groups will V is not binary if V is not [00:57:08] groups will V is not binary if V is not binary or Force it's not true but if it [00:57:10] binary or Force it's not true but if it is binary it's true so and now [00:57:13] is binary it's true so and now uh all the other words you can write V [00:57:16] uh all the other words you can write V transpose LV is [00:57:18] transpose LV is for Kappa [00:57:20] for Kappa the support of me [00:57:22] the support of me is important [00:57:27] so and now suppose if [00:57:30] so and now suppose if the support of me [00:57:32] the support of me the size is less than over two so you [00:57:35] the size is less than over two so you have the volume is less than so this [00:57:37] have the volume is less than so this means that [00:57:40] the volume of this s is less than the [00:57:43] the volume of this s is less than the volume [00:57:45] volume of B over two [00:57:47] of B over two because this is a regular graph the [00:57:48] because this is a regular graph the volume is really just the size of the [00:57:50] volume is really just the size of the set and then in this case v transpose LV [00:57:54] set and then in this case v transpose LV over [00:57:56] over this this [00:57:58] this this um ratio V transpose LV over the norm of [00:58:00] um ratio V transpose LV over the norm of V Square [00:58:02] V Square this becomes one over a couple times the [00:58:05] this becomes one over a couple times the 
number of edges between S and S̄. [00:58:07] And what is ‖v‖²? As we noted before, it is really just the size: ‖v‖² = |S|, and the size of S is really just vol(S)/κ; the volume is the number of edges connected to S, and since the graph is regular, the volume is just κ times the size of the set. So the κ's cancel and you get |E(S, S̄)| / vol(S), which is the conductance of S. So basically the conductance of S can be written in this form, and this form is a kind of linear-algebraic form: the ratio vᵀLv / ‖v‖² is called the Rayleigh quotient.
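As a concrete instance of this calculation (an added example, not from the lecture), take the 6-cycle, which is 2-regular, and let v be the indicator of three consecutive vertices; the Rayleigh quotient and the conductance agree.

```python
import numpy as np

n = 6
A = np.zeros((n, n))
for i in range(n):                        # the 6-cycle: a 2-regular graph
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
kappa = 2

L = np.eye(n) - A / kappa                 # normalized Laplacian, regular case
v = np.array([1., 1., 1., 0., 0., 0.])    # indicator of S = {0, 1, 2}

rayleigh = (v @ L @ v) / (v @ v)          # Rayleigh quotient of the binary v
cut = A[:3, 3:].sum()                     # edges leaving S: (2,3) and (5,0)
phi = cut / (kappa * 3)                   # conductance: cut / vol(S), vol(S) = kappa*|S|
print(rayleigh, phi)                      # both equal 1/3
```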
The point here is that the Rayleigh quotient is really important because it connects to the conductance; but of course the connection is not exact, since it requires v to be binary. [00:59:22] Right, so if you compute eigenvectors, you are minimizing the Rayleigh quotient without any constraints. Questions? [Student comment.] Right. But the minimum conductance, the sparsest cut, basically means minimizing the Rayleigh quotient with the binary constraint. And in some sense the secret of Cheeger's inequality, the core of it, is really that with the constraint or without the constraint, the minimum doesn't change that much. So the proof actually works out something like this: you first find the eigenvectors, and you get real numbers, meaning the entries of v are real; then you round v into binary vectors, and you show that by rounding you don't lose too much of the Rayleigh quotient. That is roughly how the proof
works. [01:00:53] So that's the intuition. And all of this can be extended: the same statements hold for graphs that are not regular, and also for graphs that are weighted. Here the graphs were just binary, with 0/1 edge weights, but you can also do all of this for weighted graphs. Okay, great. So I hope I have convinced you, by these two examples, the stochastic block model and this worst-case situation, that the eigenvectors are very related to graph clustering. And this kind of algorithm has been used in practice; doing this is called spectral clustering. Okay, how do I say this: the materials I presented mostly come from the theoretical computer science community, and there it
doesn't have much to do with machine learning, right. [01:02:07] What people care about there is that you just want to partition a graph into two clusters, so you don't have to know machine learning to define the problem or to study it. [01:02:18] And there is a so-called spectral clustering approach. [01:02:26] This was brought to the machine learning community, I think around 2000 — I guess by the paper by Shi and Malik, [01:02:36] and the one by Ng, Jordan, and Weiss, [01:02:43] around 2000, 2001. [01:02:48] And the way you do it is you define a graph from the machine learning data, and then you apply this algorithm. [01:02:58] So this basically brings us to the question of how to choose this graph. [01:03:02] So: how to choose, or design, the graph. [01:03:07] Because in TCS, the graph is given to you — maybe some graph somebody gives you — but in machine
learning, you have to somehow choose your graph. [01:03:17] Right, so in the Andrew Ng paper, the definition of the graph is something like this. You first say: you give me some raw data, [01:03:31] x_1 up to x_n — these are the n training data points — [01:03:36] and then you define the graph G [01:03:38] to be something like this: G_ij — this is a weighted graph; I didn't really discuss weighted graphs, but there's a natural extension to weighted graphs — the weight between i and j is something like exp(−‖x_i − x_j‖² / (2σ²)). [01:04:00] I guess this is probably something that is very familiar to you: this is just the RBF kernel, the Gaussian kernel. [01:04:07] So you define this, with some tuning parameter, or you can have some other variants, right. [01:04:12] So you can define a graph based on some distances between the examples. [01:04:18] And then what you do is you
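The graph construction just described is a Gaussian (RBF) affinity matrix. A minimal numpy sketch; the function name and the zero-diagonal convention are my choices, not the lecture's:

```python
import numpy as np

def rbf_affinity(X, sigma=1.0):
    """Build the weighted graph G_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X is an (n, d) array of raw data points; returns an (n, n) symmetric
    affinity matrix with zero diagonal (no self-loops).
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(G, 0.0)
    return G

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
G = rbf_affinity(X, sigma=1.0)
# nearby points (rows 0 and 1) get weight near 1; the far point gets ~0
```

Note how the choice of σ controls which distances count as "close" — this is exactly the tuning-parameter issue the lecture returns to later.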
say: you do this, you run spectral clustering, you get the eigenvectors. [01:04:30] So first I define a graph G and get eigenvectors of the Laplacian, [01:04:36] or of a normalized adjacency matrix. [01:04:40] And here, you know, it's not only two clusters — you can do multiple clusters. And when you do multiple clusters, what you do is you say: you get eigenvectors, [01:04:50] say u_1, u_2, up to u_K, supposing you want to have K clusters. [01:04:55] And this is a matrix of dimension [01:05:00] R^(n×K), right, so each column is an eigenvector, and you have K of these eigenvectors. And now what you do is you say: you take the rows as the embeddings, [01:05:18] or in the modern world you call them representations — because, you know, probably some of you have heard of representation learning. [01:05:27] And for the i-th example — [01:05:34] so basically, for every example x_i, it now becomes represented as, maybe
let's call it v_i — v_i, which is K-dimensional; [01:05:46] K corresponds to how many eigenvectors you take. [01:05:51] And then you can run the usual machine learning algorithms — maybe K is something like two or three — and in the original paper, Andrew Ng's paper, I think you run some kind of other clustering algorithm on top: [01:06:05] k-means. [01:06:08] I guess you've probably heard of k-means — you run k-means on the representations [01:06:18] to cluster them again. [01:06:24] So this is the so-called spectral clustering algorithm. And actually, later — I think around 2013, 2014 — there were a few papers that analyzed this, [01:06:34] so that you can actually get reasonable representations and clusters by using this approach. [01:06:47] Any questions? [01:06:58] So, what's the issue with this? The issue with this [01:07:05] is that the graph G could be not very meaningful. [01:07:09] So in high dimensions, [01:07:14] all
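The pipeline just described — RBF graph, top-K eigenvectors of the normalized adjacency, rows as embeddings, then cluster — can be sketched in a few lines of numpy. A toy example under my own assumptions (two synthetic blobs; for two clusters I threshold the second eigenvector instead of running k-means, echoing the thresholding discussed later):

```python
import numpy as np

def spectral_embedding(G, k):
    """Rows of the top-k eigenvector matrix of the normalized adjacency
    A_bar = D^{-1/2} G D^{-1/2}; row i is the k-dim embedding of x_i."""
    d = G.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_bar = D_inv_sqrt @ G @ D_inv_sqrt
    _, vecs = np.linalg.eigh(A_bar)       # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]           # n x k: top-k eigenvectors as columns

# two tight, well-separated blobs standing in for "two clusters"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
G = np.exp(-sq / 2.0)
np.fill_diagonal(G, 0.0)

U = spectral_embedding(G, k=2)
labels = (U[:, 1] > 0).astype(int)  # threshold the 2nd eigenvector
```

On this toy data the second eigenvector takes one sign on each blob, so the threshold recovers the two clusters.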
the data points are very far away from each other. [01:07:24] All the training data points are pretty much all far away from each other. [01:07:45] So this Euclidean distance — the Euclidean distance between these training data points — becomes pretty much meaningless. [01:07:56] Because if you take the Euclidean distance, say, between a cat and a dog, versus the distance between a dog and a dog, you probably wouldn't see much difference, because two dogs could still have a very big Euclidean distance — [01:08:10] two random dogs. [01:08:13] So I think this is, in some sense, the problem: the graph itself is not meaningful. If the graph itself is not very useful, then even finding the sparsest cut of the graph is not that important, not that useful for you. [01:08:28] So that's why the theoretical analysis for this spectral clustering algorithm, you know, doesn't
really deliver that much, because it didn't really consider how the graph was generated, [01:08:38] right. All of this theory says that if you're given a good graph, you can find a sparsest cut for it using this approach, but it doesn't really say anything about how the graph is generated. [01:08:50] So, for the last 15 minutes, I'm going to briefly discuss one of the working lines in my group recently, where we try to reuse this classical idea, but use it [01:09:05] in a different way. [01:09:12] So this is in one of the papers by HaoChen et al. in my group. What we are trying to do is we say: you consider an infinite graph. [01:09:30] So G = (V, W) — V is the vertices, [01:09:35] W is the weights on the edges. [01:09:38] And we take V to be [01:09:43] all the possible inputs. [01:09:45] So this is the set of all possible
[01:09:54] data points. So this graph depends on the population space, right. So X is basically the space of all possible, say, images, and your graph is defined on it — each image corresponds to a vertex. [01:10:06] So before, the graph had size little n — [01:10:13] right, it was a little n-by-n matrix — and now the graph has a much bigger size: the size is the same as the cardinality of all possible data points, [01:10:24] which could be infinity. [01:10:28] So — for example, let's say you have a finite number of possible images, but it could be exponential. So let's say we have an exponential-size graph. [01:10:51] So then, on this graph, what you do is you define w(x, x′), [01:10:58] the weight between two nodes, two vertices, [01:11:01] and let's say the weight needs to be
large only when x and x′ are close. [01:11:13] How to do this? [01:11:16] So — am I still using Euclidean distance? That's not exactly the definition here, because I think [the real definition] requires too much background, which I can't cover in ten minutes; but you can think of this as almost the same as the previous definition of the graph: [the weight is large] when x and x′ are close. The point is that they have to be *very* close. [01:11:39] So before, you had to choose the sigma very, very subtly, because all the points are very far away from each other, right. [01:11:48] But now you say: I don't care about all those points that are far away from each other; I just care about those pairs of points that are very close to each other. [01:11:54] So suppose you have two random dogs — you say they are not connected. But if you only
have one dog, and then you have a perturbation of that same dog — you say those are two dogs that are connected to each other. [01:12:07] So then this graph becomes more meaningful, because you only connect very nearby cats and dogs, [01:12:14] or rather, nearby images, and so the graph becomes more meaningful. [01:12:20] So the pro is that [01:12:23] the graph is more meaningful. [01:12:30] I guess the con is that it becomes infinite-dimensional, [01:12:36] and you don't have this graph, because you don't know all the possible data points — you only have some sampled data points. [01:12:44] So [the graph is] infinite-dimensional, or exponential-dimensional, [01:12:52] and you don't have access to this graph. [01:13:04] So the way we fix the cons is the following. And also, maybe another con is that even the eigenvector itself, right — the eigenvector [01:13:15] is also high-dimensional, right — [01:13:18] right, it's infinite-dimensional, [01:13:21] because the length of the eigenvector is the same as the dimension
of the graph. [01:13:30] So what we are doing here is that we [01:13:36] use the deep learning ideas — actually, the real research is in the reverse direction: we somehow try to explain the deep learning ideas — but here, in this context, you can think of this as using the parametric, neural-network kind of ideas to deal with this issue. [01:13:54] So what you do is you say: suppose you have an eigenvector u — this is the eigenvector. [01:14:01] And here the eigenvector is a high-dimensional vector, so you can say its entry u_x is indexed by all the possible data points x, [01:14:12] right; this is of huge dimension, something like maybe R^(2^N), or even infinite-dimensional, depending on how many vertices are in your set. [01:14:23] And you don't even have, you know, space to store all of this, even for a single eigenvector — you
don't have any space to save it. But what you do is you say: you represent [01:14:32] this u — [01:14:36] u sub x — by [01:14:40] a neural network applied on the [01:14:45] raw data point x, [01:14:48] so u_x = f_θ(x), where [01:14:50] f_θ is [01:14:53] a parameterized model. [01:14:59] So if you do this, then at least you can describe the eigenvector by θ; now you don't have to specify all the capital-N numbers [01:15:09] — to specify the eigenvector, you only have to specify the θ that describes this eigenvector. Of course, you know, if you believe that neural networks are powerful enough, then you can express any vectors; but of course, maybe the network is not powerful enough, so you have to make some assumption that your networks can represent these kinds of eigenvectors. But suppose, under that assumption — then you can at least represent the eigenvectors by θ, and now basically the question changes to:
[01:15:43] you want to find θ [01:15:48] such that this vector f_θ(X), right — this very high-dimensional vector — [01:15:52] is an eigenvector [01:15:58] of the graph G. [01:16:03] So at least you're trying to find a low-dimensional object: you're trying to find a parameter θ, right; you are not trying to find a high-dimensional vector anymore. [01:16:11] And it turns out that if you do this — [01:16:31] basically, there is an algorithm you can use to try to achieve this. Let me see whether I have time. [01:16:45] I guess what we can do is the following. So what you do is you say: how do I find the eigenvectors of L, [01:16:52] like this? So suppose I have access to the whole graph — which I don't, but suppose I have it. What I can do is [01:16:59] minimize [01:17:01] the following thing. [01:17:03] Let me write it down.
[01:17:12] So, [01:17:14] maybe let's do this: minimize over F the quantity ‖Ā − F Fᵀ‖²_F, [01:17:20] where Ā is an N-by-N matrix. So first of all, I claim that this gives [01:17:26] the top eigenvectors [01:17:30] of Ā. [01:17:32] This is because — [01:17:34] this is something that I probably won't have time to explain that much — but if you want to fit a low-rank matrix, [01:17:45] this is exactly the top-K eigenvectors. [01:17:50] So if you want to find the low-rank matrix closest to the matrix Ā, [01:17:55] the best fit would be to use the eigenvectors of Ā. [01:18:00] So there is a classical theorem to show this. [01:18:08] So basically, the minimizer of this will be some version of the eigenvectors; I think F, the minimizer of this, will be some scaling of the eigenvectors. [01:18:19] And then, if you use this objective, then you can replace the capital F — which is non-parametric; it's a very big matrix — [01:18:26] so you can say that F
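The "classical theorem" alluded to here is a symmetric variant of the Eckart–Young(–Mirsky) theorem. A small numpy check of the claim, with a random PSD matrix standing in for Ā (my toy example, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
A_bar = B @ B.T                      # symmetric PSD stand-in for the big matrix

vals, vecs = np.linalg.eigh(A_bar)   # eigenvalues in ascending order
k = 2
F_star = vecs[:, -k:] * np.sqrt(vals[-k:])   # top-k eigenvectors, scaled

# ||A_bar - F* F*^T||_F equals the norm of the discarded eigenvalues ...
residual = np.linalg.norm(A_bar - F_star @ F_star.T)
expected = np.linalg.norm(vals[:-k])

# ... and any other n x k factor does worse
F_rand = rng.normal(size=(6, k))
worse = np.linalg.norm(A_bar - F_rand @ F_rand.T)
```

So the minimizer F of ‖Ā − FFᵀ‖²_F is exactly the top-k eigenvectors scaled by the square roots of their eigenvalues, which is why fitting this objective recovers (a scaling of) the eigenvectors.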
— now, F is supposed to be [01:18:32] something like this, right: [01:18:34] you write it as [01:18:36] F = (f_θ(x₁), ..., f_θ(x_N))ᵀ, [01:18:47] stacked as rows. So you replace the rows by the parametric version: you say that [01:18:52] every row is now a network applied to the raw [01:18:55] data. [01:18:56] And then what you can get is that this will be — [01:19:02] if you write it in this form — first of all, you write the Frobenius norm as a [01:19:07] sum over i, j in [N] of (Ā_ij − (F Fᵀ)_ij)², [01:19:14] where the (i, j)-th entry of F Fᵀ is [01:19:18] the i-th row inner product with the j-th row; [01:19:22] so that's why this is equal to [01:19:24] the sum over i, j of (Ā_ij − f_θ(x_i)ᵀ f_θ(x_j))². [01:19:35] And I can change this so that, instead of minimizing over F, now you minimize over θ. [01:19:41] And I guess I don't have time to go through all the details, but basically this is now an objective function that you can optimize. Of course,
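The rewriting step above — the Frobenius norm as a double sum of entrywise squared errors, with (FFᵀ)_ij the inner product of rows i and j — is easy to verify numerically (my toy sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 3
A_bar = rng.normal(size=(n, n))
A_bar = (A_bar + A_bar.T) / 2.0          # symmetric stand-in for the graph matrix
F = rng.normal(size=(n, k))              # row i plays the role of f_theta(x_i)

# matrix form of the objective ...
frob_sq = np.linalg.norm(A_bar - F @ F.T, "fro") ** 2

# ... equals the entrywise double sum, since (F F^T)_ij = <row i, row j>
double_sum = sum((A_bar[i, j] - F[i] @ F[j]) ** 2
                 for i in range(n) for j in range(n))
```

This is exactly what lets the objective be written as a sum over pairs (i, j), and hence estimated by sampling pairs, as described next.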
the problem is that you still have this sum, this big sum. You can replace this [01:19:58] by the empirical version. [01:20:02] So you can get: minimize over θ — you [01:20:06] can take some random samples. [01:20:16] I guess I'm not sure whether [01:20:18] I should write this out; okay, maybe I'll just [01:20:20] say: you can sample some pairs, [01:20:28] and estimate [the sum] [01:20:34] using empirical examples. [01:20:43] And actually, it turns out that if you simplify this formula, this will be [01:20:47] something similar to the contrastive [01:20:50] learning algorithms [01:20:52] that are used in practice. I guess this part I don't really have time to show; [01:20:57] I'll refer you to the paper. [01:21:02] I think probably I should just [01:21:05] stop here. Are there any questions, first, [01:21:10] about this? I know this part is a [01:21:13] little bit [dense] — [01:21:14] feel free to ask any questions. [01:21:22] [Student question about the paper.] Yeah, I think the loss is not exactly the contrastive learning loss used in practice, so
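The simplification the lecture points to (worked out in the HaoChen et al. paper) yields a loss with an "attract" term over positive pairs (e.g., two augmentations of the same image, which are connected in the population graph) and a "contrast" term over independent pairs. A minimal numpy sketch; the function name, the batching convention, and the constants are my own simplification, not the paper's exact formulation:

```python
import numpy as np

def spectral_contrastive_loss(Z1, Z2, Zn1, Zn2):
    """Z1[i], Z2[i]: embeddings f_theta of two augmentations of the same
    image (a positive pair). Zn1[i], Zn2[i]: embeddings of two independent
    images (a negative pair). The attract term pulls positives' inner
    product up; the contrast term pushes negatives' inner products to 0."""
    attract = -2.0 * np.mean(np.sum(Z1 * Z2, axis=1))
    contrast = np.mean(np.sum(Zn1 * Zn2, axis=1) ** 2)
    return attract + contrast

# aligned positives + orthogonal negatives score better than the reverse
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
P = np.stack([e1, e2])
Q = np.stack([e2, e1])
good = spectral_contrastive_loss(P, P, P, Q)   # positives aligned
bad = spectral_contrastive_loss(P, Q, P, P)    # positives misaligned
```

Minimizing this over θ is a sampled, parameterized version of the matrix-factorization objective above, which is why its minimizer recovers (a scaling of) the top eigenvectors of the population graph.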
we're going to have something we call the [01:21:34] spectral contrastive loss. So — [01:21:37] basically, actually, this step, you know, if you have all the setup, then this step [01:21:41] is pretty cheap. [01:21:45] So eventually, if you simplify this a little bit, what you want to get is that you get one term, which is something like −½ f_θ(x_i)ᵀ f_θ(x_j), [01:21:58] and this is something like [01:22:01] the term that tries to make two [01:22:04] examples closer to each other; and [01:22:06] there's another term that tries to [01:22:07] contrast them. [01:22:11] Anyway — I guess I'll just [01:22:13] refer you to our paper; I think [01:22:15] the title of the paper is just [01:22:17] something like "Provable [01:22:26] self-supervised learning [01:22:30] with ... spectral contrastive [01:22:33] loss" — [01:22:37] something like this. [01:22:40] I think you can search for it. [01:22:46] [Student question:] For the second [eigenvector] — just for this one, [01:22:50] looking at, like, [01:22:53] the [entry] [01:23:00] that, say,
corresponds to, like, the [01:23:03] first data point — like, what exactly is [01:23:07] that correspondence? [01:23:09] How come, if the first and the second [01:23:12] rows are similar, the first two [01:23:15] vertices should be similar? [01:23:21] I think I got the question. [01:23:27] So — [01:23:29] I understand why there's a lot [01:23:31] of confusion, because I skipped something [01:23:33] about the K — how you deal [01:23:35] with K clusters. But I think this can [01:23:37] be seen when you have only — [01:23:39] like, if you take the Laplacian with two clusters — you just [01:23:43] say there are two clusters, and then if you [01:23:45] look at the — [01:23:47] let's see, so where did we discuss this? I [01:23:50] think we discussed this [01:23:51] somewhat implicitly several times. [01:23:54] So, for example, suppose you go [01:23:56] back to here, [01:23:57] and recall that the second eigenvector [01:23:59] looks like (β₁, ..., β_n), [01:24:01] right. And we discussed that you take
at that threshold. [01:24:09] So in this case, suppose you have two clusters; then beta i is our representation of the i-th vertex. That's the deal, right: beta 1 is the first entry, beta 2 is the second entry, so beta i is the representation of the i-th vertex. And why is beta i better than the raw data? I think this is because, at least if you threshold the beta i, you get the clusters. So in some sense, suppose you do this thresholding and discard the rest of the model; then you can argue that [01:24:52] these numbers are better representations than the original data, because now you map all the vertices in the same group to one value. You lose all the other information; the representation exactly
tells you [01:25:06] the group membership, and you don't know anything else. The group membership is the only thing you care about, so that's why these numbers are more like better representations. This is kind of similar to the low-rank feature approximation. [01:25:29] So it's just a better representation because we've kept the most important part of the representation? Exactly. And what's the most important one? The most important one here, in this case, is the cluster structure: which group you belong to. So if you think that's the most important information, then your representation should just be that — you discard any other information and say the group ID is my representation, and that's the best representation
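The thresholding recipe just described can be sketched in a few lines. This is my own illustration, not code from the lecture; it assumes an unnormalized graph Laplacian and a toy graph made of two triangles joined by one weak edge, so the graph is connected but clearly two-clustered.

```python
import numpy as np

# Toy graph: two triangles (clusters {0,1,2} and {3,4,5}) joined by
# one weak edge, so the graph is connected but clearly two-clustered.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[2, 3] = A[3, 2] = 0.1          # weak cross-cluster edge

# Unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Eigenvector of the second-smallest eigenvalue; its entries play the
# role of "beta_1 ... beta_n" above, one number per vertex.
eigvals, eigvecs = np.linalg.eigh(L)   # eigh sorts eigenvalues ascending
beta = eigvecs[:, 1]

# Thresholding the entries at 0 recovers the group membership.
labels = (beta > 0).astype(int)
print(labels)   # one cluster gets 0s, the other 1s
```

Which cluster gets 0 and which gets 1 depends on the eigenvector's sign, which is arbitrary; only the partition itself is meaningful.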
in this case, where we said there are two clusters. [01:26:05] But then, if you care about the true cluster representation — there could also be, say, a three-cluster representation, and how close things are — then by taking multiple leading eigenvectors we can get a bigger picture than just "in this cluster or not"? Exactly, exactly. So if you have more eigenvectors, you can get three-cluster information or even more, and some of this information can be recombined — you can combine the information. Because eventually you probably use this representation [01:26:45] by putting a linear classifier, a linear head, on top of it; so if you have two types of information in your representation, you can actually combine them to get more information. Yes, you are right: basically you get more
because you get more fine-grained information from the extra eigenvectors. [01:27:05] Yeah, so essentially it's kind of like dimensionality reduction: you distill the information in the graph down to a smaller amount of information, and the question we're trying to answer is what information you keep. It's not surprising that the original numbers have more specific information about the graph; the question is what the eigenvectors keep, and the rough intuition is that they keep the clustering structure of the graph but not other things — the eigenvectors with the smallest eigenvalues are trying to keep the cluster structure. [01:27:47] Okay, great. I think this will be the end of the quarter. I hope you liked the course. We discussed quite a bunch of topics, and in this quarter I think we covered
the most compared to all the previous quarters — partly because we have ten more minutes in every lecture, and also two more lectures because we have fewer holidays this quarter. [01:28:14] Yeah, I hope you liked it. Thanks so much for coming.

================================================================================ LECTURE 020 ================================================================================
Stanford CS229M - Lecture 12: Non-convex optimization, Non-convex opt for PCA, matrix completion
Source: https://www.youtube.com/watch?v=EVyJkXOd5Xo
---
Transcript

[00:00:04] Okay, cool. Let's talk about the material today. Last time we talked about some of the bigger, conceptual questions in deep learning theory, and today we're going to start talking about the optimization perspective in deep learning, for maybe one or two lectures. And here I'm going to explain what the
optimization landscape means: it really means the surface of the loss function. [00:00:42] But you'll see. We're going to introduce some very basic notions from optimization, but the main focus is actually not how to design the algorithm; the focus is to analyze what the functions you're optimizing look like, so that you can use some standard optimization algorithm on them. So you don't need much background in optimization: you probably need to know what gradient descent is — I'm going to define it in a second — and you probably need to know what an optimization algorithm is, but there's no concrete requirement about the details, like what momentum looks like or what stochasticity exactly is; you don't necessarily have to know those.
[00:01:34] Okay, cool. So the bigger question we are trying to address — just to quickly review, to connect to the last lecture — is this: so many optimization algorithms are designed for convex functions, but why can they still work [00:02:04] for non-convex functions — and actually pretty well in practice, for the non-convex functions in deep learning? Note that it's not that these algorithms, like gradient descent or stochastic gradient descent, can work for all functions you may want to optimize; there are definitely many functions they cannot optimize, and there are such examples in many areas of research. But in machine learning, typically, people assume that you
can — people assume, and also somewhat observe, that you can optimize your function pretty well, even though the function is not convex. Of course, even in machine learning there are atypical cases, or outliers, whatever you call them: especially if the parametrization of your model is very complex or somewhat weird, you could face some difficulties. One simple example is that if you have a very deep network — a standard feed-forward deep network — then it's actually pretty hard to optimize, because sometimes you have vanishing gradients, sometimes exploding gradients, and so on and so forth. However, some of these problems are solved by changing the architecture, which changes the optimization landscape. [00:03:24] Anyway, the bottom line is that in most cases people observe that non-convex functions in machine learning can be
optimized pretty well [00:03:32] by gradient descent or stochastic gradient descent or their variants, and we are trying to understand why we can optimize them reasonably well. So that's the question. Maybe just before going into more detail, let's first quickly review what gradient descent is, just in case — this is very quick; I'm just going to define some notation here. Suppose G(theta) is the loss function. [00:04:04] I'm using G here just because I want to use a generic letter; L would probably be a better letter, but here I'm using something more generic. And the algorithm is just: theta_0 is some initialization, [00:04:23] and you have theta_{t+1} = theta_t - eta * grad G(theta_t).
This is gradient descent. [00:04:33] And you can have stochastic versions of it; many of you probably know them. I'm going to list a few facts, just to motivate the discussion here. So, when we're looking at non-convex functions — let me draw a non-convex function, something like this. The first fact — a fact, or really just an observation, so maybe let's call it an observation — [00:05:07] the first observation is that GD cannot always find the global minimum, even for continuous functions. This is kind of obvious, because it depends on where you initialize and on what the function looks like: for example, suppose you initialize here; gradient descent will go downhill, and maybe it overshoots a little bit and goes back, and so on and so forth, but then at the end
[00:05:37] of the day it converges to this local minimum. If your learning rate is somewhat small, it gets stuck at this local minimum and stays there. And even if you have stochasticity, if the stochasticity is not big enough you are not going to jump to another local minimum, or to the global minimum. So clearly you cannot hope that gradient descent works in the worst case, over all possible non-convex functions. [00:06:04] Observation two — actually, this is a theorem, not an observation: finding a global minimum of general non-convex functions is NP-hard. [00:06:35] I assume some of you have statistics backgrounds and are not exactly familiar with NP-hardness; it doesn't really matter — this is just saying that it's computationally
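The initialization dependence described above is easy to see numerically. This is my own minimal sketch, not an example from the lecture: G is a tilted double well, and the loop is exactly the board's update theta_{t+1} = theta_t - eta * grad G(theta_t).

```python
# G is a tilted double well: a global minimum near x = -1.04 and a
# strictly worse local minimum near x = +0.96.
def G(x):
    return (x**2 - 1)**2 + 0.3 * x

def grad_G(x):
    return 4 * x * (x**2 - 1) + 0.3

def gd(x0, eta=0.01, steps=5000):
    x = x0
    for _ in range(steps):
        x = x - eta * grad_G(x)   # theta_{t+1} = theta_t - eta * grad G(theta_t)
    return x

left, right = gd(-1.5), gd(+1.5)
print(left, G(left))    # converges near the global minimum
print(right, G(right))  # stuck at the higher local minimum
```

Starting on the left, GD finds the global minimum; starting on the right, it converges to the local minimum and stays there, just as in the picture on the board.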
intractable to find a global minimum. But just to clarify what that means: it means there exists a function that you cannot solve; it doesn't mean more than that. So basically this is only saying that you cannot, in polynomial time, with any algorithm, solve all possible functions in this set; it doesn't mean there is no subset of functions that you can easily solve. For example, the convex subset of functions can be solved in polynomial time. [00:07:23] Observation three — obviously, this is the opposite observation: gradient descent can solve convex functions, as I said. [00:07:38] And observation four is that the objectives in deep learning are non-convex. [00:07:53] This is probably not entirely trivial — it's almost trivial, but not entirely. It's
probably kind of clear that you cannot prove they are convex, but you probably need a little bit of calculation, or some kind of construction, to see that they are not convex. And generally they are not convex, just because there are so many nonlinearities, and most of the convex functions we know are somewhat simple functions — a linear function composed with a convex loss, for example — and as soon as you go beyond two layers, it's not convex. [00:08:35] Okay, so observation five — I think this is what I mentioned: gradient descent, or stochastic gradient descent, does work. It finds — let me be precise about this — an approximate, or even, sometimes you can claim, almost exactly a global minimum [00:09:05] of loss functions.
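The "little bit of calculation" mentioned above can be made concrete even for the smallest possible deep model. This sketch is my own, not from the lecture: for f(w1, w2) = (w1*w2 - 1)^2 — a depth-2 linear "network" with one scalar weight per layer — a midpoint-convexity check fails.

```python
# f is the loss of a depth-2 linear "network" with scalar weights.
def f(w1, w2):
    return (w1 * w2 - 1) ** 2

a = (2.0, 0.5)    # a global minimum: f(a) = 0
b = (0.5, 2.0)    # another global minimum: f(b) = 0
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# A convex f would satisfy f(mid) <= (f(a) + f(b)) / 2 = 0, but:
print(f(*mid))    # 0.31640625 > 0, so f is not convex
```

The same construction works for real deep networks: take two parameter settings that compute the same function (e.g. by permuting or rescaling hidden units) and check the loss at their midpoint.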
Of course this is not a 100% rigorous statement, because it depends on which loss function you're talking about, what dataset you have, and what architecture you have, and so forth. But I'm just saying that for most of the cases in deep learning, SGD or GD seems to work pretty well. And why do we say they arrive at the global minimum? You know that because the loss function is non-negative: suppose you run ImageNet, or some other optimization experiment; the loss function is always non-negative, so the global minimum has loss at least zero, and you can see how small a loss SGD or GD can get you. Often the loss is pretty small — something like 1e-2, or sometimes, depending on whether you use regularization, it could be 1e
-4 or 1e-5, depending on the situation. So you kind of believe that you get at least an approximate global minimum. [00:10:12] Cool. So what's going on here? It sounds like there are some positive results — empirical observations — and there are some negative results, about the NP-hardness, the intractability, of optimizing non-convex functions. The way to reconcile these is really just that the lower bound — the impossibility result — is about worst-case functions, and actually we are not optimizing worst-case functions. So, the picture in my mind is that you have the family of all functions; in this family there are functions that are super hard to solve — very hard functions — [00:10:52] and a subset of
functions called convex functions. These are easy to solve: gradient descent can solve them. But actually we haven't identified all the functions that we can solve; there are more functions than the convex ones that gradient descent or some other algorithms can solve — a slightly larger family in between — and today we are going to talk about this kind of function. Of course we cannot identify all the functions that are benign enough for us to solve, but we are going to identify a subset bigger than the convex subset: [00:11:36] these are non-convex, but they have benign properties, some nice properties, and the task is to figure out what properties make them somewhat nice
and easy to optimize. [00:11:54] All right, so here's our plan for this lecture, and maybe the first part of the next lecture. Step one: we're going to identify a large set of functions [00:12:21] that SGD or GD can solve, up to global optimality. And step two: we're going to prove that some of the loss functions [00:12:46] in machine learning problems belong to this set — the larger set of functions we just identified. Most of the effort will be spent on the second bullet. For the first bullet, you do need to show why you can actually solve this set of functions, but I'll just tell you what people can show along the lines of the first bullet, without telling you all the details. The results there are actually,
[00:13:27] The results there are actually, in some sense, kind of intuitive, but they do require a lot of background to discuss in detail; you need to know a lot about how to analyze these iterative optimization algorithms. That's why we don't focus on that; we mostly focus on the second part, which is more about the statistical properties of the functions used in machine learning. [00:13:54] Okay, cool. So the very basic idea is the following, and it's very simple. You know that gradient descent can find a local minimum. [00:14:15] This is somewhat easy to believe, but actually there is a caveat about it; I'll put a mark here just to remind you of the caveat. But roughly speaking, you can believe that gradient descent can find a local minimum.
[00:14:30] And suppose you know, in addition, that all local minima of f are also global. [00:14:41] Then these two together mean that GD can find a global minimum. [00:14:50] So basically, the set of functions that we're going to identify as solvable by GD and SGD is just the set of functions with the property that all local minima are also global minima. And then we need to show that the functions we're actually using in machine learning have this property. Of course, not all problems have this property; we're going to show some actually simple cases where you can prove it. [00:15:30] But as I mentioned, there's some caveat about whether you can even converge to a local minimum; this is actually somewhat nuanced.
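As a tiny sanity check of this two-step logic (the toy objective f(x) = (x² − 1)² and all constants below are my own choices, not from the lecture): every local minimum of this f, at x = +1 and x = −1, is also global, so plain gradient descent reaches a global minimum from essentially any starting point.

```python
def f(x):
    # toy objective: every local minimum (x = +1 or x = -1) is also global
    return (x * x - 1.0) ** 2

def grad_f(x):
    # derivative of (x^2 - 1)^2
    return 4.0 * x * (x * x - 1.0)

def gradient_descent(x0, lr=0.01, steps=5000):
    # plain full-batch gradient descent with a fixed learning rate
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# from different starts, GD lands at one of the two global minima
for x0 in (-2.0, -0.3, 0.5, 2.0):
    x = gradient_descent(x0)
    print(x0, "->", round(x, 4), "f =", round(f(x), 8))
```

Note that starting exactly at the stationary point x = 0 would leave GD stuck, which is exactly the saddle-point caveat discussed next.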
[00:15:39] And I want to be as clear as possible about that, so I'm going to formalize this notion of converging to a local minimum; I'm not going to prove any of the theorems here, though. So the next part is about convergence to local minima. [00:16:00] Let me start with some definitions to formalize it. Let f be twice differentiable, for simplicity; sometimes you can extend this to functions that are only once differentiable, but just for simplicity let's say f is twice differentiable. [00:16:19] And what's the definition of a local minimum? I guess you've all seen this in calculus class. x is a local min of the function f if there exists an open neighborhood, let's call it N, around x, such that in this neighborhood N the function values are at least f(x). [00:17:00] I'm using a lot of text here just to make it easier to understand.
[00:17:05] Alternatively I could use just math, but let me use the text: all the function values in the neighborhood N are at least f(x). So basically, f(x) is literally one of the minima in this neighborhood. That's the definition of a local min. [00:17:35] And I guess from calculus class you probably know that if x is a local min, then the gradient ∇f(x) is zero and the Hessian ∇²f(x) is PSD. [00:17:50] These are necessary conditions for being a local minimum, but not vice versa: it's not the case that if the gradient is zero and the Hessian is PSD, then you are a local minimum. Why? A simple example is easy to come up with; the one I'll give is maybe not the simplest, for a reason: I'm going to use it again.
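For reference, here is a short derivation of why these two conditions are necessary (standard calculus; the Taylor expansion written out below is mine, not from the board):

```latex
% Second-order Taylor expansion of f around x:
f(x+\delta) \;=\; f(x) \;+\; \nabla f(x)^\top \delta
\;+\; \tfrac{1}{2}\,\delta^\top \nabla^2 f(x)\,\delta \;+\; o(\|\delta\|^2).
% If \nabla f(x) \neq 0, then \delta = -\eta\,\nabla f(x) with small \eta > 0
% gives f(x+\delta) < f(x), so x is not a local minimum.
% If \delta^\top \nabla^2 f(x)\,\delta < 0 for some \delta, then for small
% t > 0 we get f(x + t\delta) < f(x).
% Hence at any local minimum: \nabla f(x) = 0 and \nabla^2 f(x) \succeq 0.
```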
[00:18:17] Suppose you have f(x₁, x₂) = x₁² + x₂³. [00:18:30] In this case, (x₁, x₂) = 0, the origin, satisfies that the gradient is zero, and it satisfies that the Hessian is PSD. Actually the Hessian is not zero; the Hessian is PSD because the Hessian is a matrix: in one direction it's 2, and everything else is zero, so it's PSD. [00:19:14] But it's actually not a local minimum, as you can see: if you change x₂ you can make the function value smaller in the neighborhood, because x₂³ is a cubic function, so you can always make it smaller than zero in a neighborhood of zero.
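A quick finite-difference check of this example (the function is the lecture's; the code and the step size h are mine):

```python
def f(x1, x2):
    # the lecture's counterexample: zero gradient and PSD Hessian at the
    # origin, yet the origin is NOT a local minimum
    return x1 ** 2 + x2 ** 3

h = 1e-5
# central-difference gradient at the origin
g1 = (f(h, 0) - f(-h, 0)) / (2 * h)
g2 = (f(0, h) - f(0, -h)) / (2 * h)
# diagonal Hessian entries at the origin
h11 = (f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h ** 2
h22 = (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h ** 2

print("grad ~", (g1, g2))             # approximately (0, 0)
print("Hessian diag ~", (h11, h22))   # approximately (2, 0): PSD but singular
# still not a local min: nearby points (0, -t) have smaller value
print(f(0, -1e-3) < f(0, 0))          # True
```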
[00:19:33] Okay. So basically from this example you can kind of see what happens. [00:19:46] Why is this a problem? Fundamentally, the problem we see here is that the gradient ∇f(x) is zero, and the Hessian is not strictly positive definite; it's just positive semidefinite. Suppose the Hessian vanishes in some direction. [00:20:12] If the Hessian vanishes in some direction, that means in that direction you have no first-order gradient and no second-order curvature; you're pretty flat. And that makes it tricky, because then the higher-order derivatives start to matter.
[00:20:35] If your second-order derivative is actually non-zero, so the function is curved, then the third-order derivative is always dominated by the second-order derivative as long as your neighborhood is small enough. But if your second-order derivative is literally zero in some direction, then the third-order derivative starts to matter. [00:20:48] That's why being a local minimum is not always a property of only the first- and second-order derivatives. [00:20:56] And once it becomes about the third- or fourth-order derivatives, things become much more complicated. Actually, if you look at the hard instances in the NP-hardness, or intractability, results about optimization, all the hard cases happen when you have to deal with higher-order derivatives, like fourth-order derivatives.
[00:21:28] I know this is probably not making that much sense if you're not familiar with how you prove NP-hardness, but basically you can embed hard instances of, say, some kind of SAT instance into the fourth-order derivatives, so that knowing whether the fourth-order term is a positive operator is equivalent to solving the SAT problem. Anyway, this is only for those people who know a little bit about computational intractability results. [00:22:02] But the intuition is that higher-order derivatives are just hard to deal with, especially fourth order and higher. [00:22:12] And there is a theorem, which is the following: verifying if x is a local minimum of f, without any assumptions on f, is actually NP-hard.
[00:22:33] So that finding a local minimum is also hard. [00:22:44] I've told you that finding global minima is in general hard, but actually finding a local minimum is also NP-hard. [00:22:54] And this is the caveat I was referring to, because in most cases, if you talk to someone, if you talk to me, about research in machine learning, we'll think of finding a local minimum as easy in general. That's the right intuition, but it's not exactly true: you have to consider these kinds of pathological cases, which make things harder. [00:23:16] So how do we proceed? If finding a local minimum is hard, then this plan doesn't work. The way to go beyond it is that there is a way to rule out some of the pathological cases as well, so that you can find a local minimum in polynomial time.
find a local [00:23:34] that you can uh you can uh find a local minimum in polynomial type and then we [00:23:37] minimum in polynomial type and then we can exclude our plan right so this is uh [00:23:39] can exclude our plan right so this is uh What uh what will happen so [00:23:42] What uh what will happen so [Music] [00:23:42] [Music] um [00:23:44] um so [00:23:48] so here is a a condition uh constantial [00:23:52] so here is a a condition uh constantial condition and if you satisfy the [00:23:54] condition and if you satisfy the condition then you remove those [00:23:55] condition then you remove those pathological uh cases which requires [00:23:58] pathological uh cases which requires higher order derivatives [00:23:59] higher order derivatives so [00:24:01] so sorry [00:24:13] thank you [00:24:17] exciting condition so and let me [00:24:23] exciting condition so and let me so in some sense I guess you know I'm [00:24:25] so in some sense I guess you know I'm not sure whether this makes sense before [00:24:26] not sure whether this makes sense before I Define it but generally you are [00:24:28] I Define it but generally you are basically saying that you want to rule a [00:24:30] basically saying that you want to rule a lot you want to say that you assume your [00:24:31] lot you want to say that you assume your function doesn't have this kind of like [00:24:34] function doesn't have this kind of like somewhat subtle possible candidate of [00:24:36] somewhat subtle possible candidate of local mineral right so so every Point [00:24:39] local mineral right so so every Point why there's a local meal or not can be [00:24:41] why there's a local meal or not can be tell can be told from examine only the [00:24:45] tell can be told from examine only the first algorithm and the second order [00:24:46] first algorithm and the second order gradient second order derivatives so [00:24:49] gradient second order derivatives so there's no subtle cases you know in your 
[00:24:51] there's no subtle cases you know in your function right so how do we formalize [00:24:54] function right so how do we formalize this [00:24:55] this uh this is strict [00:24:58] uh this is strict schedule [00:25:01] um [00:25:02] um I guess uh the the paper to side is the [00:25:05] I guess uh the the paper to side is the title by the way I think I I wrote a a [00:25:08] title by the way I think I I wrote a a book chapter about this kind of [00:25:09] book chapter about this kind of optimization thing [00:25:11] optimization thing um [00:25:12] um um for a book so I can send that to the [00:25:14] um for a book so I can send that to the the people the person who take the scrap [00:25:18] the people the person who take the scrap note and and that probably help you to [00:25:20] note and and that probably help you to uh to have some references [00:25:23] uh to have some references um but the the materials are not exactly [00:25:24] um but the the materials are not exactly the same as the book so so you you still [00:25:26] the same as the book so so you you still have to do uh 100 subscribe you know [00:25:29] have to do uh 100 subscribe you know um kind of from scratch in some sense [00:25:31] um kind of from scratch in some sense okay uh okay cool so the definition of [00:25:34] okay uh okay cool so the definition of strict cycle you know this is I'm citing [00:25:36] strict cycle you know this is I'm citing this paper just because [00:25:38] this paper just because um it's not like uh uh every paper use [00:25:41] um it's not like uh uh every paper use exactly the same definition the very [00:25:42] exactly the same definition the very very original paper that introduced this [00:25:45] very original paper that introduced this term under this notion it's called Uh uh [00:25:48] term under this notion it's called Uh uh is uh by wrong and uh ital so uh in 15 [00:25:53] is uh by wrong and uh ital so uh in 15 so so that paper has a slightly [00:25:55] so so 
[00:25:57] But I think this one, the definition I'll tell you, is a little easier to use for downstream research, so I think people have somewhat converged to it. [00:26:12] So here's the definition. We say f is (α, β, γ)-strict saddle if every x in ℝᵈ satisfies one of the following. [00:26:46] The first one: ‖∇f(x)‖₂ > α. Of course some of the x's satisfy this: the gradient is big, so these points are not stationary points, and they are not local minima. [00:27:05] By the way, by stationary point I mean first-order stationary point, meaning those points where the gradient is zero. So if you satisfy number one here, you cannot be a local minimum and you cannot be a stationary point.
[00:27:18] And α, β, γ here are all positive numbers. [00:27:21] The second condition is that λ_min(∇²f(x)) < −β. If you satisfy this, you cannot be a local minimum, because your Hessian is not positive semidefinite. [00:27:39] And in some sense you can think of α, β, and γ as something super small, even close to zero; we just require them to be strictly bigger than zero for technical purposes. [00:27:50] And the third possibility is that x is γ-close to a local minimum, let's call it x*, in Euclidean distance. [00:28:13] The exact distance metric here is probably not entirely important.
We are not going to be very quantitative about this; everything is polynomial, so it's not that important. [00:28:24] So basically, number one rules out some kinds of candidate local minima, and number two rules out some other kinds. But note that one and two don't rule out all the possibilities, because there are points which have zero gradient and a positive semidefinite Hessian but are still not local minima. So one and two don't tell you exactly whether a point is a local minimum per se. If you don't have this assumption, you can have a point that does not satisfy one or two, for example a point whose gradient is zero and whose Hessian is PSD; it doesn't satisfy one or two, but it could still fail to be a local minimum. That's the pathological case. And this definition is basically saying that if a point doesn't satisfy one or two, then there is no such pathological case: it has to be close to a real local minimum.
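To make the trichotomy concrete, here is a sketch on a hand-picked one-dimensional example (the function f(x) = x⁴/4 − x²/2 and the constants α, β, γ below are my own illustrative choices, not from the lecture): its critical points are x = 0, where f″(0) = −1, so case 2 fires, and x = ±1, which are genuine local minima; every point falls into at least one of the three cases.

```python
def grad(x):
    return x ** 3 - x          # f(x) = x^4/4 - x^2/2

def hess(x):
    return 3 * x ** 2 - 1

ALPHA, BETA, GAMMA = 0.05, 0.05, 0.5   # illustrative constants (mine)
LOCAL_MINIMA = (-1.0, 1.0)             # the true local minima of this f

def strict_saddle_case(x):
    """Return which case(s) of the (alpha, beta, gamma)-strict-saddle
    trichotomy the point x satisfies (empty list = condition violated)."""
    cases = []
    if abs(grad(x)) > ALPHA:
        cases.append(1)                # case 1: large gradient
    if hess(x) < -BETA:
        cases.append(2)                # case 2: big negative curvature
    if min(abs(x - m) for m in LOCAL_MINIMA) < GAMMA:
        cases.append(3)                # case 3: gamma-close to a local min
    return cases

# every point on a grid over [-3, 3] satisfies at least one case
xs = [i / 100 for i in range(-300, 301)]
assert all(strict_saddle_case(x) for x in xs)
print(strict_saddle_case(0.0))   # [2]: a strict saddle, flagged by curvature
print(strict_saddle_case(0.9))   # [1, 3]: near the minimum at x = 1
```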
then [00:29:16] if you don't satisfy one two then there's no this pathological case you [00:29:18] there's no this pathological case you have to be close to a real local minimum [00:29:20] have to be close to a real local minimum right so uh that's what the student set [00:29:23] right so uh that's what the student set of condition you say [00:29:27] um maybe let me take pause for a moment [00:29:29] um maybe let me take pause for a moment see whether there's any questions [00:29:33] see whether there's any questions um actually I'm looking at the query [00:29:34] um actually I'm looking at the query border I know it's oh it's empty but if [00:29:38] border I know it's oh it's empty but if you have any questions feel free to ask [00:29:40] you have any questions feel free to ask um [00:29:43] it's often better or positive yes that's [00:29:46] it's often better or positive yes that's right so uh this definition only makes [00:29:48] right so uh this definition only makes sense when alpha beta and gamma positive [00:29:51] sense when alpha beta and gamma positive um I think [00:29:53] um I think um [00:29:54] um right yes exactly [00:29:56] right yes exactly so R5 beta [00:30:08] cool [00:30:13] any other questions [00:30:19] uh the third string set of condition [00:30:21] uh the third string set of condition sounds hard to check yes that's that's a [00:30:24] sounds hard to check yes that's that's a that's a good point so you cannot check [00:30:25] that's a good point so you cannot check there's no way you can check whether uh [00:30:28] there's no way you can check whether uh empirically [00:30:30] empirically uh your function is [00:30:33] uh your function is um [00:30:34] um satisfies this condition [00:30:36] satisfies this condition so [00:30:37] so [Music] [00:30:37] [Music] um [00:30:38] um I think it's even [00:30:41] I think it's even I'm not 100 sure about this but I think [00:30:43] I'm not 100 sure about this but I think you can prove that if you're 
an arbitrary differentiable function, you shouldn't be able to check whether it satisfies strict saddle; it should be as hard as finding a local minimum, in some sense. So this condition is not something that you're supposed to check numerically. It's something that you're supposed to prove theoretically, in some sense, if you can; of course, in many cases you cannot, like nobody can. But the condition itself is not meant for numerical checking. That's a good question. [00:31:33] Okay, cool. And by the way, always feel free to ask questions, even as I'm speaking. [00:31:42] Okay, cool. So let's see. We have the condition; now, what can you do with it?
[00:32:02] So here's the theorem. The theorem is somewhat informal, just because, well, it's actually pretty formal, in the sense that all the bounds are correct; it's just that I won't specify some of the details exactly. [00:32:19] So suppose f is (α, β, γ)-strict saddle. Then many optimizers, for example GD and SGD if you use the learning rate correctly, and many other algorithms, like cubic regularization, can converge to a local min with ε error in Euclidean distance, in time poly(d, 1/α, 1/β, 1/γ, 1/ε), where d is the dimension. [00:33:43] So this theorem is very coarse-grained. Of course, different optimizers have different convergence rates.
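A small numerical illustration of the theorem's flavor (the toy function, step size, and noise scale are all my own choices; this is not the lecture's proof): for f(x₁, x₂) = x₁⁴/4 − x₁²/2 + x₂²/2 the origin is a strict saddle point with Hessian diag(−1, 1), and gradient descent with a tiny random perturbation escapes it and reaches one of the local minima (±1, 0), which here are also global.

```python
import random

def grad(x1, x2):
    # f(x1, x2) = x1^4/4 - x1^2/2 + x2^2/2
    # at the origin: zero gradient, Hessian diag(-1, 1) -> a strict saddle
    return (x1 ** 3 - x1, x2)

def perturbed_gd(x1, x2, lr=0.05, steps=2000, noise=1e-3, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        # tiny injected noise lets GD leave the saddle's unstable direction
        x1 -= lr * g1 + noise * rng.gauss(0, 1)
        x2 -= lr * g2 + noise * rng.gauss(0, 1)
    return x1, x2

# start exactly at the saddle; plain noiseless GD would stay there forever
x1, x2 = perturbed_gd(0.0, 0.0)
print(round(x1, 3), round(x2, 3))   # close to (+1, 0) or (-1, 0)
```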
[00:33:56] But for the purpose of this course and this lecture, we are not interested in which one is faster; we are mostly interested in polynomial time versus exponential time. And the point is that if you have the strict saddle condition — if you don't have those pathological local-minimum cases — then you can converge to a local minimum, in Euclidean distance, in polynomial time.
[00:34:23] All right. By the way, just to explain the name "strict saddle": the pathological case is a saddle point. You have cases where the gradient is zero and the Hessian is PSD but not strictly positive definite, so there is some direction where you potentially have negative curvature, or a flat curvature but potentially bad third-order derivatives — and these are, in some sense, saddle points. So that explains the name: the condition is saying that if you are at a saddle point, you can tell that it's a saddle point from the negative curvature.
[00:35:14] Question: what is the third optimizer inside the parentheses? That's a good question — that's called cubic regularization. There are so many others; cubic regularization is one of the early ones, from work in 2006 by Nesterov. But there are many other optimizers — I published a paper on this, and many other people have published papers on this. I think I can add more references in the final scribe notes to cite some of the recent work.
[00:35:50] OK, cool. All right, so now we can converge to a local minimum with this condition. Now suppose you make the additional assumption that all the local minima are global — then we are good. So basically the next theorem is trying to say: if all local minima are global and you have the strict saddle condition, then this means optimizers can converge to a global minimum.
[00:36:47] So here is the theorem; let's formalize this. I'm writing it in a slightly different way — in some sense I unpack it a little bit, because I thought this either provides a slightly different way of thinking about it, or is just more explicit. Basically, you assume the strict saddle condition, but let's rephrase the "all local minima are global, plus strict saddle" characterization like this. You say that there exist epsilon_0, tau_0, and c such that if x in R^d satisfies "the gradient is small", ||grad f(x)|| <= epsilon <= epsilon_0, and the Hessian is larger than minus tau_0, meaning the Hessian of f at x is at least -tau_0 times the identity. So what do these two conditions mean? They are saying that you are somewhat of an approximate local minimum, right? You have already passed the sanity checks for a local minimum. Of course, you cannot rule out the pathological cases, but at least you pass the first-order condition — the gradient is small — and you somewhat pass the second-order condition approximately, because the Hessian is somewhat big: it is almost no smaller than zero. And then this is saying: suppose you pass these two conditions; then x is epsilon-to-the-power-c close to a global minimum. The power c is just to relax the condition, so that you can have, for example, square-root-of-epsilon close, or something like that — so then x is actually epsilon^c close to a global minimum of the function f. So this condition is just a slightly different way of saying that all local minima are global and you have the strict saddle condition.
[00:38:59] All right. And then, under this condition, the same set of optimizers that can converge to a local minimum — many optimizers — can converge to a global minimum of f, up to, say, delta error in Euclidean distance, in time poly(1/delta, 1/tau_0, d).
[00:39:59] It's not stated exactly the same way as we did for strict saddle, but if you think about it, it's basically the same statement. OK, anyway. Cool.
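Again, to pin the statement down, here is a rough LaTeX restatement of the assumption and the theorem (wording mine, not verbatim from the board):

```latex
% Quantitative "all local minima are global + strict saddle" (wording mine).
\begin{assumption}
There exist $\epsilon_0, \tau_0, c > 0$ such that every $x \in \mathbb{R}^d$ with
\[
  \|\nabla f(x)\|_2 \le \epsilon \le \epsilon_0
  \quad\text{and}\quad
  \nabla^2 f(x) \succeq -\tau_0 I
\]
is $\epsilon^c$-close in Euclidean distance to a global minimum of $f$.
\end{assumption}

\begin{theorem}[informal]
Under this assumption, the same optimizers converge to a global minimum of $f$
up to $\delta$ error in Euclidean distance, in time
$\mathrm{poly}(1/\delta,\, 1/\tau_0,\, d)$.
\end{theorem}
```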
[00:40:13] So we are basically done with the first part — identifying the subset of functions that are easy to optimize, and these are the "all local minima are global" functions. Next, we are going to show some examples where this kind of property can be proved rigorously in machine learning situations. These examples are pretty simple — they are not deep learning — but, you know, they are still roughly the best people can do, in some sense. So this is just to give some examples for which this kind of property can hold.
[00:40:56] OK, so next we have two examples. The first one is PCA, or matrix factorization — and fundamentally this is more or less the same as the linearized neural network case, even though if you do a non-linear network there is a little bit more to do beyond that. And the second example I'm going to give is matrix completion. This is an important machine learning problem by itself as well: before deep learning this was one of the most important topics in machine learning, especially if you think about the non-linear cases, and I think it's still used in recommendation systems. So we're going to talk about that.
[00:41:53] OK, cool — any questions so far? ... I guess let's talk about PCA first. Or maybe, more precisely, matrix factorization: we assume we are given a matrix M of dimension d by d, and — let's talk about the rank-one case — we want to find the best rank-one approximation of the matrix.
[00:42:46] I think from other courses you probably know that the best rank-one approximation is basically given by the eigendecomposition, or the singular value decomposition, of the matrix. Here, for simplicity, let's also assume the matrix M is symmetric — you can also assume it is PSD. In this case the best rank-one approximation is basically the top eigenvector times the eigenvector transposed, up to some scaling. So this is not a hard problem: you can just run any eigenvector solver to find the eigenvector, scale it properly, and you get the best rank-one approximation.
[00:43:32] But for the purpose of this class, we are interested in the non-convex objective g(x) = ||M - x xᵀ||_F², which is literally interpreting this in the most straightforward way. You say: I'm literally finding the best rank-one approximation; I know the best rank-one approximation should be symmetric, so I'm trying to find a vector x such that ||M - x xᵀ|| in Frobenius norm is smallest. You are approximating the matrix M with the matrix x xᵀ — x xᵀ is the rank-one approximation — and you measure the error in Frobenius norm. This becomes a non-convex objective function because you have a quadratic term x xᵀ, and then you take the square of that, so it becomes a degree-four polynomial, and it's non-convex.
[00:44:38] So our goal is to show that, even though it's not convex, all local minima of this g are global, under the assumptions we have mentioned — rank one, PSD, and so forth.
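As a quick numerical sketch (my own code, not from the lecture), here is plain gradient descent on this objective for a rank-one PSD matrix M; the dimension d = 5, the step size, and the iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-one PSD matrix M = lam1 * v1 v1^T, so the global optimum of
# g(x) = ||M - x x^T||_F^2 achieves value 0 at x = +/- sqrt(lam1) v1.
d = 5
v1 = rng.standard_normal(d)
v1 /= np.linalg.norm(v1)
lam1 = 3.0
M = lam1 * np.outer(v1, v1)

def g(x):
    return np.linalg.norm(M - np.outer(x, x), "fro") ** 2

def grad_g(x):
    return -4.0 * (M - np.outer(x, x)) @ x  # gradient of g

# Plain gradient descent from a random start.
x = rng.standard_normal(d)
for _ in range(2000):
    x -= 0.01 * grad_g(x)

print(g(x))  # essentially zero: we reached a global minimum
```

From a generic initialization, gradient descent ends up at one of the two global minima rather than at the saddle at the origin, which is the behavior the strict saddle theorem predicts.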
this [00:45:25] the rank the one dimensional case this function looks like this so one [00:45:27] function looks like this so one dimensional so D is one [00:45:30] dimensional so D is one so D is one then you just have a scalar [00:45:34] so D is one then you just have a scalar and minus a x square square this is a [00:45:37] and minus a x square square this is a function G of X and you plot this [00:45:39] function G of X and you plot this function this function looks like this [00:45:43] function this function looks like this and there are two local minimum and they [00:45:45] and there are two local minimum and they are both Global minimum because there's [00:45:46] are both Global minimum because there's some symmetry here and if you have like [00:45:48] some symmetry here and if you have like a higher Dimension noise becomes a [00:45:49] a higher Dimension noise becomes a little bit more complicated because you [00:45:51] little bit more complicated because you have actually not only necessarily you [00:45:53] have actually not only necessarily you know in some cases not only necessarily [00:45:55] know in some cases not only necessarily one like um not necessarily only like [00:45:58] one like um not necessarily only like two local minimum or I guess if it's [00:46:02] two local minimum or I guess if it's rank one there are only two on local [00:46:03] rank one there are only two on local minimum and but it looks more [00:46:05] minimum and but it looks more complicated [00:46:08] complicated um but generally you have some kind of [00:46:09] um but generally you have some kind of rotational kind of symmetry here to make [00:46:11] rotational kind of symmetry here to make this happens [00:46:12] this happens okay so [00:46:14] okay so um let's talk about the proof right how [00:46:16] um let's talk about the proof right how do we prove this so [00:46:19] do we prove this so um as you can imagine the proof is [00:46:20] um as you can imagine the proof is 
[00:46:14] So let's talk about the proof — how do we prove this? As you can imagine, the proof is pretty simple; the plan is very simple. You first find all the first-order stationary points, then you find all the local minima among them, and you prove that they are all global minima. So basically we more or less solve all the equations and see what the possible local minima can be.
[00:46:54] Let's first use the stationary-point condition — the gradient condition, grad g(x) = 0. What is grad g(x)? I'm not going to give a detailed derivation here, but believe me, it equals -4(M - x xᵀ)x. I think this is actually a question in homework zero — maybe question two or question three — about how to compute a gradient.
[00:47:24] So you have the gradient, and let's write out what setting it to zero means: it means M x = ||x||₂² x, because the last two factors, xᵀ x, together become the squared two-norm of x, which is a scalar, so you can switch the order and move it to the front. So ||x||₂² is a scalar, and M x is a matrix-vector multiplication — basically this is saying that x is an eigenvector of M, and ||x||² is the corresponding eigenvalue. Maybe the way you want to think about this is that you first find the unit eigenvectors — an eigenvector doesn't have a scale, right — so first find the unit eigenvectors.
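A quick finite-difference sanity check of that gradient formula (my own code; the homework-zero derivation itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-difference check of the gradient formula
#   grad g(x) = -4 (M - x x^T) x   for g(x) = ||M - x x^T||_F^2,
# with M symmetric.
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T  # symmetric PSD

def g(x):
    return np.linalg.norm(M - np.outer(x, x), "fro") ** 2

def grad_g(x):
    return -4.0 * (M - np.outer(x, x)) @ x

x = rng.standard_normal(d)
h = 1e-6
fd = np.array([(g(x + h * e) - g(x - h * e)) / (2 * h) for e in np.eye(d)])
print(np.max(np.abs(fd - grad_g(x))))  # tiny: the two gradients agree
```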
[00:48:36] Call them v₁ up to v_d. Let me be specific — this part is just for intuition: suppose the eigenvalues are distinct, even though we don't have to assume this. Then you have unique unit eigenvectors v₁ up to v_d, and you have λ₁ up to λ_d, the eigenvalues. Then basically all the stationary points are of the form x = ±√λᵢ times the eigenvector vᵢ (together with x = 0), because if you match ||x||², you get λᵢ, and that is the corresponding eigenvalue. So these are all the first-order stationary points of this problem. Now let's look at which of these is a local minimum, and then we'll say all the local minima are global.
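And a numerical check (my own code) that these candidates are indeed first-order stationary points:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check that x = +/- sqrt(lam_i) v_i are first-order stationary points of
# g(x) = ||M - x x^T||_F^2, i.e. the gradient -4 (M - x x^T) x vanishes.
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T                  # symmetric PSD, so all lam_i >= 0
lams, V = np.linalg.eigh(M)  # ascending eigenvalues, unit eigenvector columns

def grad_g(x):
    return -4.0 * (M - np.outer(x, x)) @ x

for lam, v in zip(lams, V.T):
    for sign in (1.0, -1.0):
        x = sign * np.sqrt(lam) * v
        assert np.linalg.norm(grad_g(x)) < 1e-6
print("every +/- sqrt(lam_i) v_i is a stationary point (and so is x = 0)")
```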
[00:49:49] Ideally, we just want to say that only the v₁ ones — ±√λ₁ v₁ — are local minima, because then they are also global minima: √λ₁ v₁ is the global minimum. OK, so how do we do this? Also, we don't necessarily want to assume all the eigenvalues are distinct, so there's a small thing to be done regarding that as well.
[00:50:16] So let's compute the Hessian — classic: we need to use the Hessian. I think this is actually a typical question I get when people start to think about this kind of optimization problem: how do you find the Hessian, how do you even write it down? Sometimes the Hessian can be very hard to write down. Here it's actually not that hard, because the Hessian has dimension d by d, since you have d parameters — but sometimes your parameter is a matrix, and then your Hessian becomes a fourth-order tensor, and it's very complex even just to write down. So here is a very useful trick, which actually also has some fundamental reasons for being useful: instead of the Hessian itself, look at the quadratic form of the Hessian — vᵀ ∇²g(x) v, the inner product of v with the Hessian applied to v. This quadratic form is much easier to compute.
[00:51:32] Why is it much easier to compute? The methodology is the following — it's in some sense in the solution of that homework that asks you to find the gradient of this function; I guess homework zero has this question. The same methodology also applies here for the Hessian. I'm not going to go through all the details, but roughly speaking, what you do is the following. You consider g(x + ε), and you Taylor-expand it: whatever this g is — it needs to have an analytical form, maybe it's a composition of several functions — you just iteratively expand it into something like g(x) plus some linear term in ε, plus some quadratic term in ε, plus higher-order terms. And then, if you have this, the quadratic term basically corresponds to the Hessian: if you replace ε by v, you get vᵀ ∇²g(x) v, up to the factor of one half in the Taylor expansion.
[00:53:27] I'm not sure whether this is too abstract when I say it like this — if you didn't get exactly what I mean, just look at the homework-zero solutions; it's basically doing this. So this is a very simple way to compute the Hessian, or rather the quadratic form of the Hessian, without writing the Hessian out as a complicated matrix or tensor. If you apply this kind of technique, you get the quadratic form of the Hessian:

vᵀ ∇²g(x) v = 2 ||x vᵀ + v xᵀ||_F² − 4 vᵀ(M − x xᵀ)v = 4 ||x||² ||v||² + 8 ⟨x, v⟩² − 4 vᵀ M v.
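The Taylor trick can be checked numerically. Here (my own sketch, and note the closed-form quadratic form below is my reconstruction of the board, not a verbatim transcription) I compare a finite-difference second derivative of t ↦ g(x + t v) with the expanded formula:

```python
import numpy as np

rng = np.random.default_rng(3)

# Taylor trick: v^T H v is the second derivative of t -> g(x + t v) at t = 0.
# Compare a finite-difference second derivative against the expanded formula
#   v^T H v = 4 ||x||^2 ||v||^2 + 8 <x, v>^2 - 4 v^T M v
# (my reconstruction of the blackboard expression).
d = 4
A = rng.standard_normal((d, d))
M = A @ A.T

def g(x):
    return np.linalg.norm(M - np.outer(x, x), "fro") ** 2

def quad_form(x, v):
    return 4 * (x @ x) * (v @ v) + 8 * (x @ v) ** 2 - 4 * (v @ M @ v)

x, v = rng.standard_normal(d), rng.standard_normal(d)
h = 1e-3
fd = (g(x + h * v) - 2 * g(x) + g(x - h * v)) / h**2
print(abs(fd - quad_form(x, v)))  # small: the formula matches
```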
[00:54:24] writing the quadratic form as an analytical formula of X and n and V so [00:54:27] analytical formula of X and n and V so and so forth right so [00:54:29] and so forth right so um you can in this case you can still [00:54:32] um you can in this case you can still from this quadratic form you can figure [00:54:33] from this quadratic form you can figure out what the corresponding Matrix is [00:54:35] out what the corresponding Matrix is right you can write it as images [00:54:36] right you can write it as images modification that's okay [00:54:39] modification that's okay um but for many other cases actually [00:54:40] um but for many other cases actually it's very hard to write out that Matrix [00:54:43] it's very hard to write out that Matrix of the hessing right so so the quadratic [00:54:45] of the hessing right so so the quadratic form is just some analytical formula and [00:54:49] form is just some analytical formula and and as you will see actually the only [00:54:50] and as you will see actually the only thing that matters is the quadratic form [00:54:52] thing that matters is the quadratic form because anyway even you are giving a for [00:54:54] because anyway even you are giving a for example a Hassan which is kind of [00:54:56] example a Hassan which is kind of complex you know there's not much things [00:54:58] complex you know there's not much things you can do with that right so it pretty [00:55:00] you can do with that right so it pretty much you're still doing looking at the [00:55:02] much you're still doing looking at the different specific quadratic form [00:55:05] different specific quadratic form um anyway okay so so we have the we have [00:55:08] um anyway okay so so we have the we have this uh quadratic form and we know that [00:55:10] this uh quadratic form and we know that the hessing is learning zero is in [00:55:13] the hessing is learning zero is in equivalent to that for every V the [00:55:15] equivalent to that for every V 
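The recipe above can be sanity-checked numerically. Here is a minimal sketch, under the assumption that the objective is normalized as f(x) = ½‖M − xxᵀ‖²_F (the constant is my choice; a different normalization only rescales the quadratic form by a positive factor). It compares the analytic quadratic form against a finite-difference estimate built purely from function values, so no Hessian matrix is ever formed.

```python
import numpy as np

def f(x, M):
    """Rank-one PCA objective, normalized here as f(x) = 1/2 ||M - x x^T||_F^2."""
    R = M - np.outer(x, x)
    return 0.5 * np.sum(R * R)

def quad_form(x, M, v):
    """Analytic v^T (grad^2 f(x)) v, read off the eps^2 term of f(x + eps v):
       v^T H v = 2 * ( 2 (x^T v)^2 - v^T M v + ||x||^2 ||v||^2 )."""
    return 2.0 * (2.0 * (x @ v) ** 2 - v @ M @ v + (x @ x) * (v @ v))

def quad_form_fd(x, M, v, eps=1e-4):
    """Finite-difference estimate of v^T H v from function values alone."""
    return (f(x + eps * v, M) - 2.0 * f(x, M) + f(x - eps * v, M)) / eps**2

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T                                # symmetric PSD test matrix
x, v = rng.standard_normal(5), rng.standard_normal(5)
print(quad_form(x, M, v), quad_form_fd(x, M, v))   # the two agree closely
```

The centered second difference (f(x+εv) − 2f(x) + f(x−εv))/ε² is exactly the "ε² coefficient" probe described above.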
the quadratic form of the hessing is [00:55:19] quadratic form of the hessing is learning zero okay so so here's what I [00:55:22] learning zero okay so so here's what I mean why you only care about the [00:55:23] mean why you only care about the quadratic form this is just because you [00:55:25] quadratic form this is just because you only care about you basically if you [00:55:27] only care about you basically if you plug in different V's you get the same [00:55:29] plug in different V's you get the same thing [00:55:30] thing and and by which we will plug in shall [00:55:33] and and by which we will plug in shall we plug in all of these or shall we just [00:55:34] we plug in all of these or shall we just use some specific bees it turns out in [00:55:37] use some specific bees it turns out in many cases you only care about a few [00:55:39] many cases you only care about a few special bees because some of the bees [00:55:41] special bees because some of the bees are more much more informative than the [00:55:42] are more much more informative than the others and and which so so you want to [00:55:45] others and and which so so you want to choose some informative fees to to to [00:55:47] choose some informative fees to to to evaluate this formula so that you get [00:55:49] evaluate this formula so that you get some info important information about [00:55:51] some info important information about what x can be right because at the end [00:55:53] what x can be right because at the end of the you care about what x is because [00:55:55] of the you care about what x is because you're you're using this to pin down [00:55:57] you're you're using this to pin down what are the local minimum [00:55:58] what are the local minimum so [00:56:00] so um so what are the informative views so [00:56:02] um so what are the informative views so it turns out that you know the v's that [00:56:04] it turns out that you know the v's that are informative here is the the top item [00:56:07] 
are informative here is the the top item vector [00:56:08] vector um how do you know this you know it [00:56:10] um how do you know this you know it requires some intuition it requires some [00:56:11] requires some intuition it requires some trials and hours so and so forth [00:56:14] trials and hours so and so forth um but but I guess you know it also [00:56:16] um but but I guess you know it also probably makes sense because the top [00:56:17] probably makes sense because the top economic direction is the global minimum [00:56:19] economic direction is the global minimum right so you somehow [00:56:21] right so you somehow you you try whether you can move in the [00:56:23] you you try whether you can move in the direction of the global window to see [00:56:25] direction of the global window to see whether your function value can increase [00:56:27] whether your function value can increase in a second out of sense uh and and to [00:56:31] in a second out of sense uh and and to some extent is intuitive to sometimes [00:56:32] some extent is intuitive to sometimes extend it's just the trials and errors [00:56:34] extend it's just the trials and errors so but anyway so V is equal to V1 is a [00:56:37] so but anyway so V is equal to V1 is a good choice because if you plug it in [00:56:39] good choice because if you plug it in you get V1 times the housing of x times [00:56:43] you get V1 times the housing of x times V1 and you plug into this formula you [00:56:46] V1 and you plug into this formula you get [00:56:47] get 2 times x V1 squared minus V1 transpose [00:56:51] 2 times x V1 squared minus V1 transpose mv1 [00:56:52] mv1 plus X2 Norm squared and you say this is [00:56:56] plus X2 Norm squared and you say this is larger than zero and you can probably [00:56:57] larger than zero and you can probably see why this is informative because this [00:56:59] see why this is informative because this term is negative right so it makes the [00:57:01] term is negative right 
the hardest test, in some sense, because the negative term is maximized: v₁ᵀMv₁ = λ₁ is the largest value vᵀMv can take over unit vectors. So now let's look at what we can get from this inequality.

First, realize that we don't care about the Hessian at every point; we only care about the Hessian at first-order stationary points, because only those points can possibly be local minima. We are only filtering the local minima out of the stationary points. So we only look at x which, again, is an eigenvector of M: recall that the first-order condition Mx = ‖x‖²x says exactly that x is an eigenvector with eigenvalue ‖x‖².

Because x is an eigenvector, we have two cases. The first case is that x has the top eigenvalue λ₁. Then x is just a global minimizer and we are done, because we know the global minimum is exactly the fit using the top eigenvalue: by the standard result in PCA, the best rank-one approximation of M is the top eigenvector, with the right scaling.

The second case is that x has an eigenvalue λ which is strictly less than λ₁; it could be the second eigenvalue, the third eigenvalue, and so forth. Then, because x is an eigenvector whose eigenvalue is different from the eigenvalue of v₁, you know that x is orthogonal to v₁, because eigenvectors with different eigenvalues have to be orthogonal. (There's no guarantee that two eigenvectors are always orthogonal: they could have the same eigenvalue and just live in the same eigenspace. But if they have different eigenvalues, then they have to be orthogonal.) That's why x is orthogonal to v₁. And then if you evaluate equation (2), the orthogonality means the first term 2(xᵀv₁)² goes away, so you get just ‖x‖² ≥ v₁ᵀMv₁. And recall that v₁ᵀMv₁ is just λ₁, and recall that ‖x‖² equals λ: by the first-order condition, ‖x‖² is exactly the scalar that is the eigenvalue of that eigenvector. So basically we have λ ≥ λ₁, and we have a contradiction, because this contradicts the assumption that λ is less than λ₁.
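This case analysis can also be watched numerically. The following is a sketch under the same assumed normalization f(x) = ½‖M − xxᵀ‖²_F: build a stationary point from a non-top eigenvector and check that the v₁ direction has strictly negative curvature, so the point is a saddle rather than a local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A @ A.T                              # symmetric PSD; distinct eigenvalues generically

eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
lam1, v1 = eigvals[-1], eigvecs[:, -1]   # top eigenpair (lambda_1, v_1)
lam, u = eigvals[-2], eigvecs[:, -2]     # a non-top eigenpair, lambda < lambda_1

# x = sqrt(lambda) * u satisfies M x = ||x||^2 x, so it is a first-order
# stationary point of f(x) = 1/2 ||M - x x^T||_F^2 that is NOT the global min.
x = np.sqrt(lam) * u
grad = 2.0 * (np.outer(x, x) - M) @ x
print(np.linalg.norm(grad))              # ~0: first-order stationary

# Quadratic form of the Hessian in the v_1 direction (||v_1|| = 1):
# 2 * ( 2 (x^T v1)^2 - v1^T M v1 + ||x||^2 ) = 2 * (lambda - lambda_1) < 0,
# since x is orthogonal to v_1. Negative curvature: x is a saddle point.
quad = 2.0 * (2.0 * (x @ v1) ** 2 - v1 @ M @ v1 + x @ x)
print(quad)
```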
So let's write that down. Okay, any questions about this? Okay, so maybe just a very quick summary. Basically, this is saying that if x is stationary (my "stationary point" always means first-order stationary point; I'm not going to clarify that again in the future), so if x is a stationary point and x is not a global minimum, then moving in the v₁ direction doesn't really change the function to first order, right, because at a stationary point the landscape is flat; moving in the v₁ direction would only lead to a second-order improvement. And that's why x is not a local minimum, because if x were a local minimum, moving in the v₁ direction shouldn't give you any second-order improvement either. So that's
basically the gist of the analysis.

All right, cool. So now let's talk about matrix completion, which is kind of like an upgraded version of PCA. And as I said, this is actually a pretty important question in machine learning. So let me define the question first, and then I can briefly talk about why people care about it. Let's also talk about the rank-one version, just for simplicity.

The question is the following. We assume the ground-truth matrix M is a rank-one matrix, and symmetric and PSD, just for simplicity. In other words, you can assume M is equal to zzᵀ, where z is the ground truth and has dimension d. And the setup is this: we are given random entries of M. You pick some random indices of M and you reveal the corresponding entries, and that's the only thing you know about M. And then the goal is to recover the rest of the entries.

So, more formally, you say there's a set Ω which is a subset of the indices [d] × [d], and this is a random subset in the sense that every entry is included in Ω uniformly at random, independently, with probability p. Okay, so each entry is included with some probability p, and what we observe is the so-called P_Ω(M). So let me define P_Ω: P_Ω(A) is the matrix obtained by zeroing out every entry outside Ω. So you take this matrix A, and everything that is not in Ω, you make those entries zero, and you are given this sparse matrix P_Ω(A). So we observe P_Ω(M), and our goal is to recover M.

And why do people care about this question? One reason, historically, is that it has this relationship with recommendation systems. Here I'm assuming the matrix is symmetric and so on and so forth, but if you relax those assumptions a little bit, it doesn't really change the essence of the problem. So suppose you have a matrix where, on one side, the columns are indexed by the users (let's say this is a matrix that Amazon maintains), and the other side is the items, and each entry is the rating of the user for the item. And every user probably has an opinion about every item, right, whether they like it or not, and so forth. But it's not like every user buys every item.
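Spelled out in code, this observation model is tiny. A minimal sketch (the function names here are my own):

```python
import numpy as np

def sample_omega(d, p, rng):
    """Omega as a boolean mask: each entry (i, j) is included
    independently, uniformly at random, with probability p."""
    return rng.random((d, d)) < p

def P_omega(A, omega):
    """P_Omega(A): zero out every entry of A outside Omega."""
    return np.where(omega, A, 0.0)

rng = np.random.default_rng(0)
d, p = 100, 0.3
z = rng.standard_normal(d)
z /= np.linalg.norm(z)            # ground-truth vector, ||z||_2 = 1
M = np.outer(z, z)                # rank-one symmetric PSD ground truth

omega = sample_omega(d, p, rng)
observed = P_omega(M, omega)      # the only thing the algorithm gets to see
print(omega.mean())               # fraction of revealed entries, close to p
```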
For sure, every user only buys a very small subset of the items, and that's why you only see some of the entries in this matrix. And Amazon wants to understand each user's preferences; you want to know which items each user likes. So Amazon has an incentive to fill in the entire table. I'm only using Amazon as an example, but the same thing applies to many other situations where you have to recommend items to users. So that's why you want to recover all the rest of the entries, to serve the users better in the future, and that's why this problem was important. It's still kind of important these days, but I guess there are many already existing methods to solve it, and the most used method to solve it is basically non-convex optimization: you find the ground-truth matrix M using the fact that it has a low-rank structure. Because how could you recover the rest of the entries if there were no low-rank structure in M? If the entries can be arbitrary, there's no way you can recover them. So that's why you have to assume that the matrix M has some low-rank structure, or some other structure.

Maybe just to give you a quick sense of how the structure matters here: if you count the number of parameters, we have d parameters to describe a rank-one matrix of dimension d by d, because you can just write it as xxᵀ with x ∈ ℝᵈ. And the number of entries you observe should probably be bigger than this degree of freedom, right? So the number of observations is about p·d², because each entry is observed with probability p, and this should be bigger than order d. If it's not bigger than d, it's unlikely anything can work. So basically that says p should be bigger than roughly 1/d. And this is actually the regime we are going to work with: we're going to work in the regime where p is not much bigger than 1/d, bigger by, for example, a logarithmic factor, something like that. So that's the setting we're going to be in.

And speaking of objective functions, this one is actually a pretty commonly used method in practice. You just say: I'm going to minimize this function, call it f(x), which is defined as follows. You basically have a parametrization, xxᵀ: this is your target, this is the
parametrization for the target matrix, and you want to say: this matrix should fit all my observations, right? So you take a sum over all the observed entries, because those are the only cases where you know what the entries are. You know those M_ij, you subtract x_i times x_j, and you take the square. So this is your prediction, this is your observation, and you take the square of the difference and sum over all the observed entries:

f(x) = Σ_{(i,j) ∈ Ω} (M_ij − x_i x_j)².

And just for future notational ease, you can write this as

f(x) = ‖P_Ω(M − xxᵀ)‖²_F,

because you look at the error matrix M − xxᵀ, right, and then you zero out all the entries you don't know (outside Ω you don't have the information, so you zero all of those out), and you take the sum of squares of the rest of the entries. So that's another way to write this objective.
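Since f is just a smooth function of x, the practical recipe this objective is used with, plain gradient descent, fits in a few lines. This is a hedged sketch: the dimension, sampling probability, step size, and iteration count are my own illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 200, 0.2
z = rng.standard_normal(d)
z /= np.linalg.norm(z)                    # ground truth, ||z||_2 = 1
M = np.outer(z, z)
omega = rng.random((d, d)) < p            # entries revealed w.p. p

def grad_f(x):
    """Gradient of f(x) = sum_{(i,j) in Omega} (M_ij - x_i x_j)^2,
    i.e. of ||P_Omega(M - x x^T)||_F^2, which equals -2 (R + R^T) x
    for R = P_Omega(M - x x^T)."""
    R = np.where(omega, M - np.outer(x, x), 0.0)
    return -2.0 * (R + R.T) @ x

x = 0.1 * rng.standard_normal(d)          # small random initialization
for _ in range(3000):                     # plain gradient descent
    x -= 0.05 * grad_f(x)

# Success means recovery up to the unavoidable sign ambiguity,
# since x x^T = (-x)(-x)^T: x should land near +z or -z.
err = min(np.linalg.norm(x - z), np.linalg.norm(x + z))
print(err)
```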
way to write this this [01:09:55] the another way to write this this function [01:09:56] function right so and just a side note I think [01:09:59] right so and just a side note I think actually there are many other methods [01:10:00] actually there are many other methods that can solve Matrix completion so [01:10:02] that can solve Matrix completion so there are convex transition methods and [01:10:04] there are convex transition methods and and so and so forth right so uh however [01:10:07] and so and so forth right so uh however those those methods actually often have [01:10:09] those those methods actually often have stronger guarantees for example they [01:10:10] stronger guarantees for example they have tighter sample complexity bonds so [01:10:12] have tighter sample complexity bonds so and so forth but in practice just [01:10:15] and so forth but in practice just because the convex position takes too [01:10:16] because the convex position takes too long time people actually are using uh [01:10:19] long time people actually are using uh objective functions or methods like this [01:10:21] objective functions or methods like this and then they just use green descent to [01:10:24] and then they just use green descent to optimize these kind of functions and [01:10:25] optimize these kind of functions and that's why it's kind of also practically [01:10:27] that's why it's kind of also practically relevant to analyze these kind of [01:10:30] relevant to analyze these kind of objective functions because they are [01:10:31] objective functions because they are they are indeed used in practice [01:10:33] they are indeed used in practice foreign [01:10:35] foreign so our [01:10:37] so our um [01:10:37] um our main goal is to prove that this [01:10:39] our main goal is to prove that this object function has no value global [01:10:43] object function has no value global um there's one uh assumption that I have [01:10:45] um there's one uh assumption that I have to 
specify, though it's not going to be used much in the proof in this lecture, because we are going to sweep some of these things under the rug. But I do have to mention this assumption. It may sound a bit unintuitive, and I won't spend too much time on it, but let me mention it. It's called the incoherence assumption, and this assumption is necessary; people know this. So, first of all, we assume that the ground truth has norm one, ‖z‖₂ = 1. This is without loss of generality; it's just for convenience, to fix the scale. And then, on top of that, you assume that the ground-truth vector z (recall that M = zzᵀ, so z is the ground truth) satisfies

‖z‖_∞ ≤ μ/√d,

where μ is considered to be a constant, or logarithmic in d.

So what this is saying is that this vector z has norm one, and also its entries are spread out, right? You cannot have all the mass concentrated on one entry. The reason you don't want that is because, for example, here is a counterexample: if z is just e₁, then your M is just e₁e₁ᵀ, which is a very, very sparse matrix whose top-left corner is one and everything else is zero. And then there's no way you can recover this matrix unless you happen to observe the top-left corner itself; you would basically have to see essentially all the entries. So this incoherence condition is, in some sense, trying to rule out these kinds of pathological cases, but I'm not going to talk too much about it. It's just for formality. Okay, cool.
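To get a quick feel for the incoherence parameter, you can compute μ = √d · ‖z‖_∞ for two unit vectors (a small illustration of my own): a dense random direction, which is incoherent, and the spiky counterexample z = e₁.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10000

# A "spread out" unit vector: a normalized Gaussian. Its largest entry is
# O(sqrt(log d) / sqrt(d)), so mu = sqrt(d) * ||z||_inf is only polylog(d).
z = rng.standard_normal(d)
z /= np.linalg.norm(z)
mu_dense = np.sqrt(d) * np.max(np.abs(z))

# The pathological case from the counterexample: z = e_1 puts all the mass
# on one coordinate, so mu = sqrt(d), the worst possible value.
e1 = np.zeros(d)
e1[0] = 1.0
mu_spiky = np.sqrt(d) * np.max(np.abs(e1))

print(mu_dense, mu_spiky)
```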
[01:12:51] Now, this is just for the rigor of the proof. So here is the theorem, and I'm going to stop after I state the theorem, and then prove it next time. The theorem is: suppose p is something like poly(mu) times log d over (d times epsilon). Recall that we are in the regime where p is roughly one over d, and this is that same regime: epsilon is something like a constant larger than zero, and this is a poly factor in mu and a poly-log factor in d. [01:13:28] So suppose p is on this order, and assume the incoherence condition. Then all local minima of f are... actually, you can prove that they are all exactly global minima, but for the moment we only prove that they are close, square-root-of-epsilon close, to either z or minus z. And z and minus z are clearly global minima, because there the error is exactly zero. [01:14:20] All right.
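As a rough restatement of what was just said on the board (my own transcription; the lecture does not pin down the constants or the exact poly factors):

```latex
p \;\gtrsim\; \frac{\operatorname{poly}(\mu)\,\operatorname{polylog}(d)}{d\,\varepsilon}
\quad\Longrightarrow\quad
\min\bigl(\|\hat z - z\|_2,\; \|\hat z + z\|_2\bigr) \;\le\; O(\sqrt{\varepsilon})
\ \text{ for every local minimum } \hat z \text{ of } f,
```

and z, -z are global minima, since the objective evaluates to exactly zero there.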
So that's the statement. [01:14:28] Also, just to mention, you can also prove a strict saddle condition. I just didn't include it, for the sake of simplicity, but you do have to prove it to get the rigorous result. If you don't prove it, and you just prove that all local minima are global, sometimes you can get somewhat misleading results. [01:15:01] I think there's a paper showing that in some somewhat weird cases you can prove a very strong-looking result, in the sense of "look, all local minima are global," but the reason it looks so strong is that in that setting you ignore those strict saddle conditions, which is problematic. [01:15:26] All right, so the proof is obviously too long to cover in one minute, so I guess
I'll leave it to the next lecture. [01:15:37] I can take some questions if anybody has any; otherwise, I think we are good for today. [01:15:47] Okay, there's a question, sounds great: are there any real neural network models where these properties are known to hold? [01:15:57] The answer is no, especially if you look for a global property, like all local minima being globally optimal. I don't think we have any proofs for any real neural network models. There is a proof for linearized network models, where all the activations are linear, and actually in that case, if you have more than two layers, you don't have the strict saddle condition: you have a lot of saddle points. [01:16:27] So basically the short answer is that I don't think there are any real, satisfactory cases where we know how to prove this. [01:16:37] I think there are results for two-layer networks if you assume some conditions on the input.
For example, if you assume that the inputs are linearly separable, then there is a proof for this. [01:16:51] And there are a bunch of other cases where you can get some partial results. [01:16:59] In the next lecture, maybe in the second half, I'm also going to give another result which is somewhat more general, in that it applies to many different architectures, but it has other kinds of constraints. It doesn't really show exactly these kinds of landscape properties; it shows that these kinds of properties hold for a special region of the parameter space. [01:17:27] That's the so-called NTK approach. I'm going to give more details, and there are also other kinds of problems with that approach, which I'm going to talk more about next lecture.
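For reference, the strict saddle condition mentioned above (said to be provable here but omitted for simplicity) is commonly formalized as follows in the non-convex optimization literature; the constants alpha, beta, delta below are generic placeholders, not values from the lecture:

```latex
% Strict saddle property: every point either is a near-minimizer,
% has a large gradient, or has a direction of strictly negative curvature.
\text{For some } \alpha, \beta, \delta > 0, \text{ every } x \text{ satisfies at least one of:}
\quad
\|\nabla f(x)\|_2 \ge \alpha,
\qquad
\lambda_{\min}\!\bigl(\nabla^2 f(x)\bigr) \le -\beta,
\qquad
\operatorname{dist}(x,\ \text{local minima}) \le \delta.
```

This is what guarantees that first-order methods with noise escape saddle points, which is why ignoring it can make "all local minima are global" results misleading.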
================================================================================ LECTURE INDEX.md ================================================================================

CS229M – Machine Learning Theory
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rP8nAmISxFINlGKSK4rbLKh
Total Videos: 20
Transcripts Downloaded: 20
Failed/No Captions: 0

---

Lectures

1. Stanford CS229M - Lecture 1: Overview, supervised learning, empirical risk minimization
   - Video: [https://www.youtube.com/watch?v=I-tmjGFaaBg](https://www.youtube.com/watch?v=I-tmjGFaaBg)
   - Transcript: [001_I-tmjGFaaBg.md](001_I-tmjGFaaBg.md)
2. Stanford CS229M - Lecture 2: Asymptotic analysis, uniform convergence, Hoeffding inequality
   - Video: [https://www.youtube.com/watch?v=Fx3xldCEfsM](https://www.youtube.com/watch?v=Fx3xldCEfsM)
   - Transcript: [002_Fx3xldCEfsM.md](002_Fx3xldCEfsM.md)
3. Stanford CS229M - Lecture 3: Finite hypothesis class, discretizing infinite hypothesis space
   - Video: [https://www.youtube.com/watch?v=io-YFfXbIXk](https://www.youtube.com/watch?v=io-YFfXbIXk)
   - Transcript: [003_io-YFfXbIXk.md](003_io-YFfXbIXk.md)
4. Stanford CS229M - Lecture 4: Advanced concentration inequalities
   - Video: [https://www.youtube.com/watch?v=fKM6fcOkXuk](https://www.youtube.com/watch?v=fKM6fcOkXuk)
   - Transcript: [004_fKM6fcOkXuk.md](004_fKM6fcOkXuk.md)
5. Stanford CS229M - Lecture 5: Rademacher complexity, empirical Rademacher complexity
   - Video: [https://www.youtube.com/watch?v=tkJd2B98hII](https://www.youtube.com/watch?v=tkJd2B98hII)
   - Transcript: [005_tkJd2B98hII.md](005_tkJd2B98hII.md)
6. Stanford CS229M - Lecture 6: Margin theory and Rademacher complexity for linear models
   - Video: [https://www.youtube.com/watch?v=echF7IWE05c](https://www.youtube.com/watch?v=echF7IWE05c)
   - Transcript: [006_echF7IWE05c.md](006_echF7IWE05c.md)
7. Stanford CS229M - Lecture 7: Challenges in DL theory, generalization bounds for neural nets
   - Video: [https://www.youtube.com/watch?v=kVkMRDZ5fcU](https://www.youtube.com/watch?v=kVkMRDZ5fcU)
   - Transcript: [007_kVkMRDZ5fcU.md](007_kVkMRDZ5fcU.md)
8. Stanford CS229M - Lecture 8: Refined generalization bounds for neural nets, Kernel methods
   - Video: [https://www.youtube.com/watch?v=gwKfeDRCvSg](https://www.youtube.com/watch?v=gwKfeDRCvSg)
   - Transcript: [008_gwKfeDRCvSg.md](008_gwKfeDRCvSg.md)
9. Stanford CS229M - Lecture 9: Covering number approach, Dudley Theorem
   - Video: [https://www.youtube.com/watch?v=wDfardbL50I](https://www.youtube.com/watch?v=wDfardbL50I)
   - Transcript: [009_wDfardbL50I.md](009_wDfardbL50I.md)
10. Stanford CS229M - Lecture 10: Generalization bounds for deep nets
    - Video: [https://www.youtube.com/watch?v=P5-VVI1qLxA](https://www.youtube.com/watch?v=P5-VVI1qLxA)
    - Transcript: [010_P5-VVI1qLxA.md](010_P5-VVI1qLxA.md)
11. Stanford CS229M - Lecture 11: All-layer margin
    - Video: [https://www.youtube.com/watch?v=GeXBfyrKfM4](https://www.youtube.com/watch?v=GeXBfyrKfM4)
    - Transcript: [011_GeXBfyrKfM4.md](011_GeXBfyrKfM4.md)
12. Stanford CS229M - Lecture 13: Neural Tangent Kernel
    - Video: [https://www.youtube.com/watch?v=btphvvnad0A](https://www.youtube.com/watch?v=btphvvnad0A)
    - Transcript: [012_btphvvnad0A.md](012_btphvvnad0A.md)
13. Stanford CS229M - Lecture 14: Neural Tangent Kernel, Implicit regularization of gradient descent
    - Video: [https://www.youtube.com/watch?v=xpT1ymwCk9w](https://www.youtube.com/watch?v=xpT1ymwCk9w)
    - Transcript: [013_xpT1ymwCk9w.md](013_xpT1ymwCk9w.md)
14. Stanford CS229M - Lecture 15: Implicit regularization effect of initialization
    - Video: [https://www.youtube.com/watch?v=l-CR_TLihdg](https://www.youtube.com/watch?v=l-CR_TLihdg)
    - Transcript: [014_l-CR_TLihdg.md](014_l-CR_TLihdg.md)
15. Stanford CS229M - Lecture 16: Implicit regularization in classification problems
    - Video: [https://www.youtube.com/watch?v=mham4hHpo7A](https://www.youtube.com/watch?v=mham4hHpo7A)
    - Transcript: [015_mham4hHpo7A.md](015_mham4hHpo7A.md)
16. Stanford CS229M - Lecture 17: Implicit regularization effect of the noise
    - Video: [https://www.youtube.com/watch?v=60GqpISCtCU](https://www.youtube.com/watch?v=60GqpISCtCU)
    - Transcript: [016_60GqpISCtCU.md](016_60GqpISCtCU.md)
17. Stanford CS229M - Lecture 18: Unsupervised learning, mixture of Gaussians, moment methods
    - Video: [https://www.youtube.com/watch?v=4xDEsLUkdG4](https://www.youtube.com/watch?v=4xDEsLUkdG4)
    - Transcript: [017_4xDEsLUkdG4.md](017_4xDEsLUkdG4.md)
18. Stanford CS229M - Lecture 19: Mixture of Gaussians, spectral clustering
    - Video: [https://www.youtube.com/watch?v=E6rZeGIKdRY](https://www.youtube.com/watch?v=E6rZeGIKdRY)
    - Transcript: [018_E6rZeGIKdRY.md](018_E6rZeGIKdRY.md)
19. Stanford CS229M - Lecture 20: Spectral clustering
    - Video: [https://www.youtube.com/watch?v=UYBRLG64oSQ](https://www.youtube.com/watch?v=UYBRLG64oSQ)
    - Transcript: [019_UYBRLG64oSQ.md](019_UYBRLG64oSQ.md)
20. Stanford CS229M - Lecture 12: Non-convex optimization, Non-convex opt for PCA, matrix completion
    - Video: [https://www.youtube.com/watch?v=EVyJkXOd5Xo](https://www.youtube.com/watch?v=EVyJkXOd5Xo)
    - Transcript: [020_EVyJkXOd5Xo.md](020_EVyJkXOd5Xo.md)