================================================================================
LECTURE 001
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors
Source: https://www.youtube.com/watch?v=DzpHeXVSC5I

--- Transcript

[00:00:05] So the thing that seems kind of amazing to me, and to us, is the fact that this course was taught just last quarter, and here we are with an enormous number of people again taking this class. I guess that says something; approximately what it says is "ChatGPT." But anyway, it's great to have you all, there's lots of exciting content ahead, and I hope you'll all enjoy it. So let me get started and tell you a bit about the course before diving straight into today's content. For people still coming in, there are oodles of seats still on either side, especially down near the front, so do feel empowered to go out and seek those seats.
[00:01:04] If people on the corridors are really nice, they could even move towards the edges to make it easier for people, but one way or another, feel free to find a seat. Okay, so this is the plan for what I want to get through today. First, I'm going to tell you about the course for a few minutes, then have a few remarks about human language and word meaning. Then the main technical thing we want to get into today is to start learning about the word2vec algorithm. The word2vec algorithm is slightly over a decade old now; it was introduced in 2013, but it was a wildly successful, simple way of learning vector representations of words. So I want to show you that as a sort of first easy baby system for the kind of neural representations that we're going to talk about in this class. We're then going to get more concrete with that, looking
at its objective function, gradients, and optimization, and then hopefully, if all goes well and I stick to schedule, spend a few minutes playing around in an IPython notebook (huh, I'm going to have to change computers for that) and seeing some of the things you can do with this. [00:02:24] Okay, so this is the course logistics in brief. I'm Christopher Manning, hi again everyone. The head TA unfortunately has a bit of a health problem, so he's not actually here today. We've got a course manager for the course, who is up the back there, and then we've got a whole lot of TAs. If you're a TA who's here, you could stand up and wave or something like that so people can see a few of the TAs and see some friendly faces. Okay, we've got some TAs, and some other ones, and you can look at them on the website. If you're here, you know what time the
class is. There's an email list, but preferably don't use it; use the Ed site that you can find on the course website. So the main place to go and look for information is the course website, which we've got up here, and that then links to Ed, which is what we're going to use as the main discussion board. Please use that rather than sending emails. [00:03:28] The first assignment for this class is a sort of easy one, the warm-up assignment, but we want to get people busy and doing stuff straight away, so the first assignment is already live on the web page, and it's due next Tuesday before class. You have slightly less than seven days left to do it, so do get started on that. To help with that, we're starting office hours immediately, tomorrow; they're also described on the website. We also do a few tutorials
on Friday; the first of these tutorials is a tutorial on Python and NumPy. Many people don't need that because they've done other classes and covered this, but we try to make this class accessible to everybody, so if you'd like to brush up a bit on Python or how to use NumPy, it's a great thing to go along to, and the TA right over there is going to be teaching it on Friday. [00:04:30] Okay, what do we hope to teach? At the end of the quarter, when you get the course evaluation, you'll be asked to rate whether this class met its learning goals. These are my learning goals; what are they? The first one is to teach you about the foundations and current methods for using deep learning applied to natural language processing. This class tries to build up from the bottom, so we start off doing simple things like word vectors and feed
forward neural networks, recurrent networks, and attention. We then fairly quickly move into the kind of key methods used for NLP in 2024. I wrote down here Transformers and encoder-decoder models; I probably should have written large language models somewhere in this list as well, but then pre-training and post-training of large language models, adaptation, model interpretability, agents, etc. But that's not the only thing we want to do; there are a couple of other things that we crucially want to achieve. The second is to give you some understanding of human languages and the difficulties in understanding and producing them on computers. Now, there are a few of you in this class who are linguistics majors, or perhaps symbolic systems majors (yay to the symbolic systems majors), but for quite a few of the rest of you, you'll never see any linguistics, in the sense of
understanding how language works, apart from this class. So we do want to try and convey a little bit of a sense of what some of the issues are in language structure, and why it has proven quite difficult to get computers to understand human languages, even though humans seem very good at learning to understand each other. And then the final thing we want to make it on to is actually, concretely, building systems, so that this isn't just a theory class. We actually want you to leave this class thinking: in my first job, wherever I go, whether it's a startup or big tech or some nonprofit, there's something they want to do; it would be useful if we had a text classification system, or if we did information extraction to get some kind of facts out of documents; I know how to build that, I can build that system, because I did CS224N. [00:07:05] Okay, here's how
you get graded. We have four assignments, mostly one and a half weeks long apart from the first one; they make up almost half the grade. The other half of the grade is made up of a final project, which has two variants, a custom or a default final project, which we'll get to in a minute, and then there are a few percent that go for participation. You have six late days. Collaboration policy: like all other CS classes, we've had issues with people not doing their own work. We really do want you to learn things in this class, and the way you do that is by doing your own work, so make sure you understand that. For the assignments, everyone is expected to do their own assignment; you can talk to your friends, but you're expected to do your own work. For the final project, you can work as a group. Then we have the issue of AI tools.
Now, of course, in this class we love large language models, but nevertheless we don't want you to do your assignments by saying, "Hey ChatGPT, could you answer question three for me?" That is not the way to learn things. If you want to make use of AI as a tool to assist you, such as for coding assistance, go for it, but we want you to work out how to answer the assignment questions by yourself. [00:08:40] Okay, so this is what the assignments look like. Assignment one is meant to be an easy on-ramp, and it's done as a Jupyter notebook. Assignment two then has people, well, what can I say, here we are at this fine liberal arts and engineering institution, we're not at a coding boot camp, so we hope that people have some deep understanding of how things work. So in assignment two, we actually want you to do some math and understand how things work
in neural networks. For some people, assignment two is the scariest assignment in the whole class, but it's also the place where we introduce PyTorch, which is the software package we use for building neural networks, and we build a dependency parser, which we'll get to later, as something more linguistic. Then for assignments three and four, we move on to larger projects using PyTorch with GPUs, and we'll be making use of Google Cloud. For those two assignments, we look at doing machine translation and getting information out with Transformers. And then there are the two final project options. Essentially, we have a default final project where we give you a lot of scaffolding and an outline of what to do, but it's still an open-ended project; there are lots of different things you can try to make the system work better, and we
encourage you to explore, but nevertheless you're given a leg up from quite a lot of scaffolding. We'll talk about this more later, but you can either do that option or come up with a project entirely your own and do that. [00:10:32] Okay, that's the course. Any questions on the course? Question: "For the final project, how are mentors assigned?" So, if you can find your own mentor, if you're interested in something and there's someone who's happy to mentor you, that person can be your mentor; otherwise one of the course TAs will be your mentor. As for how that person is assigned: the TA in charge of final projects assigns people, and they do the best they can in terms of matching students with mentors who have some relevant expertise, while dividing all the students across the mentors roughly equally. Any other questions? [00:11:19] Okay, I'll power ahead. Human language and word meaning. So let me just say
a little bit about the big picture here. We're in the area of artificial intelligence, and we've got this idea that humans are intelligent, and then there's the question of how language fits into that. This is something there is some argument about, and if you want, you can run off onto social media, read some of the arguments about these things, and contribute to them if you wish. But here is my, perhaps biased, take as a linguist. Well, you can compare human beings to some of our nearest neighbors, like chimpanzees and bonobos, and one big distinguishing thing is that we have language and they don't. But in most other respects, chimps are very similar to human beings, right? They can use tools, they can plan how to solve things, they've got really good memory;
chimps have better short-term memory than human beings do. So in most respects it's hard to show an intelligence difference between chimps and people, except for the fact that we have language. But our having language has been this enormous differentiator, right? If you look around at what happened on the planet, there are creatures that are stronger than us, faster than us, more venomous than us, with every possible advantage, but human beings took over the whole place. And how did that happen? We had language, so we could communicate, and that communication allowed human ascendancy. So one big role of language is the fact that it allows communication, but I'd like to suggest it's actually not the only role of language: language, I would argue, has also allowed humans to achieve a higher level
of thought. There are various kinds of thoughts that you can have without any language involved. You know, you can think about a scene, you can move some bits of furniture around in your mind, and there's no language; and obviously emotional responses, feeling scared or excited, happen with no language involved. But I think most of the time when we're doing higher-level cognition, if you're thinking to yourself, "Oh gee, my friend seemed upset about what I said last night; I should probably work out how to fix that, or maybe I could blah blah blah," we think in language and plan things out, and so it has given us a scaffolding to do much more detailed thought and planning. [00:14:29] Most recently of all, of course, human beings invented ways to write. Writing is really, really recent. I mean, no one really knows how old human
languages are; most people think a few hundred thousand years, which is not very long by evolutionary timescales. But writing, we do know, is really, really recent: writing is about 5,000 years old. And writing proved to be, again, this amazing cognitive tool that gave humanity an enormous leg up, because suddenly it's not only that you could share information and learn from the people standing within 50 feet of you; you could then share knowledge across time and space. So really, having writing was enough to take us from the Bronze Age, very simple metalworking, to the kind of mobile phones and all the other technology that we walk around with today, in just a very short amount of time. So language is pretty cool. [00:15:41] But one shouldn't only fixate on the sort of knowledge side of language and how that's made
human beings great. There's this other side of language, where language is a very flexible system used as a social tool by human beings, so that we can speak with a lot of imprecision and nuance and emotion and still get people to understand; we can set up new ways of thinking about things by using words for them. And languages aren't static: languages change as human beings use them. Languages aren't something that was delivered down on tablets from God; languages are things that humans constructed, and humans change them with each successive generation. And indeed, most of the innovation in language happens among young people, people a few years younger than most of you are now, in their early teens going into their twenties. That's a big period of linguistic innovation, where people
think up cool new phrases and ways of saying things, and some of those get embedded and extended, and that then becomes the future of the language. Herb Clark, who used to be a psychologist at Stanford and is now retired, had this rather nice quote: "The common misconception is that language use has primarily to do with words and what they mean. It doesn't. It has primarily to do with people and what they mean." [00:17:24] Okay, so that's language in two slides. So now we'll skip ahead to deep learning. In the last decade or so, we've been able to make fantastic progress in doing more with computers understanding human languages, using deep learning. We'll say a bit more about the history later on, but work on trying to do things with human language started in the 1950s, so it had been going for 60 years or so, and, you know, there
[00:17:58] And you know, there was some stuff; it's not that nobody could do anything. But the ability to understand and produce language had always been kind of questionable, and it's really in the last decade, with neural networks, that enormous strides of progress have been made, and that's led into the world that we have today. So one of the first big breakthroughs came in the area of using neural NLP systems for machine translation. This started about 2014 and was already deployed live on services like Google by 2016; it was so good that it saw really, really rapid commercial deployment. And overall, this kind of facility with machine translation just means that you're growing up in such a different world to people a few generations back.
[00:19:06] People a few generations back, unless you actually knew the different languages of different people, sort of had no chance to communicate with them, whereas now we're very close to having something like the Babel fish from The Hitchhiker's Guide to the Galaxy for understanding all languages. It's not a Babel fish, it's a cell phone, but you can have it out between two people and have it do simultaneous translation. And you know, it's not perfect, people keep on doing research on this, but by and large it means you can pick anything up from different areas of the world. As you can see, this example is from a couple of years ago, since it's still from the COVID pandemic era, but I can see this Swahili from Kenya and say, oh gee, I wonder what that means, stick it into Google Translate, and learn that Malawi lost two ministers due to COVID infections, and they died.
[00:20:06] Right, so we're just in this different era of being able to understand stuff. And then there are lots of other things that we can do with modern NLP. Until a few years ago, we had web search engines: you put in some text (you could write it as a sentence if you wanted to, but it didn't really matter whether you wrote a sentence or not), because what you got was some keywords that were then matched against an index, and you were shown some pages that might have the answers to your questions. But these days, you can put an actual question into a modern search engine, like "when did Kendrick Lamar's first album come out?". It can go and find documents that have relevant information, it can read those documents, and it can give you an answer, so that it actually becomes an answer engine, rather than just something that finds documents that might be relevant to what you're interested in.
[00:21:02] And the way that's done is with big neural networks. So you might commonly have, for your query, a retrieval neural network which can find passages that are similar to the query; those might then be reranked by a second neural network; and then there'll be a third, reading neural network that'll read those passages and synthesize information from them, which it then returns as the answer. Okay, that gets to about 2018. But then things got more advanced again. It was really around 2019 that people started to see the power of large language models, and back in 2019 those of us in NLP were really excited about GPT-2. It didn't make much of an impact on the nightly news, but it was really exciting in NLP land, because GPT-2, for the first time, meant here was a large language model that could just generate fluent text.
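The three-stage search pipeline described a moment ago (retrieve, rerank, read) can be sketched in miniature. In a real system each stage is a neural network; here each stage is a crude keyword heuristic standing in for one, and the function names, passages, and heuristics are all invented for illustration, just to show how data flows between the stages.

```python
# Toy sketch of a retrieve / rerank / read search pipeline. Real systems
# use three neural networks; each stage here is a simple stand-in.

def retrieve(query, passages, k=3):
    # Stage 1: score every passage by word overlap with the query.
    q = set(query.lower().split())
    return sorted(passages,
                  key=lambda p: -len(q & set(p.lower().split())))[:k]

def rerank(candidates):
    # Stage 2: re-order candidates; as a crude proxy for "answer-bearing",
    # bump passages containing a number to the front (stable sort).
    return sorted(candidates,
                  key=lambda p: not any(ch.isdigit() for ch in p))

def read(candidates):
    # Stage 3: "read" the top passage and return an answer (here, verbatim;
    # a real reader network would synthesize text across passages).
    return candidates[0] if candidates else ""

passages = [
    "Kendrick Lamar's first album Section.80 came out in 2011.",
    "The Golden Gate Bridge opened in 1937.",
    "Kendrick Lamar is a rapper from Compton, California.",
]
query = "when did Kendrick Lamar's first album come out"
print(read(rerank(retrieve(query, passages))))
```

The point is only the shape of the pipeline: each stage narrows or reorders what the previous one produced, and only the last stage produces the answer text.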
[00:22:08] Really, until then, NLP systems had done sort of a decent job at understanding certain facts out of text, but we'd just never been able to generate fluent text that was at all good. Whereas here, what you could do with GPT-2 is write something like the start of a story: "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown." And then GPT-2 would just write a continuation: "The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the US Department of Energy said it is working with the Federal Railroad Administration to find the thief..." And so the way this is working is it's conditioning on all the past material and, as I show in the very bottom line down here, it's generating one word at a time: whatever word it thinks would be likely to come next.
[00:23:07] And so from that simple method of generating words one after another, it's able to produce excellent text. And the thing to notice is that this text is not only formally correct, you know, the spelling is correct and the sentences are real sentences, not disconnected garbage, but it actually understands a lot. The prompt that was written said there were stolen nuclear materials in Cincinnati, but GPT-2 knows a lot of stuff: it knows that Cincinnati is in Ohio; it knows that in the United States it's the Department of Energy that regulates nuclear materials; it knows that if something is stolen, it's a theft, and that it would make sense that people are getting involved with that; it talks about, you know, there's a train carriage, so it's talking about the train line where it goes.
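The word-at-a-time generation loop just described can be sketched in a few lines. GPT-2's next-word distribution comes from a large neural network; as a stand-in, this sketch uses a tiny hand-made bigram table (all entries invented), but the loop itself (condition on the text so far, pick a likely next word, append it, repeat) has the same shape.

```python
import random

# Minimal sketch of autoregressive generation: repeatedly pick a likely
# next word given the text so far and append it. The "model" here is a
# toy bigram table standing in for a neural next-word distribution.
NEXT = {
    "the": ["incident", "train"],
    "incident": ["occurred"],
    "occurred": ["on"],
    "on": ["the"],
    "train": ["line"],
}

def generate(prompt, max_steps=8, seed=0):
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_steps):
        options = NEXT.get(words[-1])
        if not options:                        # no known continuation: stop
            break
        words.append(rng.choice(options))      # sample one next word
    return " ".join(words)

print(generate("the"))
```

A real model conditions on the whole history (not just the last word) and scores every word in the vocabulary, but the outer loop is exactly this.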
[00:24:10] It really knows a lot, and can write, you know, coherent discourse, like a real story. So that's kind of amazing. But things moved on from there, and now we're in the world of ChatGPT and GPT-4, and one of the things that we will talk about later is that this was a huge, huge user success, because now you could ask questions or give it commands and it would do what you wanted, and that was further amazing. So here I'm saying: "Hey, please draft a polite email to my boss Jeremy that I would not be able to come into the office for the next two days because my 9-year-old song" (that's a misspelling of "son", but the system works fine despite it) "Peter is angry with me that I'm not giving him much time." And it writes a nice email. It corrects the spelling mistake, because it knows people make spelling mistakes; it doesn't talk about songs; and everything works out beautifully.
[00:25:15] You can get it to do other things. So you can ask it, "What is unusual about this image?" In thinking about meaning, one of the things that's interesting with these recent models is that they're multimodal and can operate across modes. And so a favorite term that we coined at Stanford is the term "foundation models", which we use as a generalization of large language models: having the same kind of technology used across different modalities, images, sound, various kinds of bioinformatic things (DNA, RNA, things like that), seismic waves, any kind of signal, building these same kinds of large models. Another place that you can see that is going from text to images. So if I ask for a picture of a train going over the Golden Gate Bridge (this is now DALL-E 2), it gives me a picture of a train going over the Golden Gate Bridge.
[00:26:23] This is a perfect time to welcome anyone who's watching this on Stanford Online. If you're on Stanford Online and are not in the Bay Area, the important thing to know is that no trains go over the Golden Gate Bridge. But you might not be completely happy with this picture, because, you know, it shows the Golden Gate Bridge and a train going over it, but it doesn't show the bay. So maybe I'd like to get it with the bay in the background, and if I ask for that, well, look, now I've got a train going over the Golden Gate Bridge with the bay in the background. But this still might not be exactly what you want; maybe you'd prefer something that's a pencil drawing. So I can say "a train going over the Golden Gate Bridge, detailed pencil drawing", and I can get a pencil drawing. Or maybe it's unrealistic that the Golden Gate Bridge only has trains going over it.
[00:27:17] So maybe it'd be good to have some cars as well. So I could ask for a train and cars, and we get a train and cars going over it. Now, I actually made these ones all by myself, so you should be impressed with my generative AI artwork. But these examples are actually a bit old now, because they're done with DALL-E 2, and if you keep up with these things, that's a few years ago; there's now DALL-E 3 and so on, so we can now get much fancier things again. Right: "An illustration from a graphic novel. A bustling city street under the shine of a full moon, the sidewalks bustling with pedestrians enjoying the nightlife. At the corner stall, a young woman with fiery red hair, dressed in a signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man wearing a sharp suit, sporting a noteworthy mustache, is animatedly conversing on his steampunk telephone."
[00:28:12] And pretty much, we're getting all of that. Okay, so let's now get on to starting to think more about meaning. So, what can we do for meaning? If you think of words and their meaning: if you look up a dictionary and ask what "meaning" means, meaning is defined as the idea that is represented by a word or phrase; the idea that a person wants to express by using words; the idea that is expressed. And in linguistics, you know, if you go and do a semantics class or something, the commonest way of thinking of meaning is somewhat like what's presented up above there: meaning is thought of as a pairing between what's sometimes called a signifier and a signified, but it's perhaps easier to think of it as a symbol (a word) and then an idea or thing. And so this notion is referred to as denotational semantics.
[00:29:20] So the idea or thing is the denotation of the symbol, and this same idea of denotational semantics has also been used for programming languages, because in programming languages you have symbols, like while and if and variables, and they have a meaning, and that could be their denotation. So we would sort of say that the meaning of "tree" is all the trees you can find around the world. That's a sort of okay notion of meaning, and a popular one, but it's never been very obvious, or at least traditionally it wasn't very obvious, what we could do with that to get it into computers. So if you look at the pre-neural world, when people tried to handle meanings inside computers, they sort of had to do something much more primitive: looking at words and their relationships. So a very common traditional solution was to make use of WordNet.
[00:30:20] WordNet was kind of a fancy thesaurus that showed word relations, so it'd tell you about synonyms and "is-a-kind-of" things. So a panda is a kind of carnivore, which is a placental, which is a mammal, and things like that. "Good" has various meanings (it's a trade good, or the sense of goodness), and you could explore with that. But systems like WordNet were never very good for computational meaning. They missed a lot of nuance: WordNet would tell you that "proficient" is a synonym for "good", but if you think about all the things that you would say were good (you know, "that was a good shot"), would you say "that was a proficient shot"? Sounds kind of weird to me. You know, there's a lot of color and nuance in how words are used. WordNet is also very incomplete: it's missing anything that's kind of cooler, more modern slang (this maybe isn't very modern slang now, but you won't find more modern slang in it either).
[00:31:20] It's sort of very human-made, etc.; it's got a lot of issues. So this led into the idea of: can we represent meaning differently? And this leads us into word vectors. So when we have words, "wicked", "badass", "nifty", "wizard", what do they turn into when we have computers? Well, effectively, words are these discrete symbols; they're just some kind of atom or symbol. And if we then turn those into something that's closer to math, how symbols are normally represented is that you have a vocabulary, and your word is some item in that vocabulary. So "motel" is this word in the vocabulary and "hotel" is that word in the vocabulary, and commonly this is what computational systems do: you take all your strings and you index them to numbers, and that's the sort of position in a vector that they belong in.
[00:32:30] And, well, we have huge numbers of words, so we might have a huge vocabulary, so we'll have very big, long vectors, and these get referred to as one-hot vectors for representing the meaning of words. But representing words by one-hot vectors turns out not to be a very good way of computing with them. It was used for decades, but it turns out to be kind of problematic, and part of why it's problematic is that it doesn't have any natural, inherent sense of the meanings of words. You just have different words: you have hotel and motel and house and chair. And if you think about it in terms of these vector representations, if you have "motel" and "hotel", there's no indication that they're kind of similar; they're just two different symbols which have ones in different positions in the vector. Or, formally, in math terms, if you think about taking the dot product of these two vectors: zero.
[00:33:35] The two vectors are orthogonal; they have nothing to do with each other. Now, there are things that you can do with that. You can start saying, oh, let me start building up some other resource of word similarity, and I'll consult that resource of word similarity, and it'll tell me that motels and hotels are similar to each other. And people did things like that; in web search it was referred to as query expansion techniques. But still, the point is that there's no natural notion of similarity in one-hot vectors. And so the idea was that maybe we could do better than that: we could learn to include similarity in the vectors themselves. And that leads into the idea of word vectors, but it also leads into a different way of thinking about semantics. I just realized I forgot to say one thing, back two slides: these kinds of representations are referred to as localist representations.
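The one-hot picture is easy to make concrete. A minimal sketch (the four-word vocabulary is made up for illustration): each word owns one position in a vector as long as the vocabulary, and any two distinct words come out orthogonal.

```python
# One-hot ("localist") vectors: each word owns one position in a vector
# as long as the vocabulary. Any two distinct words then have dot product
# 0, i.e. they are orthogonal, so "motel" looks no closer to "hotel" than
# it does to "chair".
vocab = ["motel", "hotel", "house", "chair"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(one_hot("motel"))                         # [1.0, 0.0, 0.0, 0.0]
print(dot(one_hot("motel"), one_hot("hotel")))  # 0.0
```

With a real half-million-word vocabulary the vectors would simply have half a million entries, still with a single 1.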
[00:34:38] Meaning that there's one point at which something is represented: you've got, here is the representation of "motel", and here is the representation of "hotel"; it's in one place in the vector that each word is represented. And that'll be different to what we do next. So there's an alternative idea of semantics, which goes back quite a long way. People commonly quote this quote of J.R. Firth, who was a British linguist, who said in 1957, "You shall know a word by the company it keeps." But it also goes back to philosophical work by Wittgenstein and others: that what you should do is represent a word's meaning by the contexts in which it appears. So the words that appear around the word give information about its meaning, and that's the idea of what's called distributional semantics, in contrast to denotational semantics.
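The distributional idea can be sketched directly: gather the words that occur within a small window around each occurrence of a target word. These co-occurrence statistics are the raw material that word vectors are learned from. (The window size and the second example sentence below are made up for illustration.)

```python
from collections import Counter

# Sketch of the distributional idea: represent a word by the words that
# appear around it. Here we just count context words in a fixed-size
# window around each occurrence of the target word.
def context_counts(target, sentences, window=2):
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                left = words[max(0, i - window):i]
                right = words[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

sentences = [
    "government debt problems turning into banking crises as happened in 2009",
    "unified regulation of banking reform is debated across Europe",
]
print(context_counts("banking", sentences))
```

With a large corpus, words like "crises", "regulation", and "monetary" would dominate the counts for "banking", and that profile of company kept is what stands in for its meaning.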
[00:35:43] sentences that use the word banking. Here are some sentences using the word banking: "government debt problems turning into banking crises as happened in 2009," etc., etc. And knowing about that context, the words that occur around banking, those will become the meaning of banking. And so we're going to use those statistics about words and what other words appear around them in order to learn a new kind of representation of a word. So our new representation of words is: we're going to represent them now as a dense, sort of shorter, dense vector that gives the meaning of the words. Now, my vectors are very short here; these are only eight-dimensional, if I counted right, so I could fit them on my slide. They're not that short in practice; they might be 200 to 2,000, but reasonably short; they're not going to be like the half a million different words in our
[00:36:50] vocabulary. And the idea is, if words have stuff to do with each other, they'll have sort of similar vectors, which corresponds to their dot product being large. So for banking and monetary in my example here, both of them are positive in the first dimension, positive in the second dimension, negative on the third; on the fourth they've got opposite signs. So if we want to work out the dot product, we're taking the product of the corresponding terms, and it'll get bigger to the extent that both of the corresponding components have the same sign, and bigger if they have large magnitude. Okay, so these are what we call word vectors, which are also known as embeddings or neural word representations, or phrases like that. And so the first thing we want to do is learn good word vectors for different words, and our word vectors will be good word vectors if they give us a good sense of the meanings of words.
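A quick sketch of that dot-product intuition (these 8-dimensional values are invented for illustration, not the actual slide numbers): components with matching signs push the dot product up, mismatched signs push it down.

```python
import numpy as np

# Invented short dense vectors, in the spirit of the slide example
banking  = np.array([ 0.29,  0.28, -0.41,  0.03,  0.11, -0.27, -0.55,  0.19])
monetary = np.array([ 0.41,  0.08, -0.18, -0.30,  0.01, -0.12, -0.44,  0.21])
skillet  = np.array([-0.38,  0.07,  0.32,  0.22, -0.29,  0.41,  0.09, -0.33])

print(banking @ monetary)  # relatively large: signs mostly agree
print(banking @ skillet)   # small (negative here): signs mostly disagree
```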
[00:37:52] They know which words are similar to other words in meaning. We refer to them as embeddings because we can think of this as a vector in a high-dimensional space, so that we're embedding each word as a position in that high-dimensional space, and the dimensionality of the space will be the length of the vector, so it might be something like a 300-dimensional space. Now, that kind of gets problematic, because human beings can't look at 300-dimensional spaces and aren't very good at understanding or visualizing what goes on in them, so the only thing that I can show you is two-dimensional spaces. But a thing that is good to have somewhat in your head is that really high-dimensional spaces behave extremely differently from two-dimensional spaces. In a two-dimensional space, you're
[00:39:02] only near to something else if you've got similar x and y coordinates; in a high-dimensional space, things can be very near to all sorts of things on different dimensions in the space, and so we can capture different senses of words and ways that words are similar to each other. But here's the kind of picture we end up with. So what we're going to do is learn a way to represent all words as vectors based on the other words that they appear with in context, and we can embed them into this vector space. And of course you can't read anything there, but you know, we can zoom into this space further, and if we zoom into this space and just show a bit of it, well, here's a part of the space where it's showing country words and some other location words. So we've got countries up the top; there we've got some nationality terms: British, Australian, American, European. And further
[00:40:02] down, or we can go to another piece of the space, and here's a bit of the space where we have verbs. And not only have we got verbs, but you know, there's actually quite a lot of fine structure here of what's similar that represents things about verbs. So you've got sort of verbs of communication: statements, saying, thinking, expecting, grouping together; come and go group together; down the bottom you've got forms of the verb have; then you've got forms of the verb to be above them; you've got become and remain, which are actually sort of similar to the verb to be, because they take these sorts of complements of state. So just as you can say "I am angry," you can say "he remained angry" or "he became angry," right? So those verbs are, more so than most verbs, sort of similar to the verb to be. So we get these kinds of
[00:41:00] interesting semantic spaces where things that have similar meaning are close by to each other. And so the question is how do we get to those things, and there are various ways of doing it, but the one I want to get through today is showing you word2vec. Okay, I'll pause for 30 seconds for breath. Anyone have a question or anything they want to know? Yes? [Student] But it doesn't solve the problem where similar meanings might depend on context, right? So, to take your example: those two words have their own vectors, and we understand similarity from those vectors, but it's context-dependent, because in different contexts those two are similar or not, and this algorithm does not capture that. [Manning] Yes, correct. So that's a good thought; you can keep it for a few weeks, to some extent. Yeah, so for the first thing we're going to do, we're just
[00:42:16] going to learn one word vector for a string. So we're going to have a word, let's say it's star, and we're going to learn one word vector for it, so that absolutely doesn't capture the meaning of a word in context: it won't be saying whether it's meaning a Hollywood star or an astronomical star or something like that. And so later on we're going to get on to contextual meaning representations, so wait for that. But, going along with what I said about high-dimensional spaces being weird, the cool thing that we will already find is that our representation for star will be very close to the representations for astronomical words like nebula and every other astronomical word you know, and simultaneously it'll be very close to words that mean something like a Hollywood star. Help me out, any words that mean something similar? Celebrity,
[00:43:22] that's a good one, okay. Yeah? [Student] How are you reducing the embedding to a lower-dimensional space to be visualized? [Manning] So the pictures I was showing you used a particular method called t-SNE, which is a nonlinear dimensionality reduction that tends to work better for high-dimensional neural representations than PCA, which you might know, but I'm not going to go into that now. Yes? [Student] How do you know the dimension [inaudible] but not too [inaudible]? [Manning] I mean, that's something that people have worked on. It depends on how much data you've got to make your representations over, you know. So normally it's worked out either empirically for what works best, or practically based on how big vectors you want to work with. I mean, to give you some idea, things start to work well when you get to a 100-dimensional space. For a long time people used 300 dimensions because that seemed to work pretty well.
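A minimal sketch of the kind of t-SNE projection being described, using scikit-learn; random vectors stand in for real word embeddings here, and the sizes and perplexity are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 300))  # 50 stand-in "word vectors", 300-d

# Nonlinear reduction from 300-d down to 2-d for plotting
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)
print(points.shape)  # (50, 2): one 2-d point per word, ready to scatter-plot
```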
[00:44:34] But as people have started building huger and huger models with way, way more data, it's now become increasingly common to use numbers like 1,000- or even 2,000-dimensional vectors. Yeah? [Student] Okay, so you mentioned that there's sort of hidden structure in small areas as well as large areas of the embedding, and in different pieces, different structures will come up. But generally we seem to use distance as the single metric for closeness, which doesn't seem sufficient to me, like the distance between this and that in the space will be the same, right? So how would that work? [Manning] We don't only use distance; we also use directions in the space as having semantic meanings, and I'll show an example of that soon. Yeah? [Student] The entries, they seem to be between -1 and 1. Is there a reason for that, or do we have bounds that we impose on them? [Manning] So, good question. I mean, you know, they don't have to be,
[00:45:40] and the way we're going to learn them, they're not bounded. But you know, you can bound things; sometimes people length-normalize so that the vectors are of length one. But at any rate, normally in this work we use some method called regularization that tries to kind of keep coefficients small, so they're generally not getting huge. Yeah? [Student] For a specific word, for example like bank that we used before in the previous slides, for the word representation is there a single embedding for each word, or do we have multiple embeddings for each word? [Manning] For what we're doing at the moment, each word, each string of letters, has a single embedding, and what you can think of that embedding as is kind of an average over all its senses. So for example, bank can mean the financial institution, or it can also mean the river bank.
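The length normalization mentioned in passing is just dividing a vector by its L2 norm, for example:

```python
import numpy as np

v = np.array([3.0, 4.0])           # example vector, length 5
v_unit = v / np.linalg.norm(v)     # length-normalize to a unit vector
print(v_unit)                      # [0.6 0.8]
print(np.linalg.norm(v_unit))      # 1.0 (up to floating point)
```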
[00:46:46] And then what I said before about star applies. The interesting thing is you'll find that we're able to come up with a representation where our learned representation, because it's kind of an average of those senses, will end up similar to words that are semantically evoked by both senses. I think I should probably go on at this point. Okay, word2vec. So word2vec was this method of learning word vectors that was thought up by Tomas Mikolov and colleagues at Google in 2013. You know, it wasn't the first method; there are other people that did methods of learning word vectors that go back to about the turn of the millennium. It wasn't the last; there are ones that come after it as well. But it was a particularly simple one and a particularly, you know, fast-running one, and so it really caught people's attention. So the idea of it is that we start off with a large
[00:47:52] amount of text, so that can just be thought of as a long list of words, and in NLP we refer to that as a corpus. Corpus is just Latin for body, so you know, it's exactly the same as if you have a dead person on the floor, right? That's a corpus. So it's just a body, but we mean a body of text, not a live person, oh sorry, a dead person. Yeah. If you want to know more about Latin, since there isn't very good classical education these days: corpus, despite the -us ending, is a third-declension neuter noun, and that means the plural of corpus is not "corpi"; the plural of corpus is corpora. So I'm sure sometime later in this class I will read a project or assignment that refers to "corpi," and I will know that that person was not paying attention in the first lecture, or else they would have said corpora. C-o-r-p-o-r-a is the correct form
[00:49:05] for that. Okay, I should move on. Okay, so we have our text. Then we know that we're going to represent each word, and this is each word type, so you know, star or bank, etc., wherever it occurs, by a single vector. And so what we're going to do in this algorithm is we're going to go through each position in the text, and at each position in the text, which is a list of words, we're going to have a center word and words outside it. And then what we're going to do is use the similarity of the word vectors for the center word and the outside words to calculate the probability that they should have occurred or not, and then we just keep fiddling and we learn the word vectors. Maybe I'll just show this more concretely first. So here's the idea: we're going to have a vector for each word type. So a word type means, you know, the
[00:50:08] word problems wherever it occurs, which is differentiated from a word token, which is this instance of the word problems. So we're going to have a vector for each word type, and so I'm going to want to know: look, in this text, the word turning occurred before the word into; how likely should that have been to happen? And what I'm going to do is calculate a probability of the word turning occurring close to the word into, and I'm going to do that for each word in a narrow context. In the example here, I'm using two words to the left and two words to the right, and what I want to do is make those probability estimates as good as possible. So in particular, I want the probability of co-occurrence to be high for words that actually do occur within the nearby context of each other. And so then the question is how am I going to, oh, and once I've done it for that word, I'm
[00:51:10] going to go along and do exactly the same thing for the next word, and so I can continue through the text in that way. And so what we want to do is come up with vector representations of words that will let us predict these probabilities, quote-unquote, well. Now, you know, there's a huge limit to how well we can do it, because we've got a simple model; obviously when you see the word banking, I can't tell you that the word into is going to occur before banking, but I want to do it as well as possible. So what I want my model to say is: after the word banking, crisis is pretty likely, but the word skillet is not very likely. And if I can do that, I'm doing a good job. And so we turn that into a piece of math; here's how we do it. So we're going to go through our corpus, every position in the corpus.
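The pass being described, taking each position in turn as the center word with up to m words on each side, can be sketched like this (window size m = 2 and the text fragment are the lecture's running example):

```python
# Slide a center word and its context window through a short text,
# as word2vec does during training (window size m = 2 here).
text = "government debt problems turning into banking crises".split()
m = 2

windows = []
for t, center in enumerate(text):
    # Context words: up to m on each side, skipping the center itself
    context = [text[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(text)]
    windows.append((center, context))
    print(f"{center:>10} -> {context}")
```

At the position of "into," for instance, the context is ['problems', 'turning', 'banking', 'crises'], and those are the co-occurrences whose probability the model is asked to make high.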
[00:52:23] We're going to have a fixed window size m, which was two in my example, and then what you're going to want to do is have the probability of words in the context being as high as possible. So we want to maximize this likelihood, where we're going through every position in the text, and then we're going through every word in the context, and sort of wanting to make this big. Okay, so conceptually that's what we're doing, but in practice we never quite do that; we use two little tricks here. The first one is, you know, for completely arbitrary reasons, it really makes no difference, everyone got into minimizing things rather than maximizing things, and so the algorithms that we use get referred to as gradient descent, as you'll see in a moment. So the first thing we do is put a minus sign in front so that we can minimize it rather than maximize it.
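Written out, the likelihood being maximized is the following (notation assumed from the standard course slides, not shown in this transcript: $T$ positions in the corpus, window size $m$, center word $w_t$, parameters $\theta$):

```latex
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)
```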
[00:53:27] That part's pretty trivial. But the second part is: here we have this enormous product, and working with enormous products is more difficult for the math. So the second thing that we do is introduce a logarithm, and once we take the log of the likelihood, then, when we take logs of products, they turn into sums. And so now we can sum over each word position in the text, sum over each word in the context window, and then sum these log probabilities, and then we've still got the minus sign in front, so we want to minimize the sum of log probabilities. So what we're doing is then wanting to look at the negative log likelihood. And then the final thing that we do is, since this will get bigger depending on the number of words in the corpus, we divide through by the number of words in the corpus, and so our objective function is the average negative log likelihood.
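With the minus sign, the logarithm, and the division by the corpus length $T$, the average negative log likelihood being described is (same assumed notation: window size $m$, center word $w_t$, parameters $\theta$):

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)
```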
So by minimizing this objective function, [00:54:36] we're maximizing the probability of words in the context. Okay, we're almost there; that's what we want to do, but we've got a couple more tricks to get through. The next one is: well, I've said we want to maximize this probability, but how do we maximize it? What is this probability? We haven't defined how we're going to calculate it, and this is where the word vectors come in. We're going to define this probability in terms of the word vectors. So we're going to say each word type is represented by a vector of real numbers (here, 100 real numbers), and we're going to have a formula that works out the probability simply in terms of the vectors for each word; there are no other parameters in this model. So over here I've shown this theta, which
are the parameters of our model, [00:55:45] and all and only the parameters of our model are these word vectors, one for each word in the vocabulary. That's a lot of parameters, because we have a lot of words and we've got fairly big word vectors, but they are the only parameters. Okay, and how we do that is by using this little trick here: we're going to say that the probability of an outside word given a center word is defined in terms of the dot product of the two word vectors. So if things have a high dot product, they'll be similar, and therefore they'll have a high probability of co-occurrence; where I mean similar in a kind of a weird sense, right? It is the case that we're going to want to say hotel and motel are similar, but it's also the case that we're going to want the word "the" to be able to appear easily before the word "student", so in some weird sense "the" also has to be similar to "student".
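The "high dot product means similar" idea can be sketched with toy numbers. These 4-dimensional vectors are made up purely for illustration (real word2vec vectors are learned, and the lecture's are 100-dimensional):

```python
import numpy as np

# Made-up toy vectors, just to show the dot product acting as a similarity score.
vec = {
    "hotel": np.array([0.9, 0.1, 0.3, -0.2]),
    "motel": np.array([0.8, 0.2, 0.25, -0.1]),
    "zebra": np.array([-0.5, 0.7, -0.6, 0.4]),
}

def dot(a, b):
    # Multiply the two vectors componentwise, then sum: the dot product.
    return float(np.sum(vec[a] * vec[b]))

print(dot("hotel", "motel"))  # larger -> treated as similar
print(dot("hotel", "zebra"))  # smaller (negative here) -> dissimilar
```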
That is, it has to be similar to basically any noun, right? [00:56:44] Okay, so we're going to work with dot products, and then we do this funky little bit of math here, and that will give us our probabilities. So let's just go through the funky bit of math. Here's our formula for the probabilities. What we're doing is starting off with this dot product. The dot product is: you take the two vectors, multiply each component together, and sum them up. So if two components have the same sign, that increases your dot product, and if they're both big, it increases it a lot. Okay, so that gives us a similarity between two vectors, and that's unbounded; it's just a real number, and it can be either negative or positive. But what we'd like to get out is a probability. So for our next trick, we first of all exponentiate, because if we take e to the x for any x, we
now have to get something positive out, right? That's what exponentiation does. [00:57:50] And then, since it's meant to be a probability, we'd like it to be between 0 and 1, so we turn it into numbers between 0 and 1 in the dumbest way possible, which is that we just normalize: we work out the quantity in the numerator for every possible context word, we get the total of all of those numbers, and we divide through by it. And then we're getting a probability distribution of how likely different words are in this context. Okay, so this little trick that we're doing here is referred to as the softmax function. For the softmax function, you can take unbounded real numbers, put them through this little softmax trick that we just went through the steps of, and what you'll get out is a probability distribution.
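The exponentiate-then-normalize recipe just described can be sketched in a few lines of NumPy; the tiny vocabulary size, dimensionality, and random vectors below are made-up stand-ins for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                    # made-up tiny vocabulary size and vector dimension
U = rng.normal(size=(V, d))    # an "outside" vector u_w for every word w
v_c = rng.normal(size=d)       # the center word's vector

# P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
scores = U @ v_c               # dot product of v_c with every outside vector
probs = np.exp(scores) / np.exp(scores).sum()

print(probs)        # V positive numbers...
print(probs.sum())  # ...that sum to 1: a probability distribution
```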
So I'm now getting, in this example, a probability distribution over context words: [00:58:57] my probability estimates over all the context words in my vocabulary will sum up to one, by definition, by the way that I've constructed this. It's called the softmax function because it amplifies the probabilities of the largest things (that's because of the exp), but it's "soft" because it still assigns some probability to smaller items. It's sort of a funny name, because max normally picks out just one thing, whereas the softmax is turning a bunch of real numbers into a probability distribution. This softmax is used everywhere in deep learning: any time we want to turn things that are just vectors in R^n into probabilities, we shove them through a softmax function. Okay.
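As a standalone function, a softmax might look like this; the max-subtraction is a standard numerical-stability trick that the lecture doesn't mention, and it doesn't change the answer because the shift cancels in the ratio:

```python
import numpy as np

def softmax(x):
    # Turns any vector in R^n into a probability distribution.
    e = np.exp(x - np.max(x))  # shift for numerical stability; result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1, -3.0]))
print(p)        # the largest input gets the largest share, smaller ones keep some mass
print(p.sum())  # the whole thing sums to 1
```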
So in some sense this part, I think, still seems very abstract, [01:00:10] and the reason it seems very abstract is that I've said we have vectors for each word, and using these vectors we can calculate probabilities; but where do the vectors come from? And the answer is that we're going to turn this into an optimization problem. We have a large amount of text, and so we can hope to find word vectors that make the probabilities of the contexts of the words in our observed text as big as possible. So literally what we're going to do is start off with random vectors for every word, and then we want to fiddle those vectors so that the calculated probabilities of words in a context go up, and we're going to keep fiddling until they stop going up anymore and we're getting the highest probability estimates that
we can. [01:01:23] And the way that we do that fiddling is that we use calculus. What we're going to do is conceptually exactly what you'd do in something like a two-dimensional space, like the picture on the right: if you want to find the minimum in this two-dimensional space and you start off at the top left, you can say, let me work out the derivatives of the function at the top left, and they point sort of down and a bit to the right, so you can walk down and a bit to the right. Then you can say, given where I am now, let me work out the derivatives again; what direction do they point? They're still pointing down, but a bit more to the right, so you can walk a bit further that way, and you can keep on walking, and eventually you'll make it to the minimum of the space. In our case, we've got a lot more than two dimensions.
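The walking-downhill picture can be sketched as plain gradient descent on a made-up two-dimensional bowl (not the word2vec objective itself, just the 2-D analogy):

```python
# f(x, y) = x**2 + 2*y**2 has its minimum at (0, 0).
def grad(x, y):
    # Analytic partial derivatives of f.
    return 2.0 * x, 4.0 * y

x, y = -4.0, 3.0    # start somewhere up the side of the bowl
lr = 0.1            # step size
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx    # step against the gradient: downhill
    y -= lr * gy    # then repeat from the new position

print(x, y)         # both end up very close to (0, 0)
```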
So our parameters for our model are the concatenation of all the word vectors. [01:02:21] But it's even slightly worse than I've explained up until now, because actually, for each word, we assume two vectors: one vector when it's the center word and one vector when it's the outside word. Doing that just makes the math a bit simpler, which I can explain later. So if we had, say, 100-dimensional vectors, we'd have 100 parameters for aardvark as an outside word, 100 parameters for a as an outside word, all the way through to 100 parameters for zebra as an outside word; then we'd have 100 parameters for aardvark as a center word, continuing down. So if we had a vocabulary of 400,000 words and 100-dimensional word vectors, that means we'd have 400,000 × 2 = 800,000 vectors, each with 100 numbers: 80 million parameters. So that's a lot of parameters in our
space to try and fiddle with to optimize things, but luckily we have big computers, [01:03:30] and that's the kind of thing that we do. So we simply say: this is our optimization problem, we're going to compute the gradients with respect to all of these parameters, and that will give us the answer. And you know, this feels like magic. It doesn't really seem like we could start with nothing, just random word vectors and a pile of text, and say, do some math, and we will get something useful out. But the miracle of what happens in these deep learning spaces is that we do get something useful out: we can just minimize over all of the parameters, and then we'll get something useful out. So, I guess I'm not going to quite get to the end of what I'd hoped to today, but what I wanted to do is get through some of what we do here. But
you know, [01:04:45] I wanted to take a few minutes to go through concretely how we do the math of the minimization. Now, lots of different people take CS224N, and some of you know way more math than I do, so this next 10 minutes might be extremely boring; if that's the case, you can either catch up on Discord or Instagram or something, or else you can leave. But it turns out there are other people who take CS224N who can't quite remember when they last did a math course, and we'd like everybody to be able to learn something about this, so I do actually like, in the first two weeks, to go through it a bit concretely. So let's try to do this. This was our likelihood, and we'd already covered the fact that what we were going to do is have an objective function, in terms of our parameters, that was the negative,
average negative log likelihood [01:05:51] the average negative log likelihood across all the [01:05:52] across all the words [01:05:54] words um if I remember the notation for this [01:05:58] um if I remember the notation for this the [01:05:59] the sum um in this [01:06:03] sum um in this oops um I'll probably have a hard time [01:06:05] oops um I'll probably have a hard time writing this um the sum of position M [01:06:11] writing this um the sum of position M I've got a more neatly written out [01:06:13] I've got a more neatly written out version of it that appears on the [01:06:14] version of it that appears on the version of the slides it's on the [01:06:17] version of the slides it's on the webiz um and then we're going to be [01:06:19] webiz um and then we're going to be taking this [01:06:21] taking this log of the probability of the word [01:06:26] log of the probability of the word at [01:06:27] at position um t [01:06:32] plus sorry position J um t + [01:06:38] plus sorry position J um t + [Music] [01:06:39] [Music] J [01:06:41] J okay trying to write this on my iPad is [01:06:44] okay trying to write this on my iPad is not working super well I'll confess [01:06:47] not working super well I'll confess we'll see how I get on um [01:06:50] we'll see how I get on um WT okay [01:06:53] WT okay um okay and so then we had the form of [01:06:57] um okay and so then we had the form of what we um wanted to use for the [01:07:01] what we um wanted to use for the probability and the probability of an [01:07:04] probability and the probability of an outside word given a context word is was [01:07:08] outside word given a context word is was then this soft maxed equation where [01:07:10] then this soft maxed equation where we're taking the x of the outside [01:07:16] we're taking the x of the outside vector and the center [01:07:20] vector and the center Vector over the normalization term where [01:07:24] Vector over the normalization term where we sum over the 
vocabulary. [01:07:38] Okay, so to work out how to change our parameters (our parameters are all of these word vectors, which we summarize inside theta), what we're going to want to do is work out the partial derivative of this objective function with respect to all the parameters theta. In particular, I'm just going to start here with the partial derivatives with respect to the center word, and we can work through the outside words separately. Well, this partial derivative is a big sum of terms like this, and when I have a partial derivative of a big sum of terms, I can work out the partial derivative of each term independently and then sum them. So what I want to be doing is working out the partial derivative of the log of this probability, which equals the log of that, with respect to the center vector.
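In symbols, the quantity about to be differentiated, and the split into numerator and denominator terms that the next step uses, is:

```latex
\frac{\partial}{\partial v_c} \log P(o \mid c)
  \;=\; \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
  \;=\; \frac{\partial}{\partial v_c} \, u_o^{\top} v_c
  \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c)
```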
And so at this point, [01:09:02] I have a log of two things being divided, and that means I can separate it out into the log of the numerator minus the log of the denominator. So what I'll be doing is working out the partial derivative, with respect to the center vector, of the log of the numerator, log exp(u_o^T v_c), minus the partial derivative, with respect to the center vector, of the log of the denominator, which is the log of the sum over w = 1 to V of exp(u_w^T v_c). Okay, I'm having real trouble writing here; look at the slides, where I wrote it neatly at home. So I want to work with these two terms now. At this point, part of it is easy, because here I just have a log of an exponential, and those two functions just cancel out and go away. And so then I want to get the partial derivative of u-outside transpose v-center with
respect to v-center. [01:10:44] And what you get for the answer to that is that it just comes out as u. Maybe you remember that, but if you don't, the thing to think about is: this is a whole vector, right? We've got a vector here and a vector here, so what this is going to look like is u1·v1 + u2·v2 + u3·v3, and so on. And what we're going to want to do is work out the partial derivative with respect to each element v_i. If you just think of the single-element derivative with respect to v1, well, it's going to be just u1, because every other term goes to zero; and if you worked it out with respect to v2, it would be just u2, and every other term goes to zero. And since you keep on doing that along the whole vector, what you're going to get out is the vector u1, u2, u3,
and so on, down the whole vector. [01:12:02] Okay, so that part is easy. But then we also want to work out the partial derivative of the other term, and at that point I maybe have to go to another slide. So we then want the partial derivative, with respect to v_c, of the log of the sum over w = 1 to V of exp(u_w^T v_c). At this point things aren't quite so easy, and we have to remember a little bit more calculus; in particular, what we have to remember is the chain rule. Here we have an inside function: we've got a function g of v_c, whose output we might call z, and then outside that we put an extra function f. And when we have something like that, what we get is that the derivative of f with respect to v_c is the derivative of f with respect to z times the
derivative of z with respect to v_c. [01:13:36] Right, that's the chain rule. So we're going to apply that here. First of all, we're going to take the derivative of the log, and the derivative of log is 1/x; you have to remember that, or look it up, or get Mathematica to do it for you, or something like that. So we're going to have one over the inside z part, the sum over w = 1 to V of exp(u_w^T v_c), and then that's going to be multiplied by the derivative of the inside part. So then we're going to have the derivative, with respect to v_c, of the sum over w = 1 to V of exp(u_w^T v_c). Okay, so that's made us a little bit of progress, but we've still got something to do here. And what we're going to do is notice: oh wait, we're again in a position to run the chain rule. So now we've got this
function. [01:15:13] Well, first of all, we can move the sum to the outside, because we've got a sum of terms, w = 1 to V, and we want to work out the derivative of the inside piece (sorry, I'm doing this kind of informally, just doing this piece now). Okay, so this again gives us a function f of a function g, and so we're going to want to split the pieces up and use the chain rule one more time. So we're going to have the sum over x = 1 to V, and now we have to know what the derivative of exp is, and the derivative of exp is exp, so that will be exp(u_x^T v_c); and then we're taking the derivative of the inside part with respect to v_c, of u_x^T v_c. Well, luckily, this was the bit that we already knew how to do, because we worked it out before, and so this is going to be the sum over x = 1 to V of exp(u_x^T v_c) times u_x. Okay, so then at this point we
want to combine these two forms together, so [01:16:39] we want to combine this part that we worked out and this piece here that [01:16:45] we've worked out, and if we combine them together with what we [01:16:52] worked out on the first slide for the numerator, since we have [01:17:00] u_o, which was the derivative of the numerator, then for the derivative [01:17:09] of the denominator we're going to have, on top, this part, and on the [01:17:17] bottom we're going to have that part, and so we can rewrite that as the sum over [01:17:23] x = 1 to V of exp(u_x^T v_c) times u_x, over the [01:17:40] sum over w = 1 to V of exp(u_w^T v_c). [01:17:56] Okay, so we can rearrange things in that form, and then lo and behold we find that [01:18:04] we've recreated here this form of the softmax equation, so we end up with [01:18:11] u_o minus the sum over x = 1 to V of the [01:18:18] probability of x given c, times u_x. [01:18:25] So what this is saying is we're
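The spoken math above is hard to follow without the slide, so here is a reconstruction of the calculation in the course's notation (this summary is mine, not verbatim from the slide): u_w are the outside vectors, v_c is the center vector, and o is the observed outside word.

```latex
% Chain rule on the log-denominator term:
\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c)
  = \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
    \sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x
  = \sum_{x=1}^{V} p(x \mid c)\, u_x

% Combining with the numerator term, whose derivative is u_o:
\frac{\partial}{\partial v_c} \log p(o \mid c)
  = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
```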
wanting [01:18:29] to have this quantity which takes the actual observed u vector and [01:18:35] compares it to the weighted prediction: [01:18:40] we're taking the weighted sum of our current u_x vectors, based on [01:18:46] how likely they were to occur. And so this is a form that you see quite [01:18:53] a bit in these kinds of derivations: you get observed minus expected, the [01:18:59] weighted average. And so what you'd like is for your expectation, the weighted [01:19:05] average, to be the same as what was observed, because then you'll get a [01:19:11] derivative of zero, which means that you've hit a [01:19:15] maximum. And so that gives us the form of the derivative [01:19:25] with respect to the [01:19:27] center vector parameters. To finish it [01:19:30] off you'd have to then work it out also for the outside vector parameters, but [01:19:34] hey, it's officially the end of class time, so I'd better
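The observed-minus-expected form can be checked numerically. This is a small sketch of my own (random toy vectors, not the lecture's code) comparing the analytic gradient u_o - sum_x p(x|c) u_x against a finite-difference estimate of the derivative of log p(o|c) with respect to v_c:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                              # toy vocabulary size and vector dimension
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w, one per row
v_c = rng.normal(scale=0.1, size=d)      # center vector
o = 2                                    # index of the observed outside word

def log_prob(v):
    """log p(o | c) = u_o . v - log sum_w exp(u_w . v)"""
    scores = U @ v
    return scores[o] - np.log(np.sum(np.exp(scores)))

# Analytic gradient: observed vector minus expected (probability-weighted) vector.
p = np.exp(U @ v_c)
p /= p.sum()                             # softmax p(x | c)
analytic = U[o] - p @ U                  # u_o - sum_x p(x|c) u_x

# Central finite-difference estimate, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

print(np.max(np.abs(analytic - numeric)))   # prints a very small number
```

The two gradients agree to within finite-difference error, which is the "computers will do this for you" point made just below.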
wrap up quickly now. But you know, the deal is we're going [01:19:42] to work out all of these derivatives for each parameter, and then these [01:19:48] derivatives will give a direction to change the numbers, which will let us find [01:19:53] good word vectors automatically. Um, I do want you to [01:19:58] understand how this works, but fortunately you'll find out very quickly [01:20:02] that computers will do this for you, and on a regular basis you don't actually [01:20:06] have to do it yourself. Um, more about that on Thursday. Okay, see you everyone.
================================================================================ LECTURE 002 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 2 - Word Vectors and Language Models
Source: https://www.youtube.com/watch?v=nBor4jfWetQ
---
Transcript
[00:00:05] Okay, I should try and get [00:00:08] started. Okay, so what we're going to do today is we're going to try and do [00:00:14] everything else that you need to know about word vectors and start to learn a [00:00:20] teeny bit about neural nets, and then we'll kind of get much further into sort of [00:00:24] doing
more with the math of neural nets [00:00:27] next week. So this is the general plan: um, I'm going to [00:00:33] finish up from where I was last time with optimization basics, then look a [00:00:38] little bit more at word2vec and word vectors, and then some of the variants of [00:00:43] word2vec. Um, and then I'm going to briefly consider alternatives, sort of: [00:00:48] what can you get from just counting words in different ways? Um, then we're [00:00:52] going to go on and talk a little bit about the evaluation of word vectors, um, [00:00:58] the topic of word senses, which already came up a couple of times last time [00:01:03] when people were asking questions, and then towards the end, um, start to [00:01:08] introduce the idea of classification, doing neural classification, and what neural [00:01:15] networks are about, which is something that we'll then expand on more in the [00:01:19] second week. Um, before I get into that, just notes
on course organization. Um, so [00:01:26] remember, the first assignment is already out and it's due before class next [00:01:33] Tuesday. Um, so then, our Python review session is going to be taught this [00:01:39] Friday, 3:30 to 4:20. It's not going to be taught here; it's going to be taught [00:01:44] in Gates B01, the Gates basement. Um, and I encourage everyone again to come to [00:01:50] office hours and help sessions. They've already started; they're listed on the [00:01:54] website. Um, we're having these sort of [00:01:58] office-hour help sessions in classrooms with multiple TAs, so just turn up if [00:02:05] you're on campus and you can be helped, and if you are on campus we'd like you [00:02:09] to just turn up, but we do also have a Zoom option for Stanford Online [00:02:16] students. Um, finally, I have office hours, which I have not yet opened but I will [00:02:21] open sometime tonight. Um, they're going to be on Monday afternoons. Now obviously, [00:02:27] given the number of people, not everyone can make it into my office hours, and I'm
[00:02:31] going to do these by appointment, so [00:02:33] they're 15-minute appointments on [00:02:36] Calendly. Um, but you know, I'm very happy to talk to some people. Um, and [00:02:43] uh, I put this little note at the end saying don't hog the slots. Um, some [00:02:47] people think it'd be a really good idea if they really work out how to sign up [00:02:51] every week for an office-hour session with me, and that's sort of a [00:02:56] little bit antisocial, um, so [00:03:00] think about that. Um, okay. So at the end of last time, I did a sort of bad job of [00:03:07] trying to write on slides, working out the derivatives of word2vec, [00:03:13] um, and hopefully you can read it much more clearly in the version [00:03:17] that appears on the website, where I was doing it at home more carefully. Um, so [00:03:23] that was saying that we had this loss function, and our job was to work out its [00:03:28] derivatives, which would tell us which direction to go to walk downhill. Um, and
[00:03:35] so I didn't really quite finish the loop here. So, you know, we have some cost [00:03:40] function that we want to minimize, and then we work out the gradient of that [00:03:45] function to work out which direction is downhill, and then [00:03:50] the simplest algorithm is that [00:03:58] we work out the direction downhill, we walk a little bit in that direction, [00:04:03] and then we repeat: we work out the gradient at this point, we walk downhill [00:04:05] a little bit, and we keep on going, and we'll get to the minimum. And with a sort [00:04:11] of one-dimensional function like this it's very simple, we're just [00:04:14] walking downhill, but when we have a function of many, many dimensions, when we [00:04:20] calculate the gradient at different points we might be starting to walk [00:04:24] in different directions, and so that's why we need to do calculus and have [00:04:28] gradients. And so this gives us the
basic algorithm of gradient [00:04:34] descent. Um, and so under the gradient descent algorithm, what we're doing is [00:04:41] that we've got our loss function J, we're working out its gradient, um, and [00:04:49] then we're taking a little multiple of the gradient, where that [00:04:54] alpha is our step size or learning rate. [00:04:57] Normally alpha is a very small number, something like 10^-3 [00:05:01] or 10^-4 or maybe even 10^-5, so we're taking a really little bit of the [00:05:07] gradient, and then we're subtracting it from our parameters to get new [00:05:13] parameters, and as we do that we will walk downhill. And the reason why we want [00:05:19] to have a small learning rate is we don't want to walk too fast: if from [00:05:23] here we worked out the gradient and said it's in this direction, and we just kept [00:05:28] on walking, we might end up way [00:05:31] over here, or if we had a really big step size we might even end up at a worse [00:05:35] point than
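The update rule just described (new parameters = old parameters minus alpha times the gradient) can be sketched in a few lines. This toy example is my own, not the lecture's code: it minimizes J(theta) = (theta - 3)^2 by walking downhill in small steps.

```python
# Toy gradient descent: minimize J(theta) = (theta - 3)^2.
# The gradient is dJ/dtheta = 2 * (theta - 3); with a small learning
# rate alpha we repeatedly take theta <- theta - alpha * gradient.
theta = 0.0
alpha = 0.1            # step size / learning rate
for _ in range(200):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad

print(theta)           # converges to roughly 3.0, the minimum of J
```

With a much larger alpha the iterates would overshoot, which is exactly the "end up at a worse point" failure mode described above.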
we started with. So we want to [00:05:37] take little steps to walk downhill, and so that's the very basic gradient [00:05:42] descent algorithm. Now, the very basic gradient descent algorithm we never use; [00:05:50] what we actually use is the next thing up, which is called stochastic gradient [00:05:55] descent. So the problem is, for the basic gradient descent algorithm, we've worked out, [00:06:03] for an entire set of data, what the objective function is and what the [00:06:12] slope at the point of evaluation is, and in general we've got a lot of data on [00:06:21] which we're computing models. So simply trying to calculate our objective [00:06:26] function over all of our data, the training data for the model, [00:06:32] would take us a very, very long time, um, and so that's very expensive to [00:06:38] compute, and we'd wait a very long time before we make even a single step [00:06:43] of gradient update. Um, so for neural nets, what
you're always doing is using this [00:06:48] variant that's called stochastic gradient descent. And so for stochastic [00:06:52] gradient descent, what that means is we pick a very small subset of our data, [00:06:57] like maybe we pick 16 or 32 data items, and we pretend that's all of our data, [00:07:04] and we evaluate the function J based on that small subset and work out the [00:07:09] gradient based on that small subset. So it's a noisy, inaccurate estimate of the [00:07:14] gradient, and we use that to be the direction in which we walk. Um, so that's [00:07:21] normally referred to also as having mini-batches, or mini-batch gradient [00:07:27] descent. Um, and in theory, working out the gradient based on this small subset [00:07:36] is an approximation, but one of the interesting things in the way things [00:07:41] have emerged in neural-network land is that it turns out neural networks actually [00:07:45] often work better when you throw some noise into the system: having this [00:07:50] noise in the system gives you jiggle and
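A minimal sketch of the mini-batch idea (toy data and model of my own, not from the course): each step estimates the gradient from a random handful of examples instead of the full dataset, so updates are cheap but noisy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem: find w minimizing mean((x_i * w - y_i)^2)
# over N examples. The data is built so the answer is w = 2.
N = 1000
x = rng.normal(size=N)
y = 2.0 * x

w = 0.0
alpha = 0.05
batch = 32                                  # a small subset, e.g. 16 or 32 items
for step in range(500):
    idx = rng.integers(0, N, size=batch)    # pick a random mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (xb * w - yb) * xb)  # noisy estimate of the full gradient
    w -= alpha * grad                       # walk a little bit downhill

print(w)   # close to 2.0
```

Each step touches 32 examples instead of 1000, yet the noisy direction is still, on average, downhill.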
moves things around, and so actually [00:07:57] stochastic gradient descent not only is way, way faster, but actually works better [00:08:02] as a system for optimization of neural networks. Okay, so if you remember from [00:08:09] last time, for word2vec the idea was we started by just saying each word has [00:08:16] a random vector representing it, so we will literally just get random [00:08:23] small numbers and fill up the vectors with those random small numbers. Um, [00:08:27] there's an important point there, which is you do have to initialize your [00:08:32] vectors with random small numbers: if you just leave all the vectors as zero, [00:08:37] then nothing works. Um, and that's because if everything starts off the same, you [00:08:44] get these sort of false symmetries, which means that you can't learn. So you always [00:08:49] do want to be initializing your vectors with random numbers. And then we're going [00:08:54] to go through each position in the [00:08:56] corpus
using our estimates, we're going [00:08:59] to try and predict the probability of words in the context, as we talked [00:09:03] about last time. So that gives us an objective function, from which we can [00:09:09] then look at our errors, look at our gradient, um, and update the vectors so [00:09:15] that they learn to predict surrounding words better. And so the incredible thing [00:09:20] is that we can do no more than that, and we end up learning word vectors which [00:09:27] actually capture quite a lot of the semantics, the meaning and relationships [00:09:32] between different words. So, you know, when this was first [00:09:38] discovered for these algorithms, I mean, it really feels like magic that you [00:09:43] can just do this simple [00:09:47] math over a lot of text and actually learn about the meanings of words; it's [00:09:53] just sort of surprising that something so simple could work. But [00:09:58] as time has gone on, this same recipe has
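The earlier claim that all-zero initialization breaks learning can be made concrete with the center-vector gradient from last lecture, u_o - sum_x p(x|c) u_x. This is a toy check of my own: with all vectors zero, every dot product is zero, the predicted distribution is uniform, and the gradient is exactly zero, so no update ever happens.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, o = 5, 3, 1        # toy vocabulary size, vector dimension, observed word index

def center_grad(U, v_c, o):
    """Gradient of log p(o|c) w.r.t. the center vector: u_o - sum_x p(x|c) u_x."""
    p = np.exp(U @ v_c)
    p /= p.sum()
    return U[o] - p @ U

# All-zero initialization: the softmax is uniform and the gradient is
# exactly zero, so gradient descent can never change anything.
U0, v0 = np.zeros((V, d)), np.zeros(d)
print(np.linalg.norm(center_grad(U0, v0, o)))   # 0.0

# Small random initialization breaks the symmetry: the gradient is nonzero.
U1 = rng.normal(scale=0.1, size=(V, d))
v1 = rng.normal(scale=0.1, size=d)
print(np.linalg.norm(center_grad(U1, v1, o)))   # > 0
```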
then been used for all kinds of learning [00:10:03] about [00:10:04] the behavior of language from neural [00:10:07] networks. Um, so let's just go through a sense of how that is, but before we do [00:10:14] that, [00:10:16] let me just mention: for our word2vec [00:10:19] algorithms, the only parameters of the model are these word vectors, the [00:10:25] outside word vectors and the center word vectors, which we actually treat as [00:10:30] disjoint, as I mentioned last time. And when we do the computations, we're [00:10:35] considering the dot product between the various possible outside words and [00:10:43] our center word, and we're using those to get a probability distribution over how [00:10:48] likely the model thinks that different outside words were, and then we're [00:10:53] comparing that to the actual outside word in the context, and that gives us [00:10:58] our source of error. So as such, this is what's referred to in NLP as a bag-of-words [00:11:04] model: it doesn't actually [00:11:06] know about
structure of sentences [00:11:08] know about the structure of sentences and or even what's to the left and [00:11:10] and or even what's to the left and what's to the right it's predicting [00:11:12] what's to the right it's predicting exactly the same probabilities at each [00:11:15] exactly the same probabilities at each position to the left or right um but [00:11:17] position to the left or right um but it's wanting to know about what kind of [00:11:19] it's wanting to know about what kind of words appear in the context of the [00:11:21] words appear in the context of the center word um so I just wanted to uh [00:11:26] center word um so I just wanted to uh stop this for a minute and um [00:11:30] stop this for a minute and um let's see not that one [00:11:38] um so give you some kind of a sense that [00:11:42] um so give you some kind of a sense that this really um does work um so this is a [00:11:45] this really um does work um so this is a little Jupiter notebook um that I've got [00:11:48] little Jupiter notebook um that I've got um for this [00:11:51] um for this um okay and so this is using and here [00:11:54] um okay and so this is using and here I'm using a package um gen Sim which we [00:11:57] I'm using a package um gen Sim which we don't continue to use after that really [00:12:02] don't continue to use after that really um but it's sort of one package that let [00:12:04] um but it's sort of one package that let you load and play with word vectors and [00:12:08] you load and play with word vectors and the word vectors I'm going to use here [00:12:10] the word vectors I'm going to use here are are glove word vectors and actually [00:12:13] are are glove word vectors and actually I'm going to um glove was a model we [00:12:16] I'm going to um glove was a model we built at Stanford and I'm going to [00:12:18] built at Stanford and I'm going to actually talk about it a little bit [00:12:19] actually talk about it a little bit later um so strictly speaking 
um, these [00:12:23] aren't exactly word2vec word vectors, but they behave in exactly the same way. [00:12:28] Um, and so, okay, now it's loaded up my word vectors, because the word vectors [00:12:34] are a big data file. And as we've [00:12:37] discussed, um, for a word, right, the representation of any word, here it's "bread", [00:12:44] is just a vector of real numbers. [00:12:48] Right, so I'm using 100-dimensional word vectors to keep things quicker [00:12:54] for my class demo. So this is the word "bread", um, and then I can say, well, what's [00:13:00] the representation for [00:13:02] "croissant"? [00:13:04] Um, and this is croissant, and we can sort of get a visual sense of: oh, they're [00:13:11] at least a little bit similar, right? The first components are both negative, [00:13:16] the second components are both positive, the third components are both negative [00:13:21] and large, the fourth components are both positive. Right, they seem like [00:13:26] they're kind of similar
vectors. So that seems kind of hopeful, because that means [00:13:31] that it knows that bread and croissant are a bit similar to each other. Um, [00:13:37] this package has a nice simple function where, rather than doing that by [00:13:41] hand, you can just ask it about all the word vectors and say which ones are most [00:13:47] similar. So I can ask it, um, what words in its vocabulary are most similar to "usa", [00:13:54] and in this model everything's been lowercased, I should mention, um, and so if [00:13:59] I do that, it has "canada", "america", "u.s.a.", [00:14:03] then "united states", "australia". Well, those seem a fairly reasonable list of most [00:14:08] similar words, though you might think it's a little strange that "canada" wins [00:14:12] out over the "u.s.a" with the dots in it. [00:14:16] Um, similarly, I can ask what's most similar to "banana", and I get coconut, [00:14:21] mango, bananas, potato, pineapple, fruit, etc.; [00:14:25] again, pretty sensible, you know, with a little bit of a bias to more tropical fruits. Or [00:14:30] I can go to croissant and ask
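Under the hood, Gensim's most_similar is essentially a cosine-similarity ranking over all the word vectors. Here is a self-contained sketch of my own of that computation, using made-up 4-dimensional toy vectors (not the real 100-dimensional GloVe numbers):

```python
import numpy as np

# Made-up "word vectors" purely for illustration; real GloVe vectors
# are learned from corpus co-occurrence statistics.
vecs = {
    "bread":     np.array([-0.2,  0.4, -0.7,  0.3]),
    "croissant": np.array([-0.1,  0.3, -0.6,  0.4]),
    "banana":    np.array([ 0.5, -0.2,  0.1,  0.6]),
    "coconut":   np.array([ 0.4, -0.1,  0.2,  0.5]),
}

def most_similar(word, k=3):
    """Rank all other words by cosine similarity, like Gensim's most_similar."""
    v = vecs[word]
    sims = {}
    for w, u in vecs.items():
        if w == word:
            continue
        sims[w] = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(most_similar("bread"))   # "croissant" ranks first
```

With the real vectors loaded into Gensim, the call in the demo is just model.most_similar("bread").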
what's most [00:14:33] similar to croissant; the most similar things [00:14:35] to croissant aren't bread, but things like [00:14:37] brioche, baguette, focaccia, um, which sort of basically makes sense, though here's [00:14:42] "pudding" here. Um, and I've got... wait, I'd [00:14:46] already done... oh, sorry, yeah, I remember [00:14:49] what this is, right. So with this most_similar [00:14:52] you've got a positive word vector, and you're saying what other [00:14:56] words are most similar in position to [00:14:58] that. Um, there's something else you can do, which is you can say: [00:15:04] let me take the negative of that word vector and say what's most similar to the [00:15:09] negative of it, and you could possibly think that that might be useful to find [00:15:14] antonyms or something like that. I mean, the truth is, it isn't: if you ask for the [00:15:20] things that are most similar to the negative of the banana vector, um, and for [00:15:24] most other vectors it's the same, you get [00:15:27] out these weirdo things, things that [00:15:29] you're
You're not really sure if they're words at all, or maybe they're words in some other language, or some of them are names, right, like Shichi is a Japanese name, but it's not very useful stuff; they don't really feel like a negative of banana. [00:15:47] But it turns out that from there we get this powerful ability that was observed for word2vec, which is that we could isolate semantic components and then put them together in interesting ways. [00:16:04] So, looking at this picture, what we could do is start with a positive vector for king, from the origin up to king; then we could use the negation to subtract out the vector for man; and then we could add on another positive vector, the vector for woman. [00:16:25] And then we can ask the model: if you're over here in the space, what is the nearest word to you over there? [00:16:38] And that's what this next thing does, right: it says positive vector for king, negative for man, also positive for woman; where does that get you to? And that gets you to queen, yay. [00:16:50] And so this was the most celebrated property that was discovered with these word vectors: they weren't only good for meaning similarity, they were also good for working with these kinds of meaning components. These got referred to as analogies, because you can think of them as "a is to b as c is to what?" So it's sort of like, man is to king as woman is to what, in the analogies. [00:17:29] And so here I've defined a little function that just automates that and will compute analogies. So now I can ask it, in just this analogy format: man is to king as woman is to, and it comes back with queen.
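The analogy helper he demonstrates amounts to a most_similar query with positive = [b, c] and negative = [a]: find the word nearest to vec(b) − vec(a) + vec(c), excluding the three input words. A self-contained sketch, where the four toy vectors are hypothetical, chosen so the arithmetic works out exactly:

```python
import numpy as np

# Hypothetical toy vectors, laid out so the gender and royalty
# "components" line up; the demo uses real trained embeddings.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def analogy(a, b, c):
    """a is to b as c is to ?  (nearest word to vec(b) - vec(a) + vec(c))."""
    query = vectors[b] - vectors[a] + vectors[c]
    best, best_cos = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # conventionally, never return one of the input words
        cos = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if cos > best_cos:
            best, best_cos = word, cos
    return best

print(analogy("man", "king", "woman"))  # -> queen with these toy vectors
```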
[00:17:46] That one was sort of the canonical example, but you can actually have some fun with this. I mean, this is pretty old-fashioned stuff; I feel like I'm maybe, at this point, an old guy talking about how much fun we used to have sitting around the radio listening to radio plays, because basically no one uses this stuff anymore, and there are much, much better and fancier things like ChatGPT. But back in the day, when I was younger, it was really stunning already just how this very simple model, built on very simple data, could have quite good semantic understanding and do quite good analogies. [00:18:37] So you can actually play with this quite a bit and have a bit of fun. You can do something like: analogy, Australia is to beer as France is to, okay, what do people think the answer will be? [00:18:54] Close. The answer it gives us is champagne, and that seems a pretty good answer. I could then put in Russia; what do people think? Vodka, yeah, you can get back vodka. This actually works kind of interestingly. [00:19:15] I could do a different one, test something different. I can do something like: pencil is to sketching as camera is to, photographing. Yeah, that works quite well. [00:19:34] Now, we built this model in 2014, so it's a little bit out of date on politics; we can't do the last decade of politics, which is maybe unfortunate, but we could try out older politics questions. So we could try: Obama is to Clinton as Reagan is to, if you remember your US history class, any guesses what it's going to say? There's a Bush, one. Any other ideas? Some people have different opinions of Bill Clinton.
[00:20:22] What it answers is Nixon, which I think is actually kind of fair. [00:20:31] But you can also get it to do some just sort of syntactic facts about the language. So you can do something like: tall is to tallest as long is to, oops, this one's easy. [00:20:53] So with this simple method of learning, with this simple bag-of-words model, it's enough to learn a lot about the semantics of words, and stuff that's beyond conventional semantics too, right? Like our example with Australia is to beer as Russia is to vodka; that's sort of cultural world knowledge, which goes a little bit beyond what people normally think of as word-meaning semantics, but it's also in there. [00:21:26] Yes? If you subtract the distance from, let's say, man to king, does that also capture a concept of the relationship between the two words? Like, would that give you back something like ruler? Does taking the difference between two vectors capture some... [00:21:43] Right, the distance between man, so man compared to king, should be a ruler concept. But isn't that what I'm using? Because then I'm taking the distance between man and king, and that's what I'm adding on to woman to get to queen, right? Right, yeah. [00:22:09] So, depending on which thing you think of as the analogy, you've got both a difference vector between words that gives you a gender analogy and one that gives you a ruler analogy. Yeah, absolutely. [00:22:30] Any other questions? Yeah: in word2vec we get two vectors for each word, a u and a v, but here you only have one vector, so how do you go from two to one?
[00:22:49] Yeah, good question. The commonest way in practice is you just average the two of them, and really you find out that they end up very close anyway. [00:23:02] Because, if you think about it, since you're going along every position of the text, you'll get both cases: if the text is, you know, "the octopus has legs," you're going to have octopus in the center with legs in the context, and a couple of time steps later it's going to be legs in the center with octopus in the context. So, although they vary a bit, basically they end up very similar, and people normally just average them.
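That final averaging step is one line once training has produced the two embedding matrices; a sketch where the names U and V (matching the u/v notation in the question) and the random values are just stand-ins for trained center-word and context-word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4

# U: center-word ("inside") vectors, V: context ("outside") vectors.
# After training these end up very similar; random here just for shape.
U = rng.normal(size=(vocab_size, dim))
V = rng.normal(size=(vocab_size, dim))

# The single vector per word that people usually keep.
word_vectors = (U + V) / 2.0
print(word_vectors.shape)
```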
[00:23:29] Can you use this process, taking the answer of one analogy and placing it into the analogy function of another, and see how far away you can go before it starts to break down? [00:23:44] I think you can. So, wait, you're wanting, how far away, how distant a relation between two words can you do before it starts giving incorrect relationships? Are you wanting to make two steps from somewhere, or, yeah, step many steps and go away from the start? [00:24:09] So, it doesn't always work; I mean, there are examples that fail. I'm sort of shy to try that now, because I don't have a predefined function that does it, and that might take me too long, but you could play with it at home and see how it works for you. [00:24:29] Just as a clarification: why is it that we use two separate sets of vectors for each word? Is it just to get more parameters, or is there... I'll get back to that; maybe I should go on at this point. Let me move on and get through some more details of the word2vec algorithm. [00:24:52] So, just a technical point on this class, so you don't make any big mistakes and waste your weekend.
[00:25:03] For most instances of CS224N, we've actually had people implement word2vec from scratch as assignment 2. But for this quarter, doing it in spring quarter, and as you probably know, spring quarter is actually a little shorter than the other two quarters, we decided to skip having people implement word2vec. So don't look at the old assignment 2 that says "implement word2vec," or else you'll be misspending your time; wait for the newer assignment 2 to come out. [00:25:31] But despite that, let me just say a little bit more about some of the details. So, why two vectors? Having two vectors just makes the math a little bit easier. [00:25:46] Think about the math: if you have the same vectors for the center word and for the outside words, then whatever the center word is, let's say it's octopus, when you're going through trying out every possible context word for the normalization, at some point you'll hit octopus again, [00:26:07] and at that point you'll have a quadratic term, the x-squared of the octopus vector, and that kind of messes things up. I mean, you're clever people, you could work out the math of it, but it makes the math more of a mess, right, because every other term is something different, just like an x, and then at one position you've got an x-squared. [00:26:34] So they kept it really simple by just having them be disjoint vectors. It doesn't make it better; it actually turns out it works a fraction better if you do it right, but in practice people have usually just estimated them separately and then averaged them at the end. [00:26:55] If you actually look at the paper, and you can find it, the 2013 paper, there's actually
a family of methods that they describe. They described two methods: one in which you have an inside word that's predicting the words around it, and another that tried to predict the center word from all the words in the context, which was called continuous bag of words in their paper. [00:27:24] The one that I've described is skip-gram, which is simpler and works just great. [00:27:31] But then the other part of it is working out what loss function to use for training, and what I've presented so far is the naive softmax, where we just consider every possible choice of a context word and run all the math. [00:27:52] That's totally doable, and with our modern super-fast computers it's not even that unreasonable; we do things like this all the time. But at least at the time they wrote their paper, this seemed kind of expensive, and they considered other alternatives.
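For concreteness, the naive softmax being contrasted here is p(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c), where the denominator sums over the entire vocabulary. A toy numpy sketch, with a tiny vocabulary and random stand-in vectors, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U = rng.normal(size=(vocab_size, dim))  # outside (context) vectors u_w
v_c = rng.normal(size=dim)              # center word vector v_c

# Naive softmax: score every vocabulary word against the center word.
# With a real 400,000-word vocab this denominator is the expensive part.
scores = U @ v_c                        # one dot product per vocab word
p = np.exp(scores) / np.exp(scores).sum()

print(p.sum())  # probabilities over the whole vocab sum to 1
```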
[00:28:08] One of those is a hierarchical softmax, which I'm not going to explain right now, but I do just want to explain negative sampling. [00:28:17] Okay, so this is just to see a bit of a different way of doing things. For what we did last time, we had this straightforward softmax equation, and in the denominator you're summing over every word in the vocabulary. [00:28:36] You might have 400,000 words in your vocabulary, there are a lot of words in human languages, and that's kind of a big sum, especially when, for each element of the sum, you're taking a dot product between 100-dimensional or 300-dimensional vectors and then exponentiating it; a lot of math going on in there. [00:28:59] So maybe we could short-circuit that. The idea of negative sampling was to say: well, rather than evaluating it for every single possible word, maybe we
could just train some simple logistic regressions, which are going to say you should like the true word that's in the context, and, if we randomly pick a few other words, you shouldn't like them very much. And that's skip-gram negative sampling. [00:29:30] So that's what it looks like as an equation. We've got our center word and our actual context word, and we work out the term for the actual center word: we'd like this to be high probability, so, since we're minimizing, we're going to negate it and have it go down. Then we're going to sample some other words, and we'd like the opposite for them. [00:30:02] But the other thing that we've changed here is that we're not using the softmax anymore; we're using this sigma, which stands for the logistic function, often called the sigmoid. Sigmoid just means s-shaped, but
[00:30:18] you could actually have an infinity of s-shaped functions, and the one we actually use is the logistic function. The logistic function has this form, and it maps from any real number to a probability between zero and one. [00:30:36] So what we're wanting to say at that point is: for the real outside word, we're hoping that this dot product is large, so that its probability is near one, and that will help with the minimization. And for the other words, we'd like their probability to be small, so we'd like them to appear sort of over here. [00:31:05] And that's what this is calculating, though as written it's sticking the minus sign on the inside, which works because the logistic function is symmetric: if you want to be over here, then if you negate it, you'll be on this side, which will be large.
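Written out, the skip-gram negative-sampling objective for one (center, outside) pair is J = −log σ(u_o · v_c) − Σ_k log σ(−u_k · v_c), summing over the K sampled negatives. A numpy sketch with random stand-in vectors (all values illustrative, not trained):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, num_neg = 4, 5

v_c = rng.normal(size=dim)               # center word vector
u_o = rng.normal(size=dim)               # true outside word vector
U_neg = rng.normal(size=(num_neg, dim))  # K sampled negative word vectors

# Like the true pair: push sigma(u_o . v_c) toward 1.
# Dislike the sampled pairs: push sigma(-u_k . v_c) toward 1,
# i.e. sigma(u_k . v_c) toward 0. Minimizing J does both.
loss = -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-(U_neg @ v_c))).sum()
print(loss)
```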
[00:31:30] Okay, and so then the final bit of this, which is the asterisk, is that we're going to pick a few words, maybe only five or ten, as our negative samples. [00:31:38] For picking those words, what works well is not just to pick uniformly at random from all the 400,000 words in our vocab; what you basically want to do is pay attention to how common the words are. Something like "the" is a really common word, and so we refer to the unigram distribution; that means you're just taking individual words independently, by how common they are, so about 10% of the time you'd be choosing "the". [00:32:15] So that's roughly what you want to do for sampling, but people have found that you can actually do even a bit better than that. The standard thing they presented for word2vec is that you take the unigram probability of the word and raise it to the power of 3/4. [00:32:31] What does that end up
doing? [00:32:35] Question for the audience: if I take probabilities and raise them to the three-quarters power, what happens? Some less frequent words become more likely? Correct, yeah. [00:32:51] Raising to the 3/4 power means that you're somewhat upping the probability of the less frequent words. So you're sort of in between: between having every word uniform and using exactly their relative frequencies in the text, you're moving a little bit in the direction of uniform. And you get better results by going somewhat in the direction of sampling more uniformly, but you don't want to go all the way there, which would correspond to putting a zero in there rather than three-quarters. [00:33:34] Okay.
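The 3/4-power adjustment is easy to see with made-up counts: raise the raw unigram counts to 0.75 and renormalize, and the rare words' sampling probabilities rise while the frequent words' fall (the counts below are invented for illustration):

```python
import numpy as np

# Made-up corpus counts: the first word is very frequent, the last is rare.
counts = np.array([1000.0, 100.0, 10.0, 1.0])
unigram = counts / counts.sum()

# Raising to the 3/4 power flattens the distribution part-way toward
# uniform (power 0 would be fully uniform, power 1 the raw frequencies).
powered = counts ** 0.75
neg_sampling_dist = powered / powered.sum()

print(unigram)
print(neg_sampling_dist)
```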
[00:33:38] Yeah, okay, let's see, I had a slide here, but time rushes along, so let's not bother with that slide; it's not that important. [00:33:47] So that's the word2vec algorithm, and we've now seen it in its different forms. A reasonable thing to wonder at this point is: this seems kind of a weird way of doing what we're wanting to do, right? [00:34:07] The idea is: look, we have this text, we have words, and we have words in the context of words. It sort of seems like an obvious thing to do would be to say, let's just count some statistics. We have words, and there are other words that occur in their context, so let's just see how often the word swim occurs next to octopus, and how often the word fish occurs next to octopus; let's get some counts [00:34:38] and see how often words occur in the context of other words, and maybe we could use that to calculate some form of word vectors. [00:34:48] And that's something that people have also considered. If we use the same kind of idea of a context window, we could just make a matrix of how often words
occur [00:34:58] make a matrix of how often words occur in the context of other words and so you [00:35:01] in the context of other words and so you know here's a baby example my Corpus is [00:35:04] know here's a baby example my Corpus is I like deep learning I like NLP I enjoy [00:35:07] I like deep learning I like NLP I enjoy flying um and my context window I'm [00:35:10] flying um and my context window I'm using is just one word to the left and [00:35:12] using is just one word to the left and the right and then I can make this kind [00:35:15] the right and then I can make this kind of um co-occurrence count Matrix um [00:35:19] of um co-occurrence count Matrix um where I'm putting in the counts of [00:35:21] where I'm putting in the counts of different words in every context and you [00:35:24] different words in every context and you know because my Corpus is so small um [00:35:27] know because my Corpus is so small um every thing in the Matrix is a zero or [00:35:30] every thing in the Matrix is a zero or one except for right here where I've got [00:35:31] one except for right here where I've got the twos because I have I like twice [00:35:34] the twos because I have I like twice right but in principle I've got a matrix [00:35:37] right but in principle I've got a matrix of counts for all the different counts [00:35:39] of counts for all the different counts here um so maybe you know this gives [00:35:43] here um so maybe you know this gives this gives me a word Vector right you [00:35:45] this gives me a word Vector right you know here's a word Vector for deep um is [00:35:49] know here's a word Vector for deep um is this long Vector here and you know I [00:35:51] this long Vector here and you know I could just say that is my word vector [00:35:53] could just say that is my word vector and indeed sometimes people have done [00:35:56] and indeed sometimes people have done that but they're kind of of ungainly [00:35:58] that but they're kind of of ungainly 
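As a sketch of this counting idea (my own illustration, not code from the lecture), here is one way to build that co-occurrence count matrix for the toy corpus with a window of one word on each side:

```python
# Build the window-1 co-occurrence count matrix for the lecture's toy corpus.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = [[0] * len(vocab) for _ in vocab]
for sent in sentences:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):            # one word to the left and right
            if 0 <= j < len(sent):
                counts[idx[w]][idx[sent[j]]] += 1

# "I" and "like" co-occur twice, matching the twos in the slide's matrix
print(counts[idx["I"]][idx["like"]])        # -> 2
```

The row `counts[idx["deep"]]` is then exactly the "long vector" for "deep" described above.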
[00:36:02] Because if we have 400,000 words in our vocabulary, the size of this matrix is 400,000 by 400,000, which is a lot worse than our word2vec word vectors: if we're making those only 100-dimensional, we've only got 400,000 by 100, which is still a big number, but a lot smaller than 400,000 times 400,000. So that's inconvenient. When people have started with these kinds of co-occurrence matrices, the general thing they've done is to say: somehow we want to reduce the dimensionality of that matrix, so that we have a smaller matrix to deal with. And how can we reduce the dimensionality of the matrix? At this point, if you remember your linear algebra and stuff like that, you should be thinking of things like PCA, and in particular, if you want it to work for any matrix of any shape, there's the singular value decomposition.

[00:37:07] So the classic singular value decomposition: any matrix can be rewritten as a product of three matrices, a U and a V, which are both orthonormal, which means you get these independent vectors that are orthogonal to each other, and in the middle we have the singular values, which are ordered by size, with the most important one first; they're sort of weighting terms on the different dimensions. So this is the full SVD decomposition, but part of it is irrelevant, because if I've got this picture, nothing is happening in the part that's shown in yellow there. At the moment this is just a full decomposition, but if we're wanting to have smaller, low-dimensional vectors, well, the next trick we pull is to say: we know where the smallest singular values are, so we could just set them to zero. [00:38:18] If we do that, more of this goes away, and we end up with, say, a two-dimensional representation of our words. So that gives us another way of forming low-dimensional word representations. And this had actually been explored before modern neural word vectors, using algorithms such as latent semantic analysis. It sort of half worked, but it never worked very well. Some people, especially in psychology, kept on working on it, and among other people, in the early 2000s there was this grad student, Doug Rohde, who kept on working on it, and he came up with an algorithm that he called COALS. He knew, as other people before him had known, that just doing an SVD on raw counts didn't seem to give you word vectors that worked very well.
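The truncation trick described above, zeroing out the smallest singular values and keeping only the leading dimensions, can be sketched with NumPy (the matrix below is a made-up toy, not the lecture's):

```python
import numpy as np

# Toy symmetric co-occurrence matrix (rows/columns index words).
X = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])

U, S, Vt = np.linalg.svd(X)      # full SVD: X = U @ diag(S) @ Vt
k = 2                            # keep only the 2 largest singular values
word_vectors = U[:, :k] * S[:k]  # k-dimensional vector per word

print(word_vectors.shape)        # -> (4, 2)
```

Dropping the trailing singular values gives the best rank-k approximation of X in the least-squares sense, which is why this is the standard way to compress a count matrix.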
[00:39:28] But he had some ideas to do better than that. One thing that helps a lot is if you log the frequencies, so you can put log frequencies in the cells. He also used some other ideas, some of which were also picked up in word2vec. One is ramping the windows, so that you count closer words more than further-away words; he used Pearson correlations instead of counts, etc. He ended up coming up with a low-dimensional version of word vectors that is ultimately still based on an SVD. [00:40:07] And interestingly, no one really noticed at the time, but Doug Rohde, in his dissertation, effectively discovered this same property of having linear semantic components. Look, here we go, this is actually a picture from his dissertation: we've got this meaning component, which is "doer of an event," and he's essentially shown, with the way he's processed his word vectors, that the doer of an event is a linear meaning component that you can use to move between a verb and the doer of the verb. Kind of cool, but he didn't become famous, because no one was paying attention to what he had come up with.

[00:40:52] So once word2vec became popular, that was something that I was kind of interested in, and working together with a postdoc, Jeffrey Pennington, we thought that there was interest in this space of doing things with matrices of counts, and in how you then get them to work well as word vectors, in the same way that word2vec worked well. And that's what led into the GloVe algorithm, which is what I was actually showing you earlier. What we wanted was a model in which linear components, adding or subtracting a vector in the vector space, correspond to a meaning difference. How can we do that? Jeffrey did some good thinking and math, thought about that for a bit, and his solution was to say: ratios of co-occurrence probabilities can encode meaning components, so if we can make a ratio of co-occurrence probabilities into something linear in the vector space, we'll get the kind of result that word2vec or Doug Rohde got.

[00:42:20] So what does that mean? Well, if you start thinking of words occurring in the context of "ice," you might think that "solid" and "water" are likely to occur near "ice," and that "gas," or a random word like "random," aren't. Similarly for "steam": you'd expect that "gas" and "water" are likely to occur near "steam," but probably not "solid" or "random." [00:42:51] If you're just looking at one of these probabilities, you don't really get meaning components, because you just get something that's large here or large there. But if you then look at the ratio of two of these co-occurrence probabilities, what you get out is that for "solid" it's going to be large, and for "gas" it's going to be small, so you're getting a direction in the space that corresponds to the solid-liquid-gas dimension of physics, whereas for the other words it will be about one. That's just waving your hands, the conception of the idea, but if you actually do the counts, it works out: using real data, you do indeed get these sorts of factors of 10 in both directions for those two, and the other numbers are approximately one.

[00:43:57] So Jeffrey's idea was: we're going to start with a co-occurrence count matrix, and we want to turn this into a linear component. How do you do that? Well, first of all, it makes sense immediately that you should be putting a log in, because once you put a log in, this ratio is turned into something that's subtracted. So all you have to do is have a log-bilinear model, where the dot product of two word vectors models this conditional probability, and then the difference between two vectors will correspond to the log of the ratio of their co-occurrence probabilities. [00:44:44] That was basically the GloVe model: you want to model this dot product such that it's close to the log of the co-occurrence probability, but you do a little bit of extra work to add some bias terms and some frequency thresholds, which aren't very important.
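As a rough illustration of that objective (a toy sketch under my own assumptions: tiny made-up counts, plain gradient descent, nothing like the paper's training setup), the idea is to make the dot product plus biases match the log co-occurrence count, weighted by a capped function of frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[10., 2., 1.],      # toy co-occurrence counts (made up)
              [ 2., 8., 3.],
              [ 1., 3., 6.]])
logX = np.log(X)
V, d, lr = 3, 4, 0.05

W  = rng.normal(scale=0.1, size=(V, d))   # word vectors
Wc = rng.normal(scale=0.1, size=(V, d))   # context vectors
b, bc = np.zeros(V), np.zeros(V)          # bias terms
f = np.minimum(1.0, X / X.max()) ** 0.75  # frequency weighting, capped at 1

def loss():
    err = W @ Wc.T + b[:, None] + bc[None, :] - logX
    return float((f * err**2).sum())

loss0 = loss()
# Minimize sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
for _ in range(2000):
    err = W @ Wc.T + b[:, None] + bc[None, :] - logX
    g = f * err                           # weighted residuals
    W, Wc = W - lr * g @ Wc, Wc - lr * g.T @ W
    b, bc = b - lr * g.sum(axis=1), bc - lr * g.sum(axis=0)

print(loss() < loss0)                     # -> True: the fit improves
```

Because the dot products approximate log counts, a difference of two word vectors then approximates a log of a ratio of co-occurrence probabilities, which is exactly the linear-meaning-component property described above.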
[00:45:07] So I'm going to skip past those bias and threshold details, but I think that basic intuition about what the important thing is to get linear meaning components is a good one to know about. Okay, is everyone good today? Cool. Yes?

[Student] I noticed the original X matrix you showed was like 3 by 5 or something. Shouldn't it be square?

Yeah, maybe I should have just shown you a square one. If you're just doing vocabulary by vocabulary, yes, it should be square. There was a bit in the slides that I didn't mention: there's another way you could do it, where you do words versus documents, and then it would be non-square. But yeah, you're right, so let's just consider the square case.

[00:46:00] Okay. So, hey, I showed you that demo of the GloVe vectors, and they worked great, didn't they? So these are good vectors. But in general in NLP we'd like to have things that we can evaluate, and know whether things are really good. Everywhere through the course we're going to want to evaluate things and work out how good they are, and what's better and what's worse. One of the fundamental notions of evaluation that will come up again and again is intrinsic versus extrinsic evaluation. An intrinsic evaluation is where you're doing a very specific internal subtask, and you just try to score whether it's good or bad. Normally intrinsic evaluations are fast to compute and help you understand the component you're building, but they're sort of distant from your downstream task, and improving the numbers internally may or may not help you. That's the contrast with an extrinsic evaluation, [00:47:13] where you've got some real task you want to do, question answering or document summarization or machine translation, and you want to know whether some clever bit of internal modeling will help you on that task. Then you have to run an entire system and work out downstream accuracies, and find out whether it actually helps you at the end of the day. But that often means it's kind of indirect, so it's harder to see exactly what's happening. So for something like word vectors: if we just measure whether they're modeling word similarity well, that's an intrinsic evaluation, but we'd probably like to know whether they model word similarity well for some downstream task, which might be doing web search. We'd like "cell phone" and "mobile phone" to come out at about the same, so web search [00:48:17] might be our extrinsic evaluation.

Okay, so for word vectors, here are two intrinsic evaluations, the ones we've already seen. First, there are the word vector analogies. I cheated when I showed you the GloVe demo: I only showed you ones that work, but if you play with it yourself, you can find some that don't. So what we can do is have a set of word analogies and find out which ones work. Now, in general GloVe does work. Here's a set of word vectors showing the sort of male-female distinction, and it's kind of good and linear. But for different analogies it's sometimes going to work and sometimes not, and you're going to be able to score what percentage of the time it works. [00:49:12] Or we can do word similarity. How we do word similarity is we actually use human judgments of similarity: psychologists ask undergrads how similar two words are.
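Before moving on to similarity, the analogy test just described can be sketched as follows (the 2-D toy vectors are hypothetical, purely for illustration, not real GloVe embeddings): for a : b :: c : ?, take b - a + c and return the nearest remaining word by cosine similarity.

```python
import numpy as np

vectors = {                          # hypothetical toy embeddings
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.5, 0.8]),
    "woman": np.array([0.5, 0.2]),
    "apple": np.array([0.1, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """a : b :: c : ?  via  b - a + c, excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "woman", "king"))   # -> queen
```

Scoring the percentage of analogies answered correctly over a fixed test set is exactly the intrinsic evaluation described above.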
[00:49:26] They say: here are the words "plane" and "car"; how similar are they on a scale of 1 to 10, or 0 to 10? Actually, I think it's 0 to 10 here. And the person says seven, and then they ask another person, and they average what the undergrads say, and they come out with these numbers. So "tiger" and "tiger" gets 10, "book" and "paper" got an average of 7.46, "plane" and "car" got 5.77, "stock" and "phone" got 1.62, and "stock" and "jaguar" got 0.92. It's a noisy process, but you roughly get to see how similar people think words are. So then we ask our models to also score how similar they think words are, and we measure how well the scores are correlated between the human judgments and our models' judgments. [00:50:26] And so here's a big table of numbers that we don't need to go through all of, but it sort of shows that a plain SVD works terribly.
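That correlation step can be sketched like this. The human scores are the ones quoted above, while the model cosines are hypothetical numbers; a Spearman rank correlation (implemented here with NumPy only, ignoring ties) is a common choice for this comparison:

```python
import numpy as np

# Averaged human similarity judgments quoted above (0-10 scale).
human = np.array([10.0, 7.46, 5.77, 1.62, 0.92])
# Hypothetical model cosine similarities for the same word pairs.
model = np.array([0.99, 0.81, 0.63, 0.20, 0.05])

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

print(spearman(human, model))   # -> 1.0 (the model ranks the pairs identically)
```

A rank correlation is used because only the ordering of the pairs needs to agree; the human 0-10 scale and the model's cosine scale are not directly comparable.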
[00:50:38] Simply doing an SVD over log counts already starts to work reasonably. Then here are the two word2vec algorithms, CBOW and skip-gram, and here are numbers from our GloVe vectors; you get these kinds of scores, with which you can rate different models as to how good they are.

[00:51:00] And then you can also... oh, sorry, yeah, that's the only thing I have there. But what can you do for downstream evaluation? Well, then you want to pick some downstream task, and a simple downstream task that's been used a lot in NLP is what's called named entity recognition. That's recognizing names of things and what type they are. So if the sentence is "Chris Manning lives in Palo Alto," you want to say "Chris" and "Manning," that's the name of a person, and "Palo" and "Alto," that's the name of a place. So that can be the task, and that's the kind of task [00:51:42] which you might think word vectors would help you with, and it's indeed the case. So what's labeled "discrete" here was a baseline symbolic, probabilistic named entity recognition system, and by putting word vectors into it you can make the numbers go up: these numbers for GloVe are higher than the ones on the first line, so I'm getting substantial improvements from adding word vectors to my system. Yay.

[00:52:18] Okay, I'll plow ahead into the next thing. This next one I think is interesting, we should spend a minute on it, and it came up in your questions last time: words have lots of meanings. Most words have a whole bunch of meanings; the only words that don't have a lot of different meanings are some very specialized scientific words. Okay, so my example of a word with multiple meanings is probably not the first one you think of all the time.
the most famous example of a word with a lot of meanings is bank which already [00:52:52] lot of meanings is bank which already came up last time and I use star which [00:52:55] came up last time and I use star which is another one here's a word that you [00:52:57] is another one here's a word that you probably don't use that often um but it [00:52:59] probably don't use that often um but it you know it still has lots of meaning so [00:53:01] you know it still has lots of meaning so the word Pike what are some things that [00:53:03] the word Pike what are some things that the word Pike can [00:53:05] the word Pike can mean fish a fish yes it's a kind of fish [00:53:08] mean fish a fish yes it's a kind of fish okay we've got one what else can a pike [00:53:11] okay we've got one what else can a pike be yeah a spear a spear yeah for the [00:53:14] be yeah a spear a spear yeah for the Dungeons and Dragons crowd yeah there's [00:53:16] Dungeons and Dragons crowd yeah there's a long arm right yep that's another one [00:53:19] a long arm right yep that's another one yeah a road right yes so Pike is used as [00:53:23] yeah a road right yes so Pike is used as a shorthand well a shorthand for a ter [00:53:26] a shorthand well a shorthand for a ter Turn Pike why it's called a Turnpike [00:53:28] Turn Pike why it's called a Turnpike where yeah originally you had you know [00:53:30] where yeah originally you had you know this the spey looking thing um at the [00:53:33] this the spey looking thing um at the start of it as sort of count people okay [00:53:35] start of it as sort of count people okay we've got three other thing meanings for [00:53:37] we've got three other thing meanings for pike yeah is it also a crap like a [00:53:42] pike yeah is it also a crap like a [Music] [00:53:43] [Music] fraternity I'll believe you I can't say [00:53:45] fraternity I'll believe you I can't say I know that one [00:53:48] I know that one um are [00:53:51] um are Pikes sharp as like a 
needle something [00:53:55] Pikes sharp as like a needle something Sharp [00:53:57] Sharp maybe I mean I think it's really the [00:53:59] maybe I mean I think it's really the sort of Pike as the [00:54:02] weapon other scratch your heads um one [00:54:06] weapon other scratch your heads um one that I think a lot of you will have seen [00:54:09] that I think a lot of you will have seen um in diving and swimming you can do a [00:54:13] um in diving and swimming you can do a pike Olympics if you see olympic diving [00:54:17] pike Olympics if you see olympic diving there are Pikes anyone seen [00:54:20] there are Pikes anyone seen those um trust me that's a pike um okay [00:54:25] those um trust me that's a pike um okay um and we've sort of been doing um the [00:54:28] um and we've sort of been doing um the noun uses but you know you can also use [00:54:32] noun uses but you know you can also use Pike as a verb right you know like once [00:54:35] Pike as a verb right you know like once you've got your medieval weapon you can [00:54:37] you've got your medieval weapon you can Pike somebody um and that's a usage of [00:54:41] Pike somebody um and that's a usage of Pike um and you can do other ones right [00:54:44] Pike um and you can do other ones right so uh here we go here's [00:54:47] so uh here we go here's um um ones I got from a dictionary we [00:54:51] um um ones I got from a dictionary we got most of those there are sort of [00:54:53] got most of those there are sort of weirder usages right like coming down [00:54:55] weirder usages right like coming down the pike that's kind of a metaphorical [00:54:57] the pike that's kind of a metaphorical use that comes um from the the road [00:55:01] use that comes um from the the road sense but it sort of ends up meaning the [00:55:03] sense but it sort of ends up meaning the future um yeah um in Australia we also [00:55:07] future um yeah um in Australia we also use Pike to mean um sort of chicken out [00:55:10] use Pike 
[00:55:07] In Australia we also use "pike" to mean sort of chickening out of doing something, but I don't think that usage is really used in the US. Anyway, words have lots of meanings, so how can you deal with that? Well, one way you could deal with it is to say: okay, words have several meanings, so we're going to take instances of words in text, cluster them based on their similarity of occurrence to decide which sense of the word to regard each token as, and then learn word vectors for those token clusters, which are our senses. And you can do that — we did it in 2012, before word2vec came out. So you see here we have bank 1, and somewhere over here we have bank 2, and here we have Jaguar 1, Jaguar 2, Jaguar 3, Jaguar 4, and this really works out great.
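The clustering recipe just described — represent each occurrence of a word by its context, cluster the occurrences, treat each cluster as a sense — can be sketched with toy data. This is a tiny hand-rolled k-means on made-up "context vectors", not the 2012 system:

```python
import numpy as np

def kmeans(points, k=2, iters=20):
    """Tiny k-means; returns one cluster label per point."""
    # Crude farthest-point init so the starting centers are spread out.
    centers = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dists.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy stand-in for "instances of a word in text": each occurrence is a
# vector summarizing its context. Two well-separated blobs play the role
# of the two senses of "bank".
rng = np.random.default_rng(1)
finance_contexts = rng.normal(loc=+3.0, size=(20, 50))
river_contexts = rng.normal(loc=-3.0, size=(20, 50))
contexts = np.vstack([finance_contexts, river_contexts])

labels = kmeans(contexts, k=2)
# Each cluster of occurrences would then get its own "sense vector"
# (e.g. the cluster centroid, standing in for bank 1 vs. bank 2).
```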
[00:56:19] So Jaguar 1 picks out the sense of the kind of car — it's close to "luxury" and "convertible". Jaguar 2 comes right close to "software" and "Microsoft". This one's a bit of a historical one, but when most of you were five or whatever, you might remember Apple used to use large cats for versions of Mac OS — Mac OS 10.3 or something like that, a long time ago, was called Jaguar — so it's software, close to Microsoft. Jaguar 3: "string", "keyboard", "solo", "musical", "drum", "bass" — that's because there's a Jaguar keyboard. And then finally the sense we think of as the basic one, but which actually turns up rather less in text corpora: Jaguar next to "hunter" is the animal. So it's done a good job at learning the different senses. But you know, that's not what's actually usually done these days.
[00:57:28] Instead, what's usually done is that you only have one vector for Jaguar — or for pike here — and when you do that, the one vector you learn is a weighted average of the vectors that you would have learned for the senses. It's often referred to as a superposition, because somehow math people like to use physics terms, but it's a weighted average: you take the relative frequency of the different senses, multiply by the vectors you would have learned if you'd had sense vectors, and that's what you get as the representation of the word as a whole.

[00:58:18] And I can make a sort of linguistic argument as to why you might want to do that. Although this model of words having senses is very longstanding and common — it's essentially the way dictionaries are built, right: you look up a word in the dictionary and it says sense one, sense two, sense three —
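The weighted average just described can be made concrete. A minimal numeric sketch, with made-up sense vectors and frequencies for "pike":

```python
import numpy as np

# Hypothetical sense vectors for "pike" (toy 3-d vectors) and made-up
# corpus frequencies for how often each sense occurs.
v_fish, v_weapon, v_road = np.eye(3)
freqs = np.array([10.0, 5.0, 5.0])

weights = freqs / freqs.sum()             # relative frequency of each sense
senses = np.stack([v_fish, v_weapon, v_road])
v_pike = weights @ senses                 # the "superposition" vector
# v_pike is 0.5*v_fish + 0.25*v_weapon + 0.25*v_road
```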
and you get them for things like "bank" or the "Jaguar" we were talking about — it's really sort of a broken model. Word meanings have a lot of nuance; they're used in a lot of different contexts. There are extreme examples like "bank", where we have the finance bank and the bank of a river, where it seems like the senses are this far apart. But most words have somewhat different meanings that aren't actually that far apart, and trying to cut them into senses seems very artificial. And if you look at different dictionaries and ask how many senses a word has, pretty much every one will give you a different answer.

[00:59:33] So the kind of situation you have is a word like "field". Well, a field can be a place where you grow a crop; it can be used for natural things like a rock field or an ice field; it can be a sporting field; and there's the mathematical sense of field. Now, all of these things sort of have something to do with each other — the math one's further away, but the physical ones are all sort of flat spaces — yet the sense of it being a sporting field is clearly kind of different from the sense of it being an ice field. And are the ice field and the rock field different senses, or am I just modifying? So really what you have is what a math person would call something like a probability density distribution over the things that can be meant by a word. So it maybe makes more sense to use this model where you just say we have one vector that's an average over all the contexts — and we'll see more of that when we get to contextual word vectors later on.
[01:00:49] But one more surprising result on this: since you have the vector for pike overall being the sum of these different sense vectors, standard math would tell you that if you just have the single vector, there's no way you can recover the individual sense vectors. But higher math tells you that these vector spaces are so high-dimensional and sparse that you can use ideas from sparse coding theory to reconstruct the sense vectors out of the whole vector. If you actually want to understand this, some of the people in statistics — David Donoho, I think, is one of them — teach courses on sparse coding theory, but I'm not going to try and teach that. But here's an example from this paper by Arora et al., where one of the authors, Tengyu Ma, is now faculty in computer science here,
[01:02:03] and they start off with a word vector and use sparse coding to divide out sense vectors from the one word vector, and it works pretty well. So here's one sense of "tie", which is the piece of clothing; another sense of "tie", which is ties in a game; this one, I'll admit, is sort of similar to that one, but this sense of "tie" is the tie you put on your electrical cables; and then you have the musical sense of "tie". At least four out of five — they've done a pretty good job of getting senses out of this single word vector by sparse coding. So sparse coding must be cool, if you want to go off and learn more about it.

[01:02:53] Okay, so that's everything I was going to say about word vectors and word senses. Is everyone good — are there any questions? I'll rush ahead for the last two pieces.
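As a footnote to the sparse-coding result above: here is a toy illustration of how components of a superposed vector can be identified when the "sense" atoms are high-dimensional. This is a crude greedy matching pursuit on made-up data — not the algorithm from the Arora et al. paper, where the dictionary of atoms is itself learned:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_atoms = 100, 8

# A dictionary of unit-norm "sense" atoms (random stand-ins).
D = rng.normal(size=(n_atoms, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# A "word vector" that superposes senses 2 and 5 with different weights.
v = 0.7 * D[2] + 0.3 * D[5]

# Greedy matching pursuit: repeatedly take the atom most correlated with
# the residual and strip out its contribution.
residual, support = v.copy(), []
for _ in range(2):
    scores = D @ residual
    best = int(np.abs(scores).argmax())
    support.append(best)
    residual = residual - scores[best] * D[best]
# Because random high-dimensional atoms are nearly orthogonal, the two
# true senses end up in `support`.
```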
[01:03:14] Okay, so I just wanted to start to introduce, in the last 15 minutes, the ideas of how we can build neural classifiers, and how we start to build neural networks in general. In a sense we've already built a very simple neural classifier: our word2vec model is predicting what words are likely to occur in the context of another word, and you can think of that as a classifier. But let's look at a simple classifier like the named entity recognizers I mentioned before. For the named entity recognizer we want to label words with their class: we want to say these two words are a person, but the same words, "Paris" and "Hilton", are then locations in this second sentence. So words can be ambiguous as to what their class is. And the other state is that they're not a named entity at all — they're just some other word. This is something that's used in lots of places as a bit of understanding.

[01:04:24] If you've seen any of those web pages where they've tagged company names with a stock ticker, or where there are links on a page to a Wikipedia page, or something like that — you've got named entities, and commonly, after finding the named entities, you do a second stage of entity linking, where you link the named entity to some canonical form of it, like a Wikipedia page. But we're not going to talk about that second part for the rest of the day.

[01:04:58] So we could say that, building with our word vectors, we've got this simple task: we're going to look at a word in context — because sometimes "Paris" is the name of a person and sometimes it's a location — and we want to look at this word in its context and say: aha, this is the name of a location in this instance.
[01:05:27] So the way we're going to do it is to form a window classifier: we take a word with a couple of words of context on each side, and for the words in our context window we use our word vectors — because we want to show they're useful for something — and then we feed this into a classifier. Our classifier is actually going to be a really simple logistic classifier: we're only going to do location or not-a-location. So for this window here we want to say yes, it's a location, whereas if it had been "I love Paris Hilton greatly", we'd be saying no, because "Paris", the word in the middle of the context, isn't a location then. So that's the idea of classification, or a classifier: we're assigning some set of classes to things.
[01:06:33] In general, for classifiers, we do supervised learning, which means we have some labeled examples — our training data set. We have input items x_i, and for each one we've got a class y_i. So my example training examples were ones like "I love Paris Hilton greatly" — that was negative, not a location — and "I visit Paris every spring" — that's positive, it is a location — where I'm actually classifying the middle word. Okay: inputs and labels. In general the labels come from a set of classes; my set here is simply {location, not-a-location}, but I could get fancier and say I've got five classes — location, person name, company name, drug name, or other, not a name — and be assigning a bunch of different classes. But I'm going to do it with only two, because I'm using this example in next Tuesday's lecture as well, and I want to keep it simple.
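The windowed training setup can be sketched as follows — toy sentence and a hypothetical gold annotation, not the actual assignment data:

```python
PAD = "<pad>"

def windows(tokens, radius=2):
    """Yield a (2*radius + 1)-word window around each token."""
    padded = [PAD] * radius + tokens + [PAD] * radius
    for i in range(len(tokens)):
        yield padded[i : i + 2 * radius + 1]

sent = "I visit Paris every spring".split()
locations = {"Paris"}  # hypothetical gold annotation for this toy sentence

# One training example (x_i, y_i) per token: the window, plus whether
# its CENTER word is a location.
examples = [(w, int(w[2] in locations)) for w in windows(sent)]
# The window centered on "Paris" gets label 1; all the others get 0.
```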
[01:07:47] So that's what we're going to do, and what we're going to be using in our class is neural classifiers. I just wanted to quickly go through some food for thought as we go into it. For a typical stats/machine-learning classifier — you can build classifiers like logistic regression or softmax classifiers, or other ones like support vector machines or naive Bayes, or whatever else you might have seen — the vast majority of these are linear classifiers, meaning that they have a linear decision boundary. When we're learning these classifiers we're learning parameters W, but our inputs are fixed: the inputs are represented by symbols or quantities. So we have fixed inputs, we learn parameters — weights that are used to multiply the inputs — and then we use a linear decision boundary.

[01:08:56] When we have a neural classifier, we're getting some more power. First of all, we're not only learning weights W for our classifier, we're also learning distributed representations for our words: our word vectors re-represent the actual word symbols and can move them around in the space, so that in terms of the original space we've got a nonlinear classifier that can represent much more complex functions. We then use the word vectors to re-represent those words for a final classification, so at the end of our deep network — which we're about to build — we'll have a linear classifier in terms of our re-represented vectors, but not in terms of our original space. Let me try and be concrete about that.
[01:09:58] Okay, so here's what I'm going to use — and we'll use it again next Tuesday — as my little neural network. I start with some words: "museums in Paris are amazing". I first of all come up with the word embedding of those using my word vectors, so now I've got this high-dimensional vector which is just a concatenation of five word vectors — if I have 100-dimensional word vectors, this is 500-dimensional. Then I put it through a neural network layer, which is simply multiplying that vector by a matrix and adding on a bias vector, and then putting it through some nonlinearity, which might be, for example, the logistic function that we've already seen. That gives me a new representation; in particular, if the W is, say, 8 × 500, I'll be reducing it to a much smaller vector.

[01:11:09] Then after that I can multiply my hidden representation — the middle of my neural network — by another vector, and that gives me a score, and I put the score into the logistic function that we saw earlier to say: what's the probability this is a location? So at this point my classifier is a linear classifier in terms of this internal representation used right at the end, but it's a nonlinear classifier in terms of my word vectors.
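A numeric sketch of that forward pass, with random stand-in word vectors and the dimensions just mentioned (five 100-d vectors, an 8-unit hidden layer):

```python
import numpy as np

def sigmoid(z):
    """The logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Concatenate five 100-d word vectors (random stand-ins for the vectors
# of "museums in Paris are amazing") into one 500-d input.
x = rng.normal(size=5 * 100)

W = rng.normal(size=(8, 500)) * 0.05   # layer weights: 500-d -> 8-d
b = np.zeros(8)                        # bias vector
u = rng.normal(size=8)                 # scoring vector

h = sigmoid(W @ x + b)                 # 8-d hidden representation
score = u @ h                          # a single real-valued score
p_location = sigmoid(score)            # probability the center word is a location
```

The classifier `u @ h` is linear in `h`, the re-represented input, but composing it with the nonlinearity and the learned embeddings makes it nonlinear in the original words.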
[01:11:53] Okay, great. Here's one other thing — just a note for later, since you'll want to know this when we start doing the next assignments. Up until now I've presented everything as doing log likelihood and negative log likelihood for building our models. Very soon now, in assignment two, we're going to be starting to do things with PyTorch, and when you start working out your losses with PyTorch, what you're going to want to use is cross-entropy loss. So let me quickly say what cross-entropy loss is. Cross entropy comes from information theory: if you have a true probability distribution p and you're computing a probability distribution q, your cross-entropy loss is H(p, q) = −∑_c p(c) log q(c) — the expectation, under your true probability distribution, of the log of your model probability. But there's a special case: if you have ground truth (or gold, or target) data where things are labeled one/zero — like in my "I love Paris" examples, where I'm just labeling it probability one for location and probability zero for not-a-location — then, since you're labeling the right class with probability one, every other term in this summation goes to zero, and the only thing you're left with is what log probability your model gives to the right class.
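That reduction is easy to check numerically with toy distributions:

```python
import numpy as np

# Model distribution q over three classes, and a one-hot "ground truth"
# distribution p saying the correct class is class 1.
q = np.array([0.2, 0.7, 0.1])
p = np.array([0.0, 1.0, 0.0])

cross_entropy = -np.sum(p * np.log(q))
# With one-hot p every term but one vanishes, so the cross entropy is
# exactly -log q[1]: the negative log probability of the right class.
```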
zero, [01:13:40] and the only thing you're left with is: what probability, what log probability, is my model giving to the right class? And so that then is your log likelihood, which we can use for the negative log likelihood. A little bit of a complication here: just remember that you want to use cross-entropy loss in PyTorch when building the model. Okay, before we end today, here is my obligatory one picture of human neurons; don't miss it, because I'm not going to show any more of these. Okay, these are human neurons, and human neurons were the inspiration for neural networks. Human neurons have a single output, which comes down this axon, and then these outputs feed into other neurons (I guess I don't really have an example here), but in general one output can feed into multiple different neurons.
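To make the cross-entropy discussion above concrete, here is a minimal sketch in plain Python (the logits and the two-class location/not-location setup are made up for illustration; in the assignment itself you would use PyTorch's built-in cross-entropy loss):

```python
import math

def softmax(logits):
    """Turn raw scores into a model probability distribution q."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i): the expectation, under the true
    distribution p, of the negative log model probability."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one window, two classes: [not-location, location].
q = softmax([0.5, 2.0])
p = [0.0, 1.0]          # one-hot ground truth: the word is a location

# With a one-hot p, every other term in the sum goes to zero, so the
# cross entropy is just the negative log probability of the right class:
assert abs(cross_entropy(p, q) - (-math.log(q[1]))) < 1e-12
print(cross_entropy(p, q))
```

In PyTorch, `torch.nn.functional.cross_entropy` computes the same quantity directly from logits and a gold class index, combining the softmax and the negative log likelihood in one call.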
[01:14:55] You can see the different things hanging into it: you have the output connecting to the input, and where you make this connection, that's the synapses that people talk about. And so one neuron will normally have many, many inputs, where it picks things up from other neurons; they all go into the nucleus of the cell, and the nucleus combines together all those inputs. And kind of what happens is, if there's enough positive activation from all of these inputs, it then sends signals down its output. Now, strictly, how neurons work is that they send spikes, so the level of activation of a neuron is its rate of spiking, but that immediately got turned, in artificial neural networks, into just a real value for what its level of activation is. And so this was kind of the genuine inspiration for all of our neural networks. So a binary
logistic regression is kind of a bit [01:16:05] similar to a neuron, right? It has multiple inputs; you're working out your total level of excitation, where in particular you can have inputs that are exciting (positive inputs) and inputs that are negative, which are then inhibitory inputs. You combine them all together and you get an output that's your level of excitation, and you're then converting that through some nonlinearity. And so this was proposed as a very simple model of human neurons. Now, human neurons are way more complex than this, and some people, like neuroscientists, think we maybe should be doing a better model of actual human neurons, but in terms of what's being done in the current neural-networks-eat-the-world revolution, everyone's forgotten about that and is just sticking with this very, very simple model, which conveniently turns
into linear algebra in a very [01:17:11] simple way. So this gives us a single neuron, but made precise: this single neuron, if you use the logistic function, is identical to logistic regression, which you've probably seen in some stats class or somewhere. But the difference is that for neural networks we don't just have one logistic regression; we have a bunch of logistic regressions at once. And, well, that would be tricky if we had to define what each of these logistic regressions was calculating, but what we do is we just feed them into another logistic regression, and so we have some eventual output that we want to be something like, we want it to say, you know, this is or isn't a location. But then what will happen is, by our machine learning, these intermediate logistic regressions will figure out all by themselves something
useful to do, that's the magic, [01:18:22] right, so that you get this sort of self-learning property where the model has a lot of parameters and internally will work out useful things to do. So in general we can get more magic by having more layers in the neural network, and with that we will build up functions: effectively, these intermediate layers let us learn a model that represents the input data in ways that will make it easier to classify, or easier to interpret and do things with downstream in our neural network. And it's time, so I should stop there. Thank you.

================================================================================ LECTURE 003 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 3 - Backpropagation, Neural Network
Source: https://www.youtube.com/watch?v=HnliVHU2g9U
--- Transcript

[00:00:05] Okay, hi everyone, I'll get started. Okay, so it's Tuesday of week two, so hopefully that means everyone has done assignment one. Everyone done assignment one? You know, if I'm saying
this, I'm probably saying it to the [00:00:23] wrong people, but it seems like every year some people blow some of their late days on assignment one, and it's really just the wrong place to use them. So yeah, hopefully you've all done assignment one, and note that this is meant to be the easy on-ramp, and then we go straight on from that. So out today we have assignment two. Assignment two has two purposes. Purpose one is to make you do some math, to gain some understanding of what neural networks really compute and how they compute it, and that's what I'm going to talk about today, also going through that math. But then, simultaneously, maybe it does three things: in assignment two we're also going to be learning something about dependency parsing, which will be actually something about language structure and linguistics, but
then thirdly, for [00:01:27] assignment two we're going to start using PyTorch. So PyTorch is one of the leading software frameworks for deep learning, and the one that we're going to use for this class. So for the assignment, the PyTorch is exceedingly scaffolded; it's sort of, you know, here's this thing and you have to write these two lines, use these two functions. But nevertheless, to help people get up to speed and get started using PyTorch, on Friday at 3:30 in Gates B01, and it will again be recorded, we have a tutorial on PyTorch, and that's a great way to get more of a sense of PyTorch and how it works before doing assignment two. Yeah, the other things: for nearly all the lectures we've got further reading of places that you can look, and of all the classes in the entire quarter, this, for many people, might be a really good one to look at the suggested
to look at the suggested readings we have several readings which [00:02:39] readings we have several readings which are sort of shorter tutorials and [00:02:42] are sort of shorter tutorials and reviews of the kind of um Matrix [00:02:45] reviews of the kind of um Matrix calculus um and linear algebra that we [00:02:48] calculus um and linear algebra that we need for this class um so really [00:02:51] need for this class um so really encourage you um to look at those um if [00:02:54] encourage you um to look at those um if you decide that one is your favorite you [00:02:56] you decide that one is your favorite you can tell us on Ed which one you think is [00:02:58] can tell us on Ed which one you think is the best one to choose between between [00:03:00] the best one to choose between between them I kind of like the one that's first [00:03:02] them I kind of like the one that's first on the list but maybe you'll feel [00:03:03] on the list but maybe you'll feel differently um yeah um conversely um [00:03:08] differently um yeah um conversely um yeah so today will be sort of all math [00:03:11] yeah so today will be sort of all math and then Thursday will be kind of all [00:03:15] and then Thursday will be kind of all language and Linguistics some people [00:03:17] language and Linguistics some people find the language and Linguistics hard [00:03:18] find the language and Linguistics hard as well um so I guess different kinds of [00:03:21] as well um so I guess different kinds of people um okay so getting straight into [00:03:25] people um okay so getting straight into it um so where we started last time um [00:03:30] it um so where we started last time um I'd sort of shown these baby neural [00:03:32] I'd sort of shown these baby neural networks and sort of said well we can [00:03:34] networks and sort of said well we can think of each of those orange things as [00:03:37] think of each of those orange things as basically like a little logistic [00:03:39] basically 
regression unit, and the crucial [00:03:42] difference from the kind of statistics and machine learning you see in a stats class, 109 or wherever, is that in those you have one logistic regression: you're defining the input features to it, and you've got some decision variable that you want to have at the output. Here you're sort of building these cascades of little logistic regressions, and so the idea is that right at the end we're going to define what we want; we're going to capture that by our objective function or loss function. But the stuff in the middle is going to be a chance for the neural network to learn by itself what would be useful inputs to further downstream neurons: what kind of functions should I come up with, in terms of my inputs, that will help me provide useful outputs to help the final computation down the track.
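That cascade of little logistic regressions can be sketched numerically; this is an illustrative toy with hypothetical, randomly initialized weights (nothing here is the lecture's actual model): a few logistic units read the input features, and one more logistic unit reads their outputs to make the final location decision.

```python
import numpy as np

def sigmoid(z):
    # the logistic function: squashes total excitation into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input features for one window

# "a bunch of logistic regressions at once": three intermediate units,
# each with its own (hypothetical, randomly initialized) weights and bias
W = rng.normal(size=(3, 4))
b = np.zeros(3)
h = sigmoid(W @ x + b)        # what the intermediate units compute

# ...and their outputs feed into one more logistic regression that
# produces the final is-this-a-location probability
u = rng.normal(size=3)
p_location = sigmoid(u @ h)

print(h, p_location)
```

Nothing tells the intermediate units what to compute; in training, gradients from the loss at the final output are what push them toward computing something useful.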
And you [00:04:49] know, if you haven't sort of seen and thought about this much before, I think it's worth sitting with that idea for a moment, because this is really a super powerful idea, which is what's made neural networks more powerful in most circumstances than other forms of machine learning: the fact that you have this self-organization of intermediate levels of representation that you use to compute things that will be useful downstream for what you eventually want to do. The other reason I was bringing back up this picture is that I wanted to go straight from here to matrices. So while you could sort of wire together neurons however you wanted to, and arguably if you look at human brains they look more like neurons wired together however you wanted to, for what's done with neural networks there's basically always this
kind of regular structure of layers. [00:05:51] So once we have this regular structure of layers, we are taking the outputs of our neurons at one layer and we're feeding them together, with weights, to produce the inputs to the next layer. So we're taking the x1, x2, x3 outputs, we're multiplying them all by weights, we're adding a bias term, and then we're going to put it through a nonlinearity, and that will give us the value at the next layer. So if we then kind of collapse that to a vector, and this to a vector, that collapses into a computation where first of all we're doing a matrix multiplication, we're calculating Wx of the inputs, and then we're adding on the biases as a vector of biases, which gives us this intermediate value z, and then we have this nonlinearity, or activation function, which is applied to that, which gives us the values in the next layer of the neural
network. [00:07:01] And the activation function is applied to a vector and produces a vector, but it's operating on each of the individual components of that vector, one at a time; we've got some scalar function that we're just applying to each element of the vector. And so that's the kind of picture we saw when I did this example, and I'm going to continue to use this example in today's class. Remember, we were going to decide whether the word in the middle of the input window was a location or not, and so we were doing the matrix multiplication, putting it through the nonlinearity, we're then just doing a dot product here, and then that got stuck into a sigmoid to predict yes or no. And the final thing I wanted to say a little bit about is these f's, the nonlinearity or the activation function, and where did they come in?
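The layer computation just described (matrix multiply, add the bias vector, apply the nonlinearity elementwise) can be sketched like this; the layer sizes here are arbitrary choices for illustration:

```python
import numpy as np

def f(z):
    # elementwise nonlinearity: some scalar function applied to each element
    return np.tanh(z)

rng = np.random.default_rng(1)
x = rng.normal(size=3)       # outputs x1, x2, x3 of the previous layer
W = rng.normal(size=(4, 3))  # one row of weights per unit in the next layer
b = rng.normal(size=4)       # the vector of biases

z = W @ x + b                # matrix multiplication plus biases: intermediate value z
h = f(z)                     # activation function applied to the vector...

# ...but operating on each individual component, one at a time:
assert all(h[i] == f(z[i]) for i in range(4))
print(h.shape)               # (4,): the values at the next layer
```

The same two lines, `z = W @ x + b` and `h = f(z)`, are the whole forward computation for one layer, which is why the layered structure turns into linear algebra so conveniently.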
Well, the [00:08:04] starting point of where they came in, in the history of neural networks, is when people came up with this idea that you could represent the operation of a basic neuron by doing a matrix multiplication of the inputs and then having a bias term, or here a threshold term, to see whether the neuron should fire or not. That was actually, in the very first implementation, which dates back to the 1940s, done as a threshold: if the activation was greater than theta, you output one; otherwise you output zero. And, well, if you have a threshold, the two lines are flat, so there is no slope, there is no gradient, and that actually makes learning much harder. So the whole secret of what we build with neural networks, and an alternative name that's popular in some circles these days, is gradient-based learning, and the entire idea of gradient-based learning
is that if we [00:09:17] actually have some slopes, then it's like going skiing during spring break: you can work out where it's steeper and you can head down where it's steeper, and that will allow us to optimize our function and learn much more quickly. And so that's one reason that we don't just want to have threshold units: we want to have things with slopes, so we have gradients. So in subsequent work people started using activation functions with slopes, and the first popular one was this sigmoidal logistic that we've seen for mapping to probabilities. But, you know, it seemed sort of imperfect, because the output was always non-negative, so that sort of tends to push things towards bigger numbers. So there was then quite a bit of use of this tanh function, and you'll actually see tanh when we do assignment three; we'll be using
tanhs in our recurrent neural networks. [00:10:27] And so I've written there the formula usually given for tanh in terms of exponentials. Yeah, if your math is rusty, it's not obvious that tanh and the logistic have much to do with each other, but if you want to treat this as a math problem, a tanh is literally just a rescaled logistic: you're stretching it by two and moving it down by one; it's the same function. Okay, so that's nice, but if you're calculating tanhs you have to do all of these exponentials, and, you know, exponentials are kind of slow on your computer, and things like that, so you might wonder whether you couldn't get away with something much cheaper. And so people thought about that and thought, oh, maybe we could just use a so-called hard tanh, where it has a slope of one in the middle and is then just flat outside that area.
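That rescaling claim is easy to verify numerically: tanh(x) = 2 * logistic(2x) - 1, i.e. a logistic stretched by two and shifted down by one. A quick check using the standard formulas:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_exponentials(x):
    # the formula usually given for tanh in terms of exponentials
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    # rescaled logistic: stretch by two, move down by one
    stretched = 2.0 * logistic(2.0 * x) - 1.0
    assert abs(tanh_via_exponentials(x) - stretched) < 1e-12
    assert abs(math.tanh(x) - stretched) < 1e-12
print("tanh is a rescaled logistic")
```

The identity follows by multiplying the logistic's numerator and denominator through by e^x, so the agreement above is exact up to floating-point rounding.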
And you know, that [00:11:28] seemed to work in many cases, and so that then led to the popularity of the rectified linear unit. So the rectified linear unit is simply zero on the negative region and then is y = x in the positive region. Now, this seems kind of wonky and goes against what I was saying about gradient-based learning, because once you're in the negative region there's no gradient, you're just dead. But in the positive region there is gradient, and the gradient is particularly simple: the slope is always one. And so, you know, this still feels slightly perverse to me, but this really became the norm of what people used for a number of years, because people found that although an individual neuron was dead half the time, any time it went negative, overall for your neural network some things would be alive, so it kind of gave sort of a form of
specialization. And the fact that the slope was always one [00:12:32] meant that you got really easy, productive backward flow of gradients, in a way we'll talk about later, and so learning with ReLU turned out to be very effective, and people started using the ReLU nonlinearity everywhere, and it sort of became the default and the norm. You'll see us using it in the assignments; in particular we use it in assignment two, and so you get to see that it works. But nevertheless, at some point people sort of had second thoughts and decided, you know, having a unit dead over half of its range maybe isn't such a good idea after all, even though it seemed to work great for a few years. And so a lot of what's happened since then is to come up with other functions which are in some sense ReLU-like but not actually dead.
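For reference, here are sketches of ReLU and the standard definitions of some "ReLU-like but not dead" variants the lecture goes on to name; the 0.01 leaky slope is just a common default, and the Swish and GELU formulas below are the usual textbook ones, not taken from the slides:

```python
import math

def relu(x):
    # zero on the negative region, y = x on the positive region
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # the negative half is a straight line too, with a very minor slope,
    # so the gradient is never exactly zero (parametric ReLU learns this slope)
    return x if x > 0.0 else negative_slope * x

def swish(x):
    # Swish: x times the logistic sigmoid of x
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # GELU: x times the Gaussian CDF of x; approximately y = x for positive
    # inputs, with a small curve below zero instead of a hard cutoff
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(relu(-3.0), relu(2.0))   # 0.0 2.0
print(leaky_relu(-3.0))        # a small negative value instead of a dead zero
```

All four agree that large positive inputs pass through essentially unchanged; they differ only in how they treat the negative region.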
[00:13:36] enough so one one version of that is the socalled Leaky value so for the Leaky [00:13:39] socalled Leaky value so for the Leaky value you make the negative half a [00:13:42] value you make the negative half a straight line as well with a very minor [00:13:44] straight line as well with a very minor slope but still it's got a little bit of [00:13:46] slope but still it's got a little bit of slope um there is then a variant of that [00:13:48] slope um there is then a variant of that called the parametric value where you [00:13:51] called the parametric value where you have one extra parameter which is [00:13:53] have one extra parameter which is actually what the slope of the ne the [00:13:55] actually what the slope of the ne the negative part is and people showed some [00:13:58] negative part is and people showed some positive result with that um more [00:14:00] positive result with that um more recently again and this is what you [00:14:03] recently again and this is what you often see in recent Transformer models [00:14:06] often see in recent Transformer models um is you see um nonlinearities like [00:14:10] um is you see um nonlinearities like Swiss swis and Jello so both of these [00:14:14] Swiss swis and Jello so both of these are sort of fancy functions but kind of [00:14:17] are sort of fancy functions but kind of what they both look like is basically [00:14:20] what they both look like is basically this is yal X to all intense and [00:14:23] this is yal X to all intense and purposes not quite but approximately and [00:14:25] purposes not quite but approximately and then you got sort of some funky bit of [00:14:27] then you got sort of some funky bit of curve down here which again gives you a [00:14:29] curve down here which again gives you a bit of slope um it's sort of the curve [00:14:31] bit of slope um it's sort of the curve is going the opposite way that's sort of [00:14:33] is going the opposite way that's sort of a bit funny but they seem 
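None of this appears as code in the lecture, but a minimal NumPy sketch of the activations just described might look like the following (the GELU here uses the common tanh approximation; the 0.044715 constant is from that standard approximation, not from the lecture):

```python
import numpy as np

def relu(x):
    # zero on the negative region, y = x on the positive region
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # negative half is a straight line with a very minor slope
    return np.where(x > 0, x, slope * x)

def gelu(x):
    # tanh approximation of GELU: roughly y = x for large positive x,
    # with a small "funky bit of curve" near zero
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives clamped to zero
print(leaky_relu(x))  # negatives scaled by 0.01
print(gelu(x))        # smooth, ReLU-like
```

The slope parameter of `leaky_relu` is exactly the one extra parameter that the parametric ReLU learns rather than fixes.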
[00:14:39] So, you know, that's a bit of a dump of all the nonlinearities people use. The details of that aren't super important right now, but the important thing to have in your head is why we need nonlinearities, and the way to think about that is that what we're doing with neural networks is function approximation. There's some very complex function that we want to learn, you know, like maybe we want to go from a piece of text to its meaning, or we want to be interpreting visual scenes, or something like that. And so we want to build really good function approximators. Well, if you're just doing matrix multiplies, a matrix multiply of a vector is a linear transform, so that doesn't let you model complex functions. I guess strictly, if you put a bias on the end, it's then an affine transform, but let's keep it simple: linear transforms. So if you're doing multiple matrix multiplies, you're doing multiple linear transforms, but they compose, so you could have just multiplied those two matrices together and you'd have a single linear transform. So you get no power in terms of representation by having multi-layer networks that are just matrix multiplies. As a little aside: in terms of representational power, having multi-layer matrix multiplies gives you no power, but if you think about it in terms of learning, it actually does give you some power. So in the theoretical community looking at neural networks, there are actually quite a few papers that look at linear neural networks, meaning that they're just sequences of matrix multiplies with no nonlinearities, because they have interesting learning properties even though they give you no representational power.
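The point that stacked matrix multiplies collapse into a single linear transform is easy to verify numerically; a small sketch (my own example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))  # first "layer"
W2 = rng.standard_normal((2, 4))  # second "layer"
x = rng.standard_normal(3)

# two layers of pure matrix multiplies...
two_layer = W2 @ (W1 @ x)
# ...are the same map as the single matrix W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True: no extra representational power
```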
[00:16:35] Okay, but we'd like to be able to learn functions like this, not only functions like this, and to be able to learn functions like this we need more than linear transforms. We achieve that by having something that makes us calculate a nonlinear function, and it's these activation functions that give us nonlinear functions. Okay, cool. So then, getting on to today: the whole thing we want to do now is gradient-based learning. This is our stochastic gradient descent equation, where, you know, that upside-down triangle symbol, that's our gradient; we're wanting to work out the slope of our objective function, and this is how we're going to learn, by calculating gradients. So what we want to know is how do we calculate the gradients for an arbitrary function, and what I want to do today is first of all do this by hand, with math, and then discuss how we do it computationally, which is effectively the famous thing that's taken as underpinning all of neural nets, the backpropagation algorithm. But the backpropagation algorithm is just automating the math. And so for the math, it's matrix calculus, and at this point there's a huge spectrum between people who know much more math than me and people who barely ever learned this. But, you know, I hope to explain the essentials, or remind people of them, enough that you're at least at a starting point for reading some other stuff and doing homework two. So let's get into that. I'm going to spend about half the time on each of those two halves, and the hope is that after this you'll feel like, oh, I actually understand how neural networks work under the hood, fingers crossed.
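As an illustration of the stochastic gradient descent update he's pointing at (theta ← theta − alpha · ∇J(theta)), here is a toy sketch on an objective J(theta) = theta² that I've picked for illustration; it is not the lecture's example:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # new parameters = old parameters minus learning rate times the gradient
    return theta - lr * grad

# minimize J(theta) = theta^2, whose gradient is 2 * theta
theta = np.array(3.0)
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)
print(theta)  # close to 0, the minimum of J
```

Everything that follows in the lecture is about how to compute that gradient term for functions far more complicated than theta².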
[00:18:43] Okay, so here we go. If you're a Stanford student, you maybe did Math 51, or else you could have done Math 51, which teaches linear algebra, multivariate calculus, and modern applications. Math 51 covers everything I'm going to talk about and way more, so if you actually know that and remember it, you can look at Instagram for the next 35 minutes. But I think the problem is, quite apart from the fact that a lot of people take it as frosh, that this is a lot to get through in 10 weeks, and I think a lot of the people who do this class don't, two years later, really have much ability to use any of it. But, you know, if you actually looked at this book really hard for a really long time, you would have discovered that right towards the end of the book, in appendix G, there's actually an appendix on neural networks and the multivariable chain rule, which is precisely what we're going to be using for doing our neural networks. But there are only two problems. One problem is that this is page 697 of the book, and I'm not sure anyone ever gets that far. And the other problem is, even if you do get that far, I find that these are really dense, texty pages; it's not even easy to understand them if you have gone there. So here's my attempt at that. The mantra to have in your head is: gee, if I can remember basic single-variable calculus, you know, that if I've got 3x² then the derivative of that is 6x, that's all you sort of need to know. Multivariable calculus is just like single-variable calculus, except you're using matrices. Okay, so that's our article of faith, and we're going to do that.
[00:20:41] So what we're wanting to do is matrix calculus, or the generalization of that, tensor calculus, using vectors, matrices, and higher-order tensors, because if we can do things with what's referred to as vectorized gradients in the neural network world, that will be the fast, efficient way to do our operations. You know, if you want to think it all through, you can do it a single variable at a time and check that you're doing the right thing, and I sort of tried to indicate that in the first lecture, but if we want to have our networks go vroom, we want to be doing matrix calculus. Okay, so let's work up to doing that. This is the part that I trust everyone can remember: we have f(x) = x³, and we can take the single-variable derivative, and the derivative is 3x². Everyone remember that one? Okay, that's something we can all start from. And remember, this derivative is giving the slope of things, right? So the slope lets us work out where something is steep, so we'll be able to go skiing; that's our goal. And you can think of the slope as how much the output will change if we change the input a bit; that's our measure of steepness. So since the derivative is 3x², if we're at x = 1 that means the slope is about 3 · 1² = 3, so if I work out the value of the function at 1.01, it's gone up by about three times the step: I moved x by 0.01 and the output moved by about 0.03. Whereas if I go to x = 4, the derivative is 3 · 4² = 48, and so if I work out the value of the function at 4.01 I get approximately 64.48 versus 64. That small difference from 4 to 4.01 has been magnified 48 times in the output. Okay, so now we just remember the mantra: it's going to be exactly the same single-variable calculus, but with more stuff.
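The slopes just quoted can be checked with finite differences; a quick sketch of those numbers:

```python
f = lambda x: x ** 3
df = lambda x: 3 * x ** 2  # the derivative from single-variable calculus

# slope at x = 1: nudging the input by 0.01 moves the output by about 0.03
print(f(1.01) - f(1.0))   # ~0.0303
print(df(1.0) * 0.01)     # 0.03

# slope at x = 4: the same 0.01 nudge is magnified about 48 times
print(f(4.01) - f(4.0))   # ~0.4812
print(df(4.0) * 0.01)     # 0.48
```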
[00:23:08] So if we have a function with n inputs, we're then going to work out its gradient, which is its partial derivative with respect to each input, so its gradient will now be a vector of the same size as the number of inputs. And there's this funky symbol ∂, which people pronounce various ways; I mean, this kind of originated as someone's weird way of drawing a calligraphic d, so it really is a d. I think I'll mainly just call it d, but sometimes people call it "partial" or "funky d" or some other name. So you have ∂f/∂x₁, ∂f/∂x₂, and so on for each of the variables. Okay, so if we go beyond that and have a function with n inputs and m outputs, what we then get for the gradient is what's referred to as the Jacobian. Now actually, the dude this is named after was a German Jew, so it should really be "Yakobi", but no one says that in this country: Jacobian. Okay, so the Jacobian is then a matrix of partial derivatives, where you're working out, for each output and each input, the partial derivative between that component of the input and the output. So this looks like the kind of thing that we're going to have when we have a neural network layer, because we're going to have n inputs and m outputs for the layers of our neural networks, so we'll be using these kinds of Jacobians. Okay, so then the whole idea of neural networks is we've got these multi-level computations, and they're going to correspond to composition of functions, so we need to know how to compose things, both for calculating functions and for calculating their gradients. So if we have a one-variable function and we want to work out its derivative in terms of a composition of two functions, what we're doing is multiplying derivatives. Okay, so if you compose together z of y... that's the function that we did at the beginning... oh, no, it's not, sorry, it's different. Okay, z of y gives you 3x², and we know that the derivative of that is 6x. If we do it in terms of the pieces, we can work out dz/dy, which is just going to be 3, and dy/dx, which is 2x, and we can work out the total derivative by multiplying these two pieces, and we get 6x, the same answer. So matrix calculus is exactly like single-variable calculus, except we're using tensors of different dimensions. So the word "tensor" is used to mean, as you go up that spectrum in size, from scalar to vector to matrix to what in computer science are normally still called multidimensional arrays; that spectrum is tensors of different dimensions.
[00:26:54] Okay, so when we have multiple-variable functions, we're going to multiply Jacobians. So here we have a function Wx + b, and then we compose the nonlinearity f with it to get h, and we're going to be able to compute the derivative in the same way, as a product of those partial derivatives, which are Jacobians. Okay, so let's start looking at a few examples of what we get. Let's start with an element-wise activation function. When we have a vector that's being calculated as the activation function of a previously computed quantity, we're computing that component-wise, as I explained before, so h_i = f(z_i), where f is our activation function, which actually applies to a scalar. But overall this layer is a function with n outputs and n inputs, and so it's going to have an n × n Jacobian. This is our definition of the Jacobian, but in this case it's sort of a special case, because if i = j, then the output h_j depends on z_j, and otherwise the entry is going to be zero, because for the off-diagonal entries it doesn't matter how you change the value; it's not changing the output, because each output only depends on the corresponding index. And so what we're going to get for this Jacobian of an activation function is a matrix where everything is zero apart from the diagonal terms, which correspond to where we're calculating the activation function, and for those ones we're going to have to work out how to compute the derivative of our activation function.
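A sketch of this diagonal Jacobian, using tanh as the element-wise activation (so as not to give away the logistic derivative that the assignment asks for; tanh′(z) = 1 − tanh²(z) is a standard result), checked against finite differences:

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
n = z.size

# Jacobian of h = f(z) applied element-wise, with f = tanh:
# zero everywhere except the diagonal, which holds f'(z_i)
jacobian = np.diag(1.0 - np.tanh(z) ** 2)

# numerical check: perturb each input component one at a time
eps = 1e-6
numeric = np.zeros((n, n))
for j in range(n):
    dz = np.zeros(n)
    dz[j] = eps
    numeric[:, j] = (np.tanh(z + dz) - np.tanh(z)) / eps

print(np.allclose(jacobian, numeric, atol=1e-4))  # True
```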
[00:29:05] That was one of the questions on assignment one, I do believe... or was it on assignment two? No, it's assignment two. One of the questions on the new assignment says, hey, can you work out the derivative of a logistic function? And then we'd be able to plug that straight in for f′, so I'm not going to give that answer away today. Okay, so other things that we want to do with Jacobians: we have this layer of our neural network where we're calculating Wx + b, and we want to work out the partial derivative of that with respect to x. You know, this is the kind of place where it actually works to remember the mantra and say matrix calculus is just like single-variable calculus but with matrices. So if you just don't use your brain too hard and think, oh, it's just like single-variable calculus, what should the answer be? It's obviously going to be W, and indeed it is. Similarly, if we want to take Wx + b and work out the partial derivative with respect to b, well, that would be 1 in terms of single-variable calculus, and so in matrix calculus that becomes an identity matrix. Slightly different, same idea, but that's reflecting the fact that b is actually a vector, so we need it to come out as an identity matrix. Okay, so higher up in my example picture I did this sort of vector dot product with uᵀ, and, well, what happens if we work out the Jacobian of that? What we end up with, strictly, is hᵀ. And this is sort of like what we did in the first class, when we did a dot product calculation: for each individual element you get the opposite term, and so you get the other vector coming out.
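These three Jacobians, ∂(Wx+b)/∂x = W, ∂(Wx+b)/∂b = I, and ∂(uᵀh)/∂u = hᵀ, are exactly the ones he suggests checking; a finite-difference sketch with made-up shapes (my own example):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
u = rng.standard_normal(3)
h = rng.standard_normal(3)
eps = 1e-6

# d(Wx + b)/dx should be W (a 3x4 Jacobian)
J_x = np.zeros((3, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J_x[:, j] = ((W @ (x + dx) + b) - (W @ x + b)) / eps
print(np.allclose(J_x, W, atol=1e-4))          # True

# d(Wx + b)/db should be the identity matrix
J_b = np.zeros((3, 3))
for j in range(3):
    db = np.zeros(3)
    db[j] = eps
    J_b[:, j] = ((W @ x + (b + db)) - (W @ x + b)) / eps
print(np.allclose(J_b, np.eye(3), atol=1e-4))  # True

# d(u^T h)/du: each partial is the matching element of h,
# so the Jacobian (a row vector) is h^T, "the other vector"
g = np.zeros(3)
for j in range(3):
    du = np.zeros(3)
    du[j] = eps
    g[j] = ((u + du) @ h - u @ h) / eps
print(np.allclose(g, h, atol=1e-4))            # True
```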
[00:31:21] These are sort of good ones to compute at home for practice, to make sure you really do know the answers and why they work out the way they do.

[00:31:30] Okay, so let's go back to our little neural net. This was most of our neural net up above — there was the nonlinearity. Now I'm going to leave that out this time — oh, see, I got it wrong, it's on assignment two. You know, normally you'd be calculating the partials of the output, the loss function, with respect to the inputs, but since the loss function is on assignment two, I'm just going to calculate derivatives with respect to this score that feeds into the loss function. So first we've got the neural network layer, the nonlinearity, and then we're doing this dot product to work out a score for each position, which feeds into the logistic function.

[00:32:24] So if you want to work out ds/db — that's with respect to the bias first — the way we do it is, you know, we break up our equations into the individual pieces that are composed together. So we first calculate z = Wx + b, then we apply the activation function to the different components. Okay, then after that, to work out our partial derivatives of s with respect to b, what we're going to be doing is taking the product of the partial derivatives of the component pieces — we're applying the matrix calculus version of the chain rule. So ds/db = ds/dh · dh/dz · dz/db, which corresponds to these three layers that are composed together. And so at that point we remember our useful Jacobians from the previous slide, and we can just apply them.

[00:33:45] So the top one, ds/dh, is u^T — or else maybe it's u; let's come back to that, there's a fine point there that I'll explain more about later. Okay, then for dh/dz, that was the activation function, where we got the diagonal of the derivative of f(z). And then for dz/db, that's where we got the identity matrix. Okay, so we can simplify that down, and what that's going to end up as is u^T times — that funny symbol there — the elementwise derivative of f. That symbol, which doesn't normally turn up in your regular math course but turns up all the time in neural networks, is referred to as the Hadamard product, and the Hadamard product means elementwise multiplication. So it's not like a dot product, where you put two vectors together and get out one number, a scalar: you put two vectors together, you elementwise multiply them, and you're left with another vector of the same type.
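The Hadamard product is just `*` on NumPy arrays. A minimal sketch of the simplification ds/db = u^T ∘ f′(z), checked against finite differences — the sigmoid here is my own stand-in for the nonlinearity, and the sizes are illustrative:

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid standing in for f
df = lambda z: f(z) * (1.0 - f(z))        # its elementwise derivative

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

score = lambda b: u @ f(W @ x + b)        # s as a function of the bias

# The simplified chain rule: ds/db = u (Hadamard) f'(z); `*` is elementwise.
grad_b = u * df(W @ x + b)

# Finite-difference check, one component at a time.
eps = 1e-6
fd = np.array([(score(b + eps * np.eye(n)[i]) - score(b)) / eps
               for i in range(n)])
assert np.allclose(grad_b, fd, atol=1e-4)
```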
[00:35:01] Okay, so now this gave us our working-out of the partials ds/db, and for a neural network we want to work out all the other partials as well. So overall, here in the picture, right, we had the x, the W, the b, and the u, and we'd like to work out partials with respect to all of those variables, so we can change their values and learn, so that our model predicts better.

[00:35:39] So suppose we now want to calculate ds/dW. Again we can split it up with the same chain rule and say ds/dW equals the product of these three things, and the important thing to notice is that two of those three things are exactly the same ones that we calculated before; the only bit that's different is that at the end we're now doing dz/dW rather than dz/db. And so the first central idea, that we'll come back to when we do computation graphs, is: oh, we really want to avoid doing repeated work. So we want to realize that those two parts are the same, and since we're just sort of multiplying these partial derivatives together, right, we can compute what that part is once and reuse it.

[00:36:36] So if we're wanting to calculate ds/dW, the part that's the same — this part here — we can refer to as delta. So delta is sort of the upstream gradient, or the error signal: the part that you've got from starting at the beginning, ds/dh · dh/dz. This shared upstream part we can calculate once, and then we can use it to calculate both of these two things. And for ds/db, because the dz/db just comes out as the identity matrix, the answer is just delta; but for ds/dW, we need to work out the dz/dW before we're finished.
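In code, the shared upstream part delta = u ∘ f′(z) is computed exactly once and reused — a sketch (sigmoid and sizes again chosen arbitrarily for illustration):

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))   # placeholder nonlinearity
df = lambda z: f(z) * (1.0 - f(z))

rng = np.random.default_rng(2)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

z = W @ x + b
h = f(z)
s = u @ h

# The shared upstream gradient ds/dh * dh/dz, computed once.
delta = u * df(z)

# dz/db is the identity, so ds/db is just delta -- no extra work needed.
grad_b = delta

assert grad_b.shape == b.shape
```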
[00:37:29] Okay, so what do we get for that last piece? So one question you might start off with — and it's normally a good thing to think about when you're doing assignment problems on this and other things — is: what should things look like? Should the answer be a vector, a matrix? What size should it be? So for ds/dW: W is an n × m matrix, and s is a scalar, so since we have one output and n × m inputs, the answer according to the math should be that we've got a 1 × nm Jacobian, i.e. a big long row vector.

[00:38:22] But here's where things get a teeny bit tricky, and we end up with this weird mess of math and engineering convenience. Because, you know, immediately what we're wanting to do is take our old parameters, which will be stored in the form of matrices, vectors, and so on, that we're using as coefficients, and we're going to want to subtract from them a fraction of our calculated gradient. So what we'd like is to have our calculated gradients in the same shapes as our parameters, because then we can just do subtraction; whereas if they've turned into a God Almighty row vector, that's not quite so convenient.

[00:39:17] So it turns out that what we end up doing is using something that gets referred to as the shape convention: we reshape our Jacobians so they fit into things that are of the same shape as the parameters that we're using. So we're going to represent ds/dW as an n × m matrix, laid out as follows — and that's a place where people can get confused. Okay, so that's what we want to calculate, that kind of matrix, and that matrix is going to be delta times dz/dW.

[00:40:03] So delta is going to be part of the answer, and then we want to know what dz/dW is. And the answer is going to come out like this: ds/dW is going to be delta^T x^T. So it's going to be the product of the upstream gradient — which was the same thing we calculated before for the other two quantities — and then a local input signal, which here comes out to x^T. And, you know, since we're taking the transposes of those two vectors, we end up calculating an outer product of those two vectors, which gives us our gradient. And why is that the right answer? Well, you know, it kind of looks convenient, 'cause it's giving us something of the right shape for what I was arguing we want to find out, and we have the right number of terms. Now, I'm going to rush through this, so I encourage you to read the lecture notes and do this more carefully.
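Continuing the same sketch: ds/dW = delta^T x^T is an outer product whose (i, j) entry is delta_i · x_j. Here's a finite-difference check on a single weight (with 0-based indexing, `W[1, 2]` plays the role of the lecture's w_23; sigmoid is still my placeholder for f):

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))
df = lambda z: f(z) * (1.0 - f(z))

rng = np.random.default_rng(3)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

score = lambda W: u @ f(W @ x + b)

delta = u * df(W @ x + b)     # upstream gradient, same as before
grad_W = np.outer(delta, x)   # ds/dW = delta^T x^T: an n x m matrix

# Finite-difference check on one entry, W[1, 2] (the lecture's "w_23"):
eps = 1e-6
dW = np.zeros_like(W)
dW[1, 2] = eps
fd = (score(W + dW) - score(W)) / eps
assert abs(grad_W[1, 2] - fd) < 1e-4
assert grad_W.shape == W.shape   # matches the shape convention
```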
[00:41:13] But let me at least explain a little bit why it makes sense. Right, so if you think of one weight: all of these connections are our matrix, right? The matrix is being represented by all these lines in your network. So if you think of one number in the matrix — here is w_23, so it's connecting from input 3, or it's multiplying input 3, to give part of the answer of h_2. Right, so it's this line here. And for this line here, this weight is being used only in the calculation of h_2, and the only thing it depends on is x_3. So if you're then wanting to work out the partial of h_2 — or z_2, sorry, yeah — the partial of z_2 with respect to w_23, it's sort of depending on these two pieces only, and that's what you're achieving by working out the outer product like that.

[00:42:35] Okay, so let me just come back one more time to this sort of question of the shape of derivatives. You know, I already sort of fudged it when I was talking about, oh, should I put the transpose there, or should I not and get a row vector versus a column vector. So there's sort of this disagreement between whether you have the Jacobian form, which is what actually makes the chain rule work, right, in terms of doing multiplication — versus the shape convention, which is how we store everything for our computations, and makes doing stochastic gradient descent, where you're subtracting whatever kind of tensor you have, easy.

[00:43:32] So, you know, this can be a source of confusion. Since we're doing a computer science course, for the answers in the assignment we expect you to follow the shape convention.
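The payoff of the shape convention is that a gradient step is a plain elementwise subtraction — a tiny sketch (the learning rate, shapes, and random gradient are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 4))        # parameter matrix
grad_W = rng.normal(size=(3, 4))   # gradient stored in the SAME shape
lr = 0.01

# In Jacobian form the gradient would be a 1 x 12 row vector, which has
# to be reshaped back to (3, 4) before the SGD update can happen.
jacobian_row = grad_W.reshape(1, -1)             # Jacobian form: 1 x (n*m)
W_new = W - lr * jacobian_row.reshape(W.shape)   # reshape, then subtract

assert np.allclose(W_new, W - lr * grad_W)       # same one-line update
```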
[00:43:46] If you're working out the derivatives with respect to some matrix, it should be shaped like a matrix, with the same shape as the parameters. But, you know, you may well want to think about Jacobian forms in computing your answers. I mean, there are sort of two ways to go about doing this. One way is to work out all the math using Jacobians, à la Math 51, and at the end just reshape it so it fits into the same shape as the parameters, according to our shape convention. I mean, the other way is to do each stage following the shape convention, but then you sort of have to be game to reshape things as needed, by doing transposes to have things work out at the different stages.

[00:44:33] Okay, that was my attempt to quickly review the math. Most people are still here — I will now go on to the second half, on how we do the computation.
[00:44:55] Right, so the famous thing that powers neural networks is the backpropagation algorithm. And the backpropagation algorithm is really only two things. You know, its invention made people famous because it gave an effective learning algorithm, but at a fundamental level the backpropagation algorithm is only two things. Thing one is: you use the chain rule — you do calculus of composed functions. And thing two is: you store intermediate results, so you never recompute the same stuff again. That's all there is to the backpropagation algorithm.

[00:45:40] And so let's just go through that. So if we're computationally wanting to deal with, you know, functions and doing backpropagation, we can think of them as being represented as a graph, and in some way or another this kind of graph is being used inside your neural network framework. So here is a re-representation of my little neural network for finding whether the word at the center is a location. So I'm taking the x vector input, I'm multiplying it by W, I'm adding b to it, I'm putting it through the nonlinearity, and then I'm doing the dot product with my vector u, right? So that was my computation.

[00:46:27] And so the source nodes are the inputs in this graph; the interior nodes are then the operations I do; and so then the edges that connect those together pass along the result of each operation. So I pass along Wx to the addition function with b; that gives me z, which I pass through the nonlinearity, which gives me h, which I then dot-product with u to get s. Okay, so I do precisely this computation, and this is referred to as forward propagation, or the forward pass, of a neural network. So the forward pass just calculates functions.
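That forward pass, step by step, with each edge passing one intermediate result along — a sketch (the window vector x, sizes, and sigmoid are placeholder choices, not from the lecture):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # some nonlinearity

rng = np.random.default_rng(5)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

# Forward propagation through the graph:
Wx = W @ x    # multiply by W
z = Wx + b    # add b
h = f(z)      # apply the nonlinearity
s = u @ h     # dot product with u -> a scalar score

assert np.ndim(s) == 0   # s is a single number
```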
[00:47:16] Okay, but then once we've done that, what we want to do is work out gradients, so we can do gradient-based learning, and so that part is then referred to as backpropagation, or the backward pass, and then we run things backward. So for running things backward, we're going to use the same graph, and we're going to pass gradients along it backwards. And so we start at the right-hand side, and we have ds/ds — and ds/ds is just one, because, you know, if you change s, you've changed s. And then what we want to do is sort of work further back, so we can work out ds/dh, ds/dz, ds/db, ds/dW, ds/dx as we work back. So this is what we want to work out with gradients.

[00:48:14] And so how are we going to do that? Well, if we look at a single node — so for example our nonlinearity node, but any node where h = f(z) — what we can have is an upstream gradient, ds/dh, and what we want to do is calculate the downstream gradient of the next variable down, ds/dz. And the way that we're going to do that is we're going to say: well, let's look at f — what is f's gradient? That's going to be our local gradient, and then this is immediately what gives us the chain rule: ds/dz is going to be the product of our upstream gradient ds/dh times dh/dz, the local gradient that we calculate at that node. So: downstream gradient equals upstream gradient times local gradient.

[00:49:21] Oh yeah — that's what it says when I press that again. Okay, so this is sort of the single-input, single-output case, though those inputs might be vectors or matrices or something like that. We then have sort of more complex graph cases — I think I should have retitled this slide — so, sorry, the next case is where our node might have multiple inputs.
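When a node has multiple inputs, each input gets the same upstream gradient times its own local gradient. A tiny sketch with a product node f = a · b (values chosen to match the worked example coming up):

```python
# A product node f = a * b, with an upstream gradient flowing in from above.
upstream = 1.0    # df/df at the output of this node

a, b = 3.0, 2.0
local_a = b       # d(a*b)/da = b
local_b = a       # d(a*b)/db = a

# Downstream = upstream x local, computed separately for each input.
downstream_a = upstream * local_a
downstream_b = upstream * local_b

assert (downstream_a, downstream_b) == (2.0, 3.0)
```

Note how the two downstream gradients "swap over" the input values — the gradient for a is b, and vice versa.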
case is for our node it might have [00:49:53] case is for our node it might have multiple inputs so this is where we're [00:49:55] multiple inputs so this is where we're calculating [00:49:57] calculating WX so in that case we still have an up [00:50:01] WX so in that case we still have an up we have a single Upstream gradient and [00:50:04] we have a single Upstream gradient and then what we're going to do is we want [00:50:07] then what we're going to do is we want to calculate the downstream gradient [00:50:10] to calculate the downstream gradient with respect to each input and the way [00:50:13] with respect to each input and the way we're going to do that is we're going to [00:50:15] we're going to do that is we're going to work out the local gradient with respect [00:50:18] work out the local gradient with respect to each input and then we're going to do [00:50:20] to each input and then we're going to do the same kind of multiplication of [00:50:23] the same kind of multiplication of Upstream gradient times local gradient [00:50:27] Upstream gradient times local gradient with respect to each input again um [00:50:30] with respect to each input again um chain [00:50:32] chain rule okay um so here's a little example [00:50:36] rule okay um so here's a little example of this so I'm this isn't really uh the [00:50:40] of this so I'm this isn't really uh the kind of thing you normally see in a [00:50:41] kind of thing you normally see in a neural network but it's an easy example [00:50:44] neural network but it's an easy example so F of XYZ is going to be x + y * the [00:50:48] so F of XYZ is going to be x + y * the max of y z and we've got current values [00:50:53] max of y z and we've got current values of X Y and Z of 1 2 and z respectively [00:50:57] of X Y and Z of 1 2 and z respectively so here's our little computation graph [00:51:00] so here's our little computation graph um and so for forward propagation you [00:51:03] um and so for forward propagation 
[00:51:05] you know, we're going to do this addition, we're going to do this max function, and then we're going to multiply the two, and that gives us the value of f. So we can run that with the current values of x, y and z, and this is what we get: the max of two and 0 is two, the addition is three, the answer is six. Okay, so then after having done that, we run the backward propagation. And yeah, this procedure, you know, is not actually special to neural networks, right? You can use it for any piece of math, if you want to just run your math on PyTorch rather than working it out in your head or with Mathematica. Okay, so now we work out backwards. We want to know the local gradients, so da/dz is going to be one... [00:51:58] sorry, I said that wrong: da/dx is going to be 1, since a = x + y, and da/dy = 1. For the max function, it's going to depend on which of the two is larger, because it's going to have a slope of one for the one that's the biggest and zero for the one that's the smallest. And then for the product, that's like what we saw with vectors: df/da is going to be b, and df/db is going to be a. So those are all our local gradients, and so then [00:52:32] we can use those to calculate out the derivatives. So df/df is one; we then multiply that by the two local gradients that are calculated for a and b, so that gives us two and three, where you're swapping over the numbers. Then for the max, the one that is biggest takes the upstream times one, so it gets three, and the other one gets zero. And then for the plus, we're just sending the gradient down in both directions, and so both of them come out as two. And so that gives us df/dx: the final value is two. For df/dy, we're taking the three and adding the two (I'll mention that again in a minute), which gives us five. And then df/dz is zero. And we should be able to quickly check that we've got this right, right? So [00:53:48] if we consider the slope around z: if we change z a little, say we make z 0.1, that makes absolutely no difference to what the computed function value is, so the gradient there is zero; that's correct. Then if I change up the top, if I change x a little bit, say I change x to 1.1, then I'll be calculating 1.1 + 2 is 3.1, and then I'll be taking the max, which is two, and I'll be calculating 5.1... and, wait, no, I did that wrong... oh, times two... wait, I didn't do the multiplication right, sorry. Yeah, so we get the 3.1, that's multiplied by two, and that gives us 6.2. So a change of 0.1 in x has moved things up by 0.2, and that corresponds to the gradient being two.
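The running example here, f(x, y, z) = (x + y) * max(y, z) at x = 1, y = 2, z = 0, can be written out directly. A minimal sketch in plain Python (my own variable names, not the lecture's slide code):

```python
# f(x, y, z) = (x + y) * max(y, z), evaluated at x = 1, y = 2, z = 0.
x, y, z = 1.0, 2.0, 0.0

# Forward pass: compute each intermediate node.
a = x + y        # a = 3
b = max(y, z)    # b = 2
f = a * b        # f = 6

# Backward pass: apply the chain rule node by node.
df_df = 1.0
df_da = b * df_df    # local gradient of a*b w.r.t. a is b  -> 2
df_db = a * df_df    # local gradient of a*b w.r.t. b is a  -> 3

# a = x + y: the plus distributes the upstream gradient.
df_dx = 1.0 * df_da                              # -> 2
df_dy_via_a = 1.0 * df_da                        # -> 2
# b = max(y, z): the max routes the gradient to the larger input.
df_dy_via_b = (1.0 if y > z else 0.0) * df_db    # -> 3
df_dz = (1.0 if z > y else 0.0) * df_db          # -> 0
# y feeds two branches, so its gradients add up.
df_dy = df_dy_via_a + df_dy_via_b                # -> 5

print(f, df_dx, df_dy, df_dz)    # 6.0 2.0 5.0 0.0
```

Nudging x to 1.1 reproduces the spot check from the lecture: (1.1 + 2) * 2 = 6.2, a change of 0.2 for a 0.1 step, matching df_dx = 2.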
[00:55:03] And so then the final case is: well, what if we change y? So y started off as two; say we made it 2.1. Then we're going to get 2.1 multiplied by... oh sorry, I keep doing this wrong: 2.1 + 1 = 3.1, and then we've got 2.1 as the max, so we've got 2.1 * 3.1, and that comes out to be 6.51. So our 0.1 difference has moved the output up by approximately 0.5 (this is just an estimate), and so that corresponds to the gradient being five, right? We get this five-times multiplication of our changes. [00:56:06] Okay, and so that illustrates the fact that the right thing to do, when you have outward branches in your computation graph and you're running the backpropagation, is that you sum the gradients. So for this case we had y going into these two different things in our previous chart, so once we've worked out the upstream gradients, we sum them to get the total gradient. And that's what we did back here: we had two outward things, and we took these calculated upstream gradients of two and three and we just summed them to get five, and that gave the right answer. [00:56:59] Okay, and so you can think about that just generally, for how gradients move around in these pictures. So when we have a plus operation, the plus just distributes gradient: the same upstream gradient goes to each input. When you have a max, it's kind of like a router of gradient: the max is going to send the gradient to one of the inputs and send nothing at all to the other inputs. And when you have a multiplication, it's a
little bit funky, because you're sort of doing this switching of the forward coefficients: you take the upstream gradient multiplied by the opposite forward coefficient, and that gives you your downstream gradient. [00:58:04] Okay, so we kind of have this systematic way of being able to forward-pass calculate the values of functions, then run this backwards to work out the gradients heading down the network. And so the main other thing about the backpropagation algorithm is just that we want to do this efficiently. So the wrong way to do it would be to say: well, gee, I want to calculate ds/db, ds/dW, ds/dx, ds/du, so let me start doing those one at a time, and when I've done them all I will stop. Because that means if you first calculated ds/db, you'd do all of the part that's in blue, but then if you went on to ds/dW, you'd be calculating all the part in red, and well, just as we saw in the [00:58:59] math part, when we were doing it as math, these parts are exactly the same; you're doing exactly the same computations. So you only want to do that part once, and work out this upstream gradient, or error signal, that is being calculated and then shared. So the picture that we want to have is: you do the shared part together, and then you only do separately the little bits that you need to do. [00:59:36] Okay... boy, I seem to have been rushing through today, and I'm going to actually end early unless anyone is going to slow me down, but I did have just a few more slides to go through. [00:59:49] Yeah, so the generalization of this as an algorithm: in the general case, we normally have these sort of neural network layers and matrices, which you can represent as vectors and matrices, and you know
it's sort of nice and clean, and it looks like doing that in calculus class. [01:00:13] But strictly speaking, that isn't necessary. The algorithm for forward propagation and backward propagation that I've outlined works on a completely arbitrary computation graph, providing it's a DAG, so it doesn't have cycles in it. [01:00:32] So the general algorithm is: well, you've got a whole bunch of variables that depend on other variables, and there's some way in which we can sort them so that each variable only depends on variables to the left of it; that's referred to as a topological sort. And so that means there's a way we can do a forward pass where we're calculating variables in terms of ones that have already been calculated. But if we want to have some extra wonky arcs, so it's not nice matrix multiplies or anything, we're totally allowed to do that; or we can have things not fully connected, right, so there are no connections across here. We can have an arbitrary computation graph, and that gives us our forward propagation. [01:01:24] And then once we've done the forward propagation, we can initialize the output gradient as one, and then we're going to visit the nodes in reverse order, and for each node we're going to compute a gradient by using the upstream gradient and the local gradient to compute the downstream gradient. And so then we can head back down the computation graph and work out all of the downstream gradients. [01:01:57] And the crucial thing to notice is that, if you do it correctly, working out the gradients has the same big-O complexity as working out the forward calculation.
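That recipe (topologically sort the nodes, run the forward pass in order, set the output gradient to one, then visit nodes in reverse, multiplying upstream by local gradients and summing over branches) can be sketched for an arbitrary DAG. This is a toy illustration with my own names (`Node`, `run_graph`), not any framework's actual API:

```python
class Node:
    """One node in a computation graph: a value computed from the values
    of its input nodes, plus local gradients with respect to each input."""
    def __init__(self, inputs, forward_fn, local_grad_fn):
        self.inputs = inputs                 # parent Nodes (empty for leaves)
        self.forward_fn = forward_fn         # input values -> this value
        self.local_grad_fn = local_grad_fn   # input values -> local gradients
        self.value = None
        self.grad = 0.0

def run_graph(output):
    # Topological sort: every node appears after the nodes it depends on.
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.inputs:
                visit(p)
            order.append(n)
    visit(output)

    # Forward pass: each node's inputs are already computed.
    for n in order:
        n.value = n.forward_fn(*[p.value for p in n.inputs])

    # Backward pass: reverse order; downstream = upstream * local,
    # accumulated (+=) over all outward branches.
    for n in order:
        n.grad = 0.0
    output.grad = 1.0
    for n in reversed(order):
        locals_ = n.local_grad_fn(*[p.value for p in n.inputs])
        for p, g in zip(n.inputs, locals_):
            p.grad += n.grad * g

# The same f = (x + y) * max(y, z) example at x=1, y=2, z=0:
x = Node([], lambda: 1.0, lambda: [])
y = Node([], lambda: 2.0, lambda: [])
z = Node([], lambda: 0.0, lambda: [])
a = Node([x, y], lambda u, v: u + v, lambda u, v: [1.0, 1.0])   # plus distributes
b = Node([y, z], lambda u, v: max(u, v),
         lambda u, v: [float(u >= v), float(u < v)])            # max routes
f = Node([a, b], lambda u, v: u * v, lambda u, v: [v, u])       # mult swaps
run_graph(f)
print(f.value, x.grad, y.grad, z.grad)   # 6.0 2.0 5.0 0.0
```

Because `y` feeds both `a` and `b`, its two upstream contributions (2 and 3) accumulate into 5, which is exactly the gradient-summing rule from earlier.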
[01:02:20] Right, you might have different functions depending on what the derivatives are, but in big-O terms, if you're doing more work in the backward pass than you're doing in the forward pass, that means you're somehow failing to do this efficient computation and you're recomputing some of your work. [01:02:41] Okay, so because we have such a good algorithm here, you should be able to just work out the backward pass automatically, and that gets referred to as automatic differentiation. [01:02:53] So if you had the symbolic form of what you're calculating with your forward pass, you should just be able to say: yo, computer, can you work out the backward pass for me? And, you know, kind of mathematically, it could look at the symbolic form of all of your functions, work out their derivatives, and do the entire thing for you. [01:03:26] So early on there was a pioneering deep learning framework, Theano, principally from the University of Montreal, which attempted to do precisely that: you had the entire forward-pass computation stated in symbolic form, and it just did the entire thing for you and worked out the backward pass automatically. But, you know, somehow that sort of proved to be too heavyweight, or hard to deal with for different things, or people just liked to write their own Python or whatever it is, so that idea did not fully succeed. [01:04:08] And what in practice all of the current main frameworks have fallen back on is something that's actually less automated than that. So it's sort of like we've gone backwards in time, but the software's gotten a lot better; really, it's a lot stabler and faster. So all of the modern deep learning frameworks sort of say: look, I will manage the computation graph for you, and I can run the forward propagation pass and
the backward propagation pass, but you're going to have to work out the local derivatives yourself. [01:04:47] So if you're putting in a layer, or putting in, say, a function like an activation function in a neural network, then your Python class that represents that is going to have to tell me what the forward computation is and what the local gradient is, and I'm just going to call your local gradient and assume it's correct. [01:05:16] So there's a bit more that has to be done manually. So the part that's automated, then, is that (not precisely this code, obviously, but roughly) inside the deep learning software it's computing with a computation graph, and it's got a forward and a backward, and it's doing what I presented in the pictures before. [01:05:43] So for the forward pass, it's topologically sorting all the nodes of the graph, and then it's going through them, and for each node in the graph it's calling its forward function, which will be able to compute its local value in terms of its inputs, which have already been calculated, because it's topologically sorted. [01:06:06] And then it's running the backward pass, and in the backward pass you're reversing your topological sort, and then you're working out the gradient, which is going to be the multiplication of the upstream error signal times your local gradient. And so what a human being has to implement is that for anything, whether it's a single gate (here's a multiply gate) or a neural network layer, you have to implement a forward pass and a backward pass. [01:06:37] So here, for my baby example, since we're just doing multiplication, my forward pass is that I just multiply the two numbers and return it. So I'm
specifying that for the local node. [01:06:55] And then the other part is that I have to work out those gradients, and well, we sort of know how to do that, because that's the examples that we've been doing here. But notice that there's sort of a trick here, right? For what I've got now, you kind of can't write down what the gradients are, because backward is just taking as an input the upstream gradient, and you can't work out what the downstream gradients are going to be unless you know what function values you're calculating it at. [01:07:32] So the standard trick, which is how everyone writes this code, is that you rely on the fact that the forward is being calculated before the backward, and so your forward method shoves the values of the inputs into some local variables of the class, and then you have them available. So when you get to the backward pass, you can do what we did before: the dx is going to be the upstream error signal times the opposite input, and similarly for dy, and that's going to give us the answer. [01:08:17] Okay, just two last things then to mention. Yeah, so doing this, you need to get the math right for what's the derivative of your function, so that you get the right backward calculation. [01:08:34] So the standard way to check that you've got the right backward calculation is to do manual gradient checking with numeric gradients. The way you do that, like for the couple of examples I did when I said, oh, let's check it by going from 1 to 1.1, what should the slope be approximately, is to do that in an automated way. So we're going to say: at the value x, let's estimate what the gradient should be, and the way to do that is to pick a small h.
[01:09:14] There isn't a magical number for h, because it depends on the function, but typically, for neural networks, around 10^-4 is good. So you take a small h and work out the function value (the forward part) at x + h and at x - h, divided by the run, which is 2h, and that should give you an estimate of the slope, of what the backward pass is calculating. And you want those two numbers to be approximately equal, you know, within some 10^-2 of each other, and then probably you're calculating the gradient right; and if they aren't equal, then you've probably made a mistake. [01:09:59] Yeah, so note that this formula differs from the version I did for my examples: there I just compared x with x + h, right, a one-sided estimate, which is normally what you get taught in a math class. If you're doing this to check your gradients numerically, you're far, far better off doing this two-sided estimate, because it's much more accurate and stable when you're stepping equally to both sides of your x. [01:10:32] Yeah, so this looks easy to do, so if it's just so good, why doesn't everyone do this all the time and forget about calculus? You know, the reason you don't want to do this is that doing this is incredibly slow, right, because you have to repeat this computation for every parameter of your model, so you're not getting the kind of speed-ups you're getting from the backpropagation algorithm. But it's useful for checking your implementation is correct. [01:11:02] You know, in the old days, before frameworks like PyTorch, we used to write everything by hand, and people often got things wrong. Nowadays it's less needed, but it's good to check, if you've implemented your own new layer, that it's doing the right thing.
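The two pieces just described (a gate whose forward caches its inputs for backward, and the two-sided numeric check with a small h) take only a few lines. A sketch under my own class shape, in the spirit of, but not identical to, any real framework's interface:

```python
class MultiplyGate:
    """A single gate: forward caches its inputs so backward can use them."""
    def forward(self, x, y):
        self.x, self.y = x, y      # the standard trick: stash the inputs
        return x * y

    def backward(self, upstream):
        # Downstream gradient = upstream error signal * the opposite input.
        return upstream * self.y, upstream * self.x

def numeric_grad(fn, x, h=1e-4):
    """Two-sided estimate (f(x+h) - f(x-h)) / 2h: more accurate and stable
    than the one-sided (f(x+h) - f(x)) / h version."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)     # 6.0
dx, dy = gate.backward(1.0)      # analytic gradients: 2.0 and 3.0

# Gradient check: analytic vs numeric should agree to within ~1e-2.
nx = numeric_grad(lambda v: gate.forward(v, 2.0), 3.0)
ny = numeric_grad(lambda v: gate.forward(3.0, v), 2.0)
assert abs(dx - nx) < 1e-2 and abs(dy - ny) < 1e-2
```

This is why backward can be written purely as a function of the upstream signal: the cached self.x and self.y from the earlier forward call supply the rest.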
[01:11:25] Okay, yeah, so that's everything that we need to know about neural nets: backpropagation is the chain rule applied efficiently. The forward pass is just function application; the backward pass is the chain rule, applied efficiently. [01:11:38] So, you know, we're going to inflict pain on our students by making them do some math and calculate some of these things and do the homework, and I know that'll be harder for some of you than for others. You know, in some sense you don't actually need to know how to do this. The beauty of these modern deep learning frameworks is that they'll do it all for you: they predefine common layer types, and you can just plug them together like pieces of Lego and they'll be computed right. [01:12:11] And this is precisely the reason that high school students across the country and the world can now do deep learning projects for their science fairs: because you don't actually have to understand any of this math, you can just use what's given to you. But we kind of want to hope that you actually do understand something about what's going on under the hood and how neural networks work, so therefore we make you suffer a little bit. [01:12:41] And of course, if you're wanting to look at and understand more complex things, you need to have some sense of what's going on. So later on, when we get on to recurrent neural networks, we'll talk a bit about things like exploding and vanishing gradients, and if you want to have some understanding about why things aren't working and things are going wrong, then you want to know what it's actually calculating, rather than just thinking it's all black-box magic. And so that's why we hope to have taught something about that. [01:13:13] Okay, I think I'm done, if the audience is sufficiently stunned, and we can
stop for today. Okay, thank you.

================================================================================ LECTURE 004 ================================================================================

Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 4 - Dependency Parsing
Source: https://www.youtube.com/watch?v=KVKvde-_MYc
---
Transcript

[00:00:05] Okay, hi everyone. Okay, so for today we're going to, you know, I guess do a 180 from where we were on Tuesday. So today I'm going to talk about syntactic structure, the linguistic structure of human language sentences, dependency parsing, and then how you go about building dependency parsers. So we're solidly in the linguistics zone today. How many people in the audience have done a linguistics class? Yay, okay, so some people have done linguistics classes, okay, great. And for the rest of you, well, you know, this is your chance to see a little bit of human language structure, and if you like it you can enroll in a linguistics class later on. Yeah, so, oops. So, you know, so assignment two we
handed out on Tuesday. So in the second half of assignment two, what your job is, is to build a neural dependency parser using PyTorch. As we'll sort of come to later on, really the bit that you have to build is just the machine learning bit of making decisions, and really we give you most of the rest of the neural dependency parser. But this is also then a chance to remind you that assignment two, in that second half, uses PyTorch, one of the leading deep learning frameworks. So if you're not familiar with that, it'd be a really good idea to also go along to the Friday PyTorch tutorial, though we have tried to make assignment two so it's a fairly good place for learning PyTorch as you go along. We'll say more soon about final projects, but you're certainly already encouraged to come and meet with TAs or me about final projects, and we're putting up
information about the TAs, so you can know more about them, on the office hours page. [00:02:10] Okay, so let's get straight into it and start looking at linguistic structure. So in thinking about the linguistic structure of human languages, there are two primary ways that people have thought about it. One way is using the idea that linguists normally call phrase structure, which is then represented in terms of what computer scientists normally know as context-free grammars. So I'm going to spend a couple of minutes going over that view of things, but you know, actually it's not the main one that I'm going to talk about in this class. I'm going to spend most of this class talking about an alternative way of thinking about things, called dependency grammars. There are actually some correspondences you can make between the two
ways of thinking about things, but I'm not going to go into those here today. [00:03:08] So for the constituency grammar, or phrase structure, version of things, the way that you go about thinking about the structure of human languages is: well, there are words. Languages have lots of words, hundreds of thousands of words, but it seems like a lot of the words, nearly all the words in fact, fall into a few basic classes that represent their nature and how they behave in sentences. So for words like the examples here, we have nouns: cat is a noun, door is a noun, but you know, something like linguistics is also a noun. So we have nouns, and then we have other kinds of words. Something like cuddly is an adjective, a word that can modify nouns. And then we have 'the', for 'the cuddly cat'. 'The' is sort of a slightly more complex one as to how to name it; normally in modern linguistics we refer to words like that as
a determiner. [00:04:18] You might also see the name article, and sometimes, when people try to shoehorn human language into eight part-of-speech categories, they say it's an adjective, but it doesn't really behave like regular adjectives. And then we have words like by, or through, or on, and to, and ones like that, and so they're then prepositions. Right, so we have these classes, with lots of words fitting into each class, and so they're referred to conventionally as parts of speech. But then once we've got words, we start putting them into bigger units. So 'the cuddly cat' is some kind of unit, and it seems like this is an explication of a noun, cat, and so this gets referred to as a noun phrase. And then 'by the door', well, this is a phrase, but actually it has inside it 'the door', and that's a noun phrase, but this bigger unit here of by the
door is then a prepositional phrase. [00:05:19] And we can continue to build bigger units. So inside this we have the phrase that we've already looked at, with the noun phrase and a prepositional phrase, but then we can have another noun phrase, 'the cuddly cat', and we can put them together and build a bigger noun phrase, 'the cuddly cat by the door'. And so to represent this, you can start to write a phrase structure grammar, or a context-free grammar, that represents what the possibilities are for building up sentences here in English; similar kinds of phrase structure grammars can be written for other languages. So this is sort of starting to give you possible structures for a noun phrase. So you can have: a noun phrase just goes to a determiner followed by a noun. But then, as well as 'the cat' and 'a dog', you can have 'the large cat', so you might say that, okay, rather than that, I might
want to have as a better rule that a noun phrase goes to a determiner, an optional adjective, and then a noun. [00:06:33] If you think about it, you can sort of have multiple adjectives, so you can have 'the large green cat' or something like that. So you can really get multiple adjectives stacking up, and that sort of star, the Kleene star, says you can have lots of them: 'the large cuddly green cat'. But then you can stick things after the noun phrase, so you can put in these prepositional phrases, like 'in a crate'. So we might also want to say that a noun phrase can be rewritten as a noun phrase followed by a prepositional phrase, where a prepositional phrase can be represented by a preposition followed by a noun phrase. And somewhere we're also going to want to represent our parts-of-speech membership, so a determiner can go to words like 'a' or 'the', and an adjective can go to words
like large or cuddly, or many other words that I'm not going to write down, and a preposition can go to words like in, on, under, etc. [00:07:52] Okay, so now I've got a little grammar here, and this little grammar could sort of generate everything I've got in these sentences. Well, actually it can do this one too; it can do the 'large barking' one, where there are multiple modifiers. But then if I start going beyond these noun phrases and, say, think of sentences like 'talk to the cat' or 'talk to the large cuddly dog by the door', well, now I've got here a verb, talk, and I've again got a preposition. So I might then have more rules that say I can have a verb phrase, and the verb phrase can go to a verb and then a prepositional phrase, and then that could explain these two sentences as well. And in this kind of way I can start to build up a grammar of the structure of English sentences as a context-free grammar.
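The little grammar built up so far can be written down and checked in code. Here is a minimal sketch of my own (not the assignment's parser; the word lists and function names are just illustrative), treating the rules as NP → Det Adj* N (PP)*, PP → P NP, and VP → V PP:

```python
# A recognizer for the toy grammar from the lecture -- my own minimal sketch.
# Rules:  NP -> Det Adj* N (PP)*    PP -> P NP    VP -> V PP
DET = {"a", "the"}
ADJ = {"large", "cuddly", "green", "barking"}
N   = {"cat", "dog", "door", "crate", "kitchen"}
P   = {"by", "in", "on", "under", "to"}
V   = {"talk", "look"}

def parse_np(words, i):
    """Try to parse an NP starting at position i; return the end position or None."""
    if i < len(words) and words[i] in DET:
        i += 1
        while i < len(words) and words[i] in ADJ:  # Adj* -- the Kleene star
            i += 1
        if i < len(words) and words[i] in N:
            i += 1
            while True:                            # NP -> NP PP, done iteratively
                j = parse_pp(words, i)
                if j is None:
                    return i
                i = j
    return None

def parse_pp(words, i):
    """PP -> P NP."""
    if i < len(words) and words[i] in P:
        return parse_np(words, i + 1)
    return None

def accepts_vp(sentence):
    """True if the whole sentence is a VP -> V PP under the toy grammar."""
    words = sentence.lower().split()
    if not words or words[0] not in V:
        return False
    return parse_pp(words, 1) == len(words)

print(accepts_vp("talk to the cat"))                           # True
print(accepts_vp("talk to the large cuddly dog by the door"))  # True
```

Because NP → NP PP is left-recursive, the sketch attaches trailing prepositional phrases in a loop rather than by recursion; a recognizer like this only says whether a string fits the grammar, it does not enumerate the (possibly many) parse trees.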
Make sense? Yeah, okay. [00:08:55] And so that's what has been quite commonly done in linguistics and elsewhere. Okay, uh, yeah, so let me just do that once more with this one. So one thing I can do here is say, oh, I'm going to look at this with its phrase structure, and if I write it upside down, to give myself some space for later, you know, I could start making a phrase structure for this sentence. I'll start to run out of space, but I can sort of start to make this phrase structure of the sentence. So that's phrase structure. But there's another form of representation that has been fairly widely used in linguistics, and has been commonly used in NLP, and that we're going to use for the parsers we build, and that's dependency structure. So dependency
structure represents things in a slightly different way. [00:10:11] It thinks about which words are the main word, or head, and then which words they take as modifiers or arguments. So for 'look in the large crate in the kitchen by the door', well, this is describing a looking command, so the head of the whole thing is 'look'. And then 'look' is taking one or more arguments or modifiers, and well, what the looking is saying here is: what you want to do is look in the large crate. So we are looking in something, and what we're looking in is a crate. And then the crate has some modifiers: it's a large crate, it's the large crate. And then the crate is also placed somewhere; it's placed in the kitchen, so that 'in the kitchen' is also modifying 'crate'. And then we've got over here the 'by the door'; well, the 'by the door' is also modifying 'crate', so we've
also got a link down over to here, and that gives us our piece of structure here. [00:11:32] Which, having filled that in, makes me realize I actually got it wrong when I was doing the constituency representation, whoopsie. In the constituency representation I made 'the kitchen by the door' into a phrase, and that was actually wrong, whoops, bad me. So what I should have actually had was: we had another prepositional phrase that went to a noun phrase, for 'in the kitchen', and then both of those were coming off a bigger noun phrase, like that. Whoopsie. Okay, I get it right most of the time. Okay, but so this idea of dependency structure is that we're sort of finding what is the head word, and then we're saying which things modify the head word, and either of these representations we can use to work out what the structure of sentences is, in terms of which words go together and which words modify other words.
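The head-and-modifier analysis just built on the board can be written down as one head index per word. This is my own encoding of the idea, following the older convention where the preposition heads its noun phrase (Universal Dependencies would instead make the noun the head and attach the preposition to it as a case marker):

```python
# The dependency analysis of "look in the large crate in the kitchen by the
# door" as a head table -- my own sketch. Word positions are 1-based; head 0
# marks the root of the sentence.
SENT = ["look", "in", "the", "large", "crate",
        "in", "the", "kitchen", "by", "the", "door"]
HEADS = {
    1: 0,   # look    <- ROOT
    2: 1,   # in      <- look    ("look in ...")
    3: 5,   # the     <- crate
    4: 5,   # large   <- crate   (it's a large crate)
    5: 2,   # crate   <- in
    6: 5,   # in      <- crate   (the crate in the kitchen)
    7: 8,   # the     <- kitchen
    8: 6,   # kitchen <- in
    9: 5,   # by      <- crate   (the crate ... by the door)
    10: 11, # the     <- door
    11: 9,  # door    <- by
}

def dependents(heads, h):
    """All words whose head is position h, in sentence order."""
    return [SENT[d - 1] for d in sorted(heads) if heads[d] == h]

print(dependents(HEADS, 5))  # ['the', 'large', 'in', 'by'] -- crate's modifiers
```

Because every word gets exactly one head, the whole analysis is just one integer per word, and that is essentially the representation a dependency parser has to predict.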
[00:12:43] And so the basic idea is: when humans communicate, we communicate in a linear stream. So if it's conventional writing systems, it's a linear stream of words that you're reading; if it's spoken language, like you're understanding me speaking right now, it's not a linear stream of words, it's a linear sound stream. And you know, when people speak, there isn't white space between words. When people speak, occasionally people pause at the end of a clause or sentence or something, but in general I'm just sort of speaking continuous words that run one into each other, so that there's a linear sequence of sounds coming out of my mouth, and you have to deal with all of that. But if you're then thinking, oh gee, I can actually understand Chris talking, then somehow you're taking that
linear stream and you're turning it into a meaning, where certain words are modifying other words, and you have these bigger units, like constituents, that are part of understanding the meaning of the sentence. And so human listeners need to work out what modifies what to be able to understand sentences correctly. [00:14:04] And so, similarly, our models need to be able to understand sentence structure in order to be able to interpret language correctly. And so what we're going to be doing for building dependency parsers is we're going to be explicitly building a neural network model that says: let's find the structure of these sentences. In a way, we actually move away from that later on, because when we move into Transformer language models, they just take in the sequence of words, but actually, inside the parameters of the neural network, they're recognizing and building the same kind
of structural units, and we'll talk about that later in the class. [00:14:47] To give you more of a sense of how understanding what modifies what is important for interpretation, here are a few funny examples from newspaper headlines. And they're funny examples because these sentences don't just have one way of interpreting them. When you have a sequence of words, commonly in human languages, sequences of words are ambiguous, and it's relying on human interpretation of what makes sense and what goes together to work out how to read them. So here's a first example: 'scientists count whales from space'. Now, that's ambiguous, and you can give this two possible readings. So how can you give this headline two possible readings? Yeah, one is that the scientists are in space counting whales, and the other one is that they're
whales from space. [00:15:54] Yeah, so one possibility is: we've got this prepositional phrase here, and one possibility is that this prepositional phrase is modifying, yeah, it's modifying 'whales', so they're whales from space. And the other possibility is that it's the counting that's happening from space, so the scientists are counting them from space. Okay, so that corresponds to my two pictures here. So in one picture, it's the counting that is happening from space, which is actually the right interpretation of what the article is about. But in the other interpretation, we have space whales, and the scientists are counting the space whales that are arriving, or something like that, and so then we have the 'from space' modifying the 'whales'. [00:16:55] Okay, so what we have here, right, is a prepositional phrase which comes after a noun phrase.
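The two readings of the headline differ in a single head. One way to make that concrete is a head-index table, one head per word; this is my own illustration (0 marks the root, and, as an assumed convention, the preposition heads its noun phrase):

```python
# "Scientists count whales from space": two dependency analyses that differ
# only in where the PP "from space" attaches -- my own sketch, 1-based word
# positions, head 0 = root.
WORDS = ["scientists", "count", "whales", "from", "space"]

HEADS_COUNTING_FROM_SPACE = {1: 2, 2: 0, 3: 2, 4: 2, 5: 4}  # PP modifies the verb
HEADS_SPACE_WHALES        = {1: 2, 2: 0, 3: 2, 4: 3, 5: 4}  # PP modifies "whales"

def attachment_of(heads, dep):
    """The word that position dep attaches to ('ROOT' if it is the root)."""
    h = heads[dep]
    return "ROOT" if h == 0 else WORDS[h - 1]

print(attachment_of(HEADS_COUNTING_FROM_SPACE, 4))  # count  -- counting from space
print(attachment_of(HEADS_SPACE_WHALES, 4))         # whales -- space whales
```

Everything else in the two tables is identical; the whole ambiguity is the head of word 4, which is exactly the kind of decision a dependency parser has to make.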
It's just a one-word noun phrase here, 'whales', that's fine, and then before that is a verb. [00:17:06] And so one place in English where you get a lot of ambiguities is from these prepositional phrases, because whenever you get prepositional phrases, and prepositional phrases are really common in English if you think about it, whenever you get them like this, it's always ambiguous as to (oops), it's always ambiguous as to what earlier thing in the sentence they're a dependent of. And so, you know, you can sort of put in another prepositional phrase, 'in the morning' or something like that, and then the ambiguities just multiply. And so the important thing to notice here about human language is that human language is, in syntactic terms, globally ambiguous, right? So in programming languages you have local ambiguities of interpretation. How many people have done a compilers class? I think very few these days. Anyone done
a compilers class? Okay, it looks like fewer people have done a compilers class than a linguistics class; that's interesting. Okay, well, I won't make too many analogies to compilers classes. You know, when I was young, that was still the old days kind of CS curriculum, where writing interpreters and compilers was seen as the mainstay of computer science education, but no more, I guess. [00:18:40] Yeah, so in programming languages you can have a local ambiguity, but ambiguities are always resolved, right? So we have simple rules in programming languages, that 'else' is construed with the nearest 'if'. You know, it's a bit different in Python because of its indentation, but you know, there are rules, so that there's never global ambiguity in a programming language. But human languages just aren't like that, right? There's
nothing that resolves which of these two [00:19:12] nothing that resolves which of these two readings is correct if I made it a [00:19:14] readings is correct if I made it a bigger sentence that'd still be [00:19:16] bigger sentence that'd still be ambiguous you're just sort of meant to [00:19:18] ambiguous you're just sort of meant to read it and use context in your [00:19:20] read it and use context in your intelligence to decide um what's going [00:19:23] intelligence to decide um what's going on and so to take a a bigger but real [00:19:27] on and so to take a a bigger but real example um this is the kind of boring [00:19:31] example um this is the kind of boring sentence that you can read in the Wall [00:19:33] sentence that you can read in the Wall Street Journal most mornings um the [00:19:36] Street Journal most mornings um the board approved its acquisition by Royal [00:19:38] board approved its acquisition by Royal Trustco limited of Toronto for $27 a [00:19:41] Trustco limited of Toronto for $27 a share at its monthly meeting um so um [00:19:45] share at its monthly meeting um so um what you can see in this sentence is [00:19:47] what you can see in this sentence is we've got a verb and then we've got a [00:19:50] we've got a verb and then we've got a noun phrase and then what are what after [00:19:53] noun phrase and then what are what after that we have four prepositional phrases [00:19:56] that we have four prepositional phrases in a row okay so what do these [00:19:58] in a row okay so what do these prepositional phrases modify so what [00:20:01] prepositional phrases modify so what does by Royal Trustco limited [00:20:06] modify the acquisition right so it's the [00:20:09] modify the acquisition right so it's the acquisition by Royal Trust Co then of [00:20:13] acquisition by Royal Trust Co then of Toronto [00:20:15] Toronto modifies so it's Royal Trustco limited [00:20:18] modifies so it's Royal Trustco limited of [00:20:19] of Toronto um so yeah 
later on [00:20:22] Toronto um so yeah later on prepositional phrases can also modify [00:20:25] prepositional phrases can also modify earlier prepositional phrases or at [00:20:27] earlier prepositional phrases or at least the noun phrase inside them Royal [00:20:29] least the noun phrase inside them Royal Trustco limited okay for $27 a [00:20:34] Trustco limited okay for $27 a share is back to modifying the [00:20:38] share is back to modifying the acquisition okay at its monthly [00:20:41] acquisition okay at its monthly meeting is is yeah it's the approval so [00:20:45] meeting is is yeah it's the approval so it's gone way back up to there right um [00:20:49] it's gone way back up to there right um but you know um yeah so you know so if [00:20:53] but you know um yeah so you know so if you start having sentences um with a [00:20:56] you start having sentences um with a whole bunch of prepositional phrases [00:20:58] whole bunch of prepositional phrases like this you can start getting more and [00:21:01] like this you can start getting more and more ambiguities of attachment I mean [00:21:04] more ambiguities of attachment I mean you don't get the [00:21:06] you don't get the full you don't get the sort of full free [00:21:09] full you don't get the sort of full free choice factorial number of attachment [00:21:12] choice factorial number of attachment points because there is a restriction [00:21:15] points because there is a restriction that these dependencies don't cross um [00:21:19] that these dependencies don't cross um so once you've gone back further you [00:21:21] so once you've gone back further you have to stay equally far back or go even [00:21:23] have to stay equally far back or go even back further back again but nevertheless [00:21:26] back further back again but nevertheless so the number of readings you get is [00:21:28] so the number of readings you get is the Catalan series which is a series you see [00:21:31] the Catalan series which is a series you
see in a whole bunch of other places if [00:21:32] in a whole bunch of other places if you've done any graph theory or anything [00:21:34] you've done any graph theory or anything like that you know if you're doing [00:21:36] like that you know if you're doing triangulations you get um Catalans [00:21:39] triangulations you get um Catalans because you get the same property um [00:21:41] because you get the same property um that things don't cross so it's an [00:21:43] that things don't cross so it's an exponentially growing um sequence of [00:21:46] exponentially growing um sequence of possible readings and so it quickly gets [00:21:49] possible readings and so it quickly gets very big so I think when you you've got [00:21:53] very big so I think when you you've got four prepositional phrases you get 13 [00:21:56] four prepositional phrases you get 13 readings and if you have five you get 27 and [00:22:00] readings and if you have five you get 27 and you know grows up from there so you get [00:22:02] you know grows up from there so you get a lot of ambiguities but the crucial [00:22:04] a lot of ambiguities but the crucial thing to notice is you know human beings [00:22:07] thing to notice is you know human beings read sentences like this every morning [00:22:10] read sentences like this every morning or at least people who work in banking [00:22:12] or at least people who work in banking do while you know having their Corn [00:22:14] do while you know having their Corn Flakes and you know they their brain [00:22:17] Flakes and you know they their brain doesn't explode trying to think about [00:22:19] doesn't explode trying to think about the 13 different readings and which one [00:22:21] the 13 different readings and which one is correct right we just sort of do this [00:22:24] is correct right we just sort of do this as we go along and it's sort of obvious [00:22:27] as we go along and it's sort of obvious um okay let's just do a couple more [00:22:29] um okay let's
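The exponential growth described here can be checked numerically. A minimal sketch using the standard closed form for the Catalan numbers (the exact reading counts quoted in the lecture depend on precisely which attachments are counted, but the Catalan-style growth is the point):

```python
from math import comb

def catalan(n):
    # n-th Catalan number: C_n = (2n choose n) / (n + 1).
    # Catalan numbers count non-crossing structures, e.g. polygon
    # triangulations, and likewise non-crossing PP attachments.
    return comb(2 * n, n) // (n + 1)

# Grows roughly like 4^n / n^1.5, so ambiguities multiply fast
# as prepositional phrases are added.
print([catalan(n) for n in range(1, 7)])  # [1, 2, 5, 14, 42, 132]
```

The non-crossing restriction is what drops the count from the full factorial number of free-choice attachment points down to this series.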
just do a couple more examples of where we get ambiguities in [00:22:33] examples of where we get ambiguities in um in human language so a different one [00:22:36] um in human language so a different one you get is coordination scope ambiguity [00:22:40] you get is coordination scope ambiguity so shuttle veteran and longtime NASA [00:22:42] so shuttle veteran and longtime NASA executive Fred Gregory appointed to [00:22:44] executive Fred Gregory appointed to board how is this sentence [00:22:47] board how is this sentence ambiguous does it mean two people or one [00:22:50] ambiguous does it mean two people or one person yeah so there can either be one [00:22:54] person yeah so there can either be one person Fred Gregory and they're both a [00:22:57] person Fred Gregory and they're both a shuttle veteran and a NASA [00:23:00] shuttle veteran and a NASA executive or it can be that there are [00:23:03] executive or it can be that there are two people there's a shuttle veteran and [00:23:07] two people there's a shuttle veteran and um there's a longtime NASA executive [00:23:10] um there's a longtime NASA executive Fred [00:23:11] Fred Gregory okay yeah so and we'd be kind of [00:23:14] Gregory okay yeah so and we'd be kind of capturing those by having extra grammar [00:23:17] capturing those by having extra grammar rules where a noun phrase can go to a [00:23:19] rules where a noun phrase can go to a noun phrase a conjunction and a noun [00:23:22] noun phrase a conjunction and a noun phrase um but then another another thing [00:23:26] phrase um but then another another thing that you get in English is um apposition [00:23:29] that you get in English is um apposition so you can have a noun phrase that's a [00:23:32] so you can have a noun phrase that's a descriptive noun phrase of another noun [00:23:34] descriptive noun phrase of another noun phrase like a name um you know the [00:23:37] phrase like a name um you know the author Fred Gregory or something like [00:23:39]
author Fred Gregory or something like that um um saying the word English again [00:23:44] that um um saying the word English again I I I meant to comment um so you know [00:23:48] I I I meant to comment um so you know I'm I'm only going to give English [00:23:49] I'm I'm only going to give English examples here um in different languages [00:23:53] examples here um in different languages you don't get all the same ambiguities [00:23:56] you don't get all the same ambiguities um so if you're familiar with say [00:24:00] um so if you're familiar with say Chinese um you might have thought about [00:24:02] Chinese um you might have thought about the prepositional phrase example of wait [00:24:05] the prepositional phrase example of wait a minute we don't have that one because [00:24:08] a minute we don't have that one because the prepositional phrase modifying the [00:24:10] the prepositional phrase modifying the verb would appear before the verb and [00:24:12] verb would appear before the verb and the object noun would be afterward so it [00:24:14] the object noun would be afterward so it would be completely unambiguous and [00:24:16] would be completely unambiguous and that's true um but you know that doesn't [00:24:19] that's true um but you know that doesn't mean that Chinese is unambiguous Chinese [00:24:22] mean that Chinese is unambiguous Chinese has lots of very bad [00:24:25] has lots of very bad ambiguities and um yeah it's just that [00:24:29] ambiguities and um yeah it's just that you know different languages have [00:24:30] you know different languages have different syntactic structures okay um [00:24:33] different syntactic structures okay um here's so sometimes um in English [00:24:38] here's so sometimes um in English especially when you're sort of in a more [00:24:39] especially when you're sort of in a more written form rather than having an [00:24:41] written form rather than having an explicit coordination word you can [00:24:45] explicit
coordination word you can just sort of use juxtaposition with a [00:24:47] just sort of use juxtaposition with a comma um to have the idea of [00:24:51] comma um to have the idea of coordination so here's a um fun example [00:24:55] coordination so here's a um fun example um from the first Trump Administration [00:24:58] um from the first Trump Administration of how we can have a coordination scope [00:25:01] of how we can have a coordination scope ambiguity um doctor no heart cognitive [00:25:05] ambiguity um doctor no heart cognitive issues um right so again this is the [00:25:08] issues um right so again this is the same kind of coordination scope [00:25:11] same kind of coordination scope ambiguity that it can either be kind of [00:25:13] ambiguity that it can either be kind of no heart and cognitive issues being [00:25:16] no heart and cognitive issues being conjoined together like that or else it [00:25:19] conjoined together like that or else it could be that it's no heart or cognitive [00:25:23] could be that it's no heart or cognitive issues being conjoined um together like [00:25:26] issues being conjoined um together like that you make the choice [00:25:29] that you make the choice um okay uh let's [00:25:33] um okay uh let's see oh this this is this is my risque [00:25:36] see oh this this is this is my risque one for a different kind of ambiguity um [00:25:39] one for a different kind of ambiguity um trigger warning um students get [00:25:42] trigger warning um students get firsthand job [00:25:45] firsthand job experience so this one is also um an [00:25:49] experience so this one is also um an ambiguity um you know as to whether [00:25:51] ambiguity um you know as to whether you're having the um the [00:25:55] you're having the um the firsthand and then both the job and the [00:25:59] firsthand and then both the job and the firsthand are modifying experience or [00:26:03] firsthand are modifying experience or there's this other reading if you have a
[00:26:04] there's this other reading if you have a smutty mind that might come to you [00:26:08] smutty mind that might come to you um okay one more fun one okay mutilated [00:26:12] um okay one more fun one okay mutilated body washes up on Rio Beach to be used [00:26:15] body washes up on Rio Beach to be used for Olympics beach [00:26:17] for Olympics beach volleyball okay so what are what are the [00:26:20] volleyball okay so what are what are the two possible readings of this sentence [00:26:23] two possible readings of this sentence you know these are real examples from [00:26:24] you know these are real examples from quality newspapers [00:26:27] quality newspapers um okay what are the two readings of [00:26:29] um okay what are the two readings of this sentence [00:26:35] yeah so we've got so so the here we have [00:26:39] yeah so we've got so so the here we have one of these [00:26:41] one of these infinitival um so infinitival verb [00:26:45] infinitival um so infinitival verb phrase to be used for Olympic beach [00:26:48] phrase to be used for Olympic beach volleyball and for [00:26:50] volleyball and for these as well you know they kind of have [00:26:53] these as well you know they kind of have the same effect as prepositional phrases [00:26:57] the same effect as prepositional phrases um that they can can modify um different [00:27:00] um that they can can modify um different things um so it can either be the Rio [00:27:03] things um so it can either be the Rio Beach that's going to be used for the [00:27:06] Beach that's going to be used for the Olympic beach volleyball or it's going [00:27:08] Olympic beach volleyball or it's going to be the mutilated body that gets used [00:27:10] to be the mutilated body that gets used for the um beach [00:27:13] for the um beach volleyball okay um yeah so the so these [00:27:16] volleyball okay um yeah so the so these are the kind of ways in which we sort of [00:27:18] are the kind of ways in which we sort of want 
to use um the structure of the [00:27:21] want to use um the structure of the sentence to understand what they're [00:27:23] sentence to understand what they're meaning we also use it in lots of sort [00:27:26] meaning we also use it in lots of sort of just sort of more [00:27:28] of just sort of more practical ways um when we're building [00:27:31] practical ways um when we're building various kinds of natural language [00:27:33] various kinds of natural language processing systems so you know a kind of [00:27:36] processing systems so you know a kind of thing that people often in practical [00:27:39] thing that people often in practical systems do is that they want to get out [00:27:41] systems do is that they want to get out facts of various kinds so for people who [00:27:44] facts of various kinds so for people who um do stuff with bioinformatics that [00:27:47] um do stuff with bioinformatics that they commonly want to get out things [00:27:49] they commonly want to get out things like protein protein interaction facts [00:27:51] like protein protein interaction facts and so commonly you can get those kind [00:27:54] and so commonly you can get those kind of facts out by looking for patterns so [00:27:58] of facts out by looking for patterns so you know have a verb of interacts that's [00:28:00] you know have a verb of interacts that's going to be indicating um an interaction [00:28:03] going to be indicating um an interaction pattern and well it's going to be taking [00:28:05] pattern and well it's going to be taking arguments so it's going to be taking a [00:28:08] arguments so it's going to be taking a subject and interacts with the [00:28:10] subject and interacts with the prepositional argument and so that will [00:28:13] prepositional argument and so that will be um an interaction that KaiC whatever [00:28:16] be um an interaction that KaiC whatever that is interacts with SasA but in [00:28:19] that is interacts with SasA but in this case the SasA is
coordinated with [00:28:21] this case the SasA is coordinated with the KaiA and the KaiB so it's also [00:28:24] the KaiA and the KaiB so it's also going to end up interacting with those [00:28:27] going to end up interacting with those two other things as well and so you can [00:28:29] two other things as well and so you can use the sort of sentence structure [00:28:31] use the sort of sentence structure patterns of a dependency parse to be [00:28:33] patterns of a dependency parse to be getting out the kind of um facts and [00:28:36] getting out the kind of um facts and events that you're interested in for [00:28:38] events that you're interested in for something like an um event understanding [00:28:40] something like an um event understanding system and people you know do these kind [00:28:43] system and people you know do these kind of analyses over biomedical texts [00:28:47] of analyses over biomedical texts to build up the kind of structured [00:28:49] to build up the kind of structured databases of known protein protein [00:28:51] databases of known protein protein interactions and things of that [00:28:54] interactions and things of that sort okay so linguistic structure is [00:28:58] sort okay so linguistic structure is useful um and it's syntactically very [00:29:03] useful um and it's syntactically very ambiguous and so you should think of [00:29:06] ambiguous and so you should think of humans as active interpreters that are [00:29:09] humans as active interpreters that are using their contextual knowledge both of [00:29:12] using their contextual knowledge both of earlier stuff in the text knowledge of [00:29:14] earlier stuff in the text knowledge of the world around them how the world [00:29:16] the world around them how the world works to work out the right um structure [00:29:19] works to work out the right um structure yeah so now I want to go on um and show [00:29:22] yeah so now I want to go on um and show you a bit more about sort of
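The pattern matching over a dependency parse described here can be sketched in a few lines. A toy illustration (the edge triples are hand-built with Universal-Dependencies-style labels, not the output of a real parser), for the lecture's example that KaiC interacts with SasA, KaiA, and KaiB: find the subject of "interacts", collect its oblique argument, and expand coordination so conjoined proteins share the role:

```python
# Dependency edges as (head, relation, dependent) triples for a
# sentence like "KaiC interacts rhythmically with SasA, KaiA and KaiB".
edges = [
    ("interacts", "nsubj", "KaiC"),   # subject of the verb
    ("interacts", "obl",   "SasA"),   # oblique (prepositional) argument
    ("SasA",      "conj",  "KaiA"),   # coordination with the oblique
    ("SasA",      "conj",  "KaiB"),
]

def interactions(edges):
    # Pattern: X interacts with Y, where Y is the oblique argument,
    # plus anything coordinated with Y.
    subj = next(d for h, r, d in edges if h == "interacts" and r == "nsubj")
    objs = [d for h, r, d in edges if h == "interacts" and r == "obl"]
    objs += [d for h, r, d in edges if r == "conj" and h in objs]
    return [(subj, o) for o in objs]

print(interactions(edges))
# [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```

This is the core of the pattern-based fact extraction the lecture describes for building structured protein-protein interaction databases; real systems run it over parser output rather than hand-built triples.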
dependency [00:29:25] you a bit more about sort of dependency grammars which is what we're going to be [00:29:27] grammars which is what we're going to be using so for dependency syntax that it [00:29:31] using so for dependency syntax that it postulates that you can capture the [00:29:34] postulates that you can capture the structure of a sentence by having these [00:29:38] structure of a sentence by having these sort of asymmetric um dependent [00:29:41] sort of asymmetric um dependent relations which we might just call [00:29:43] relations which we might just call arrows which are going from heads to [00:29:46] arrows which are going from heads to dependents so here the sentence is um [00:29:49] dependents so here the sentence is um bills on ports and immigration were [00:29:51] bills on ports and immigration were submitted by Senator Brownback [00:29:54] submitted by Senator Brownback Republican of Kansas and we're sort of [00:29:56] Republican of Kansas and we're sort of picking out heads um and then we got um [00:30:00] picking out heads um and then we got um things that depend on them that modify [00:30:03] things that depend on them that modify them um yeah so if you're in the um [00:30:08] them um yeah so if you're in the um video audience and you are educated in [00:30:11] video audience and you are educated in the United States and you're over the [00:30:13] the United States and you're over the age of 50 um or if you happen to go to [00:30:17] age of 50 um or if you happen to go to one of those kind of private schools [00:30:19] one of those kind of private schools where they also teach Latin um you might [00:30:23] where they also teach Latin um you might have seen sentence diagramming um so [00:30:26] have seen sentence diagramming um so Reed-Kellogg um sentence diagramming was [00:30:30] Reed-Kellogg um sentence diagramming was something that was actually very [00:30:31] something that was actually very widespread um in American Education um [00:30:35]
widespread um in American Education um which really was it was [00:30:38] which really was it was really dependency grammar a sort of [00:30:40] really dependency grammar a sort of a somewhat quirky form of dependency [00:30:42] a somewhat quirky form of dependency grammar where you had to write lines at [00:30:44] grammar where you had to write lines at different angles and stuff like that but [00:30:47] different angles and stuff like that but basically you're writing sort of heads [00:30:49] basically you're writing sort of heads and their dependents underneath them [00:30:51] and their dependents underneath them with different funny shaped lines um it [00:30:54] with different funny shaped lines um it also was dependency [00:30:56] also was dependency grammar okay um so this is the start of [00:30:59] grammar okay um so this is the start of a dependency grammar but just like the [00:31:02] a dependency grammar but just like the the funny angled lines of sentence [00:31:04] the funny angled lines of sentence diagramming normally people want to add [00:31:07] diagramming normally people want to add some more information than that um and [00:31:10] some more information than that um and so most commonly um that the arrows are [00:31:14] so most commonly um that the arrows are then typed by giving the name of some [00:31:17] then typed by giving the name of some grammatical relation so something can be [00:31:19] grammatical relation so something can be the noun subject or an oblique or an [00:31:25] the noun subject or an oblique or an appositional modifier or a case mark or [00:31:28] appositional modifier or a case mark or things like that um and um I I'm just [00:31:33] things like that um and um I I'm just trying to give you the idea of [00:31:35] trying to give you the idea of dependency grammars I'm not expecting [00:31:37] dependency grammars I'm not expecting you to master all of these names and [00:31:40] you to
master all of these names and ways of doing things um and you know [00:31:44] ways of doing things um and you know there are different systems of deciding [00:31:46] there are different systems of deciding what's heads and dependents and not all [00:31:48] what's heads and dependents and not all the details are important what you [00:31:51] the details are important what you should get into your head is just sort [00:31:53] should get into your head is just sort of the basic idea of what one of these [00:31:55] of the basic idea of what one of these does and some sense of oh it should be [00:31:58] does and some sense of oh it should be at the phrase level it should be [00:32:00] at the phrase level it should be representing what's modifying what so we [00:32:03] representing what's modifying what so we do actually ask some questions um on the [00:32:07] do actually ask some questions um on the assignment and so for the cases like the [00:32:10] assignment and so for the cases like the prepositional phrase what is it [00:32:12] prepositional phrase what is it modifying you should be able to give the [00:32:14] modifying you should be able to give the right answer to [00:32:16] right answer to that okay [00:32:19] that okay um yeah um okay so uh this is just a [00:32:24] um yeah um okay so uh this is just a little bit more um vocabulary so yeah we [00:32:27] little bit more um vocabulary so yeah we have these arrows or dependencies and so I'm [00:32:30] have these arrows or dependencies and so I'm going to say that they connect between a [00:32:32] going to say that they connect between a head and a dependent but sometimes [00:32:34] head and a dependent but sometimes people use other words like governor and [00:32:36] people use other words like governor and modifier and things like that um and so [00:32:40] modifier and things like that um and so dependencies are generally taken and [00:32:43] dependencies are generally taken and we'll be taking them as forming a tree so
[00:32:46] we'll be taking them as forming a tree so you've got something that's connected [00:32:49] you've got something that's connected acyclic and has a single root to it so [00:32:52] acyclic and has a single root to it so our single root is the top of the [00:32:54] our single root is the top of the sentence here [00:32:56] sentence here um so [00:32:58] um so um dependency so although what you see [00:33:02] um dependency so although what you see most often these days either in a [00:33:04] most often these days either in a Linguistics class or when you get taught [00:33:07] Linguistics class or when you get taught CS [00:33:08] CS 103 at Stanford or computer science what [00:33:12] 103 at Stanford or computer science what you see there is normally context free [00:33:15] you see there is normally context free grammars or phrase structure grammars I [00:33:17] grammars or phrase structure grammars I mean really you know it is dependency [00:33:20] mean really you know it is dependency grammars that have the really long [00:33:23] grammars that have the really long history so really the predominant way of [00:33:26] history so really the predominant way of representing the structure of human [00:33:28] representing the structure of human languages throughout human history is [00:33:31] languages throughout human history is dependency grammar um so the linguist um [00:33:34] dependency grammar um so the linguist um heralded as the first dependency [00:33:37] heralded as the first dependency grammarian or really the first person [00:33:39] grammarian or really the first person who tried to write the grammar of a [00:33:42] who tried to write the grammar of a human language period was Panini so [00:33:45] human language period was Panini so Panini was working with Sanskrit um [00:33:48] Panini was working with Sanskrit um Panini lived so long ago that actually [00:33:50] Panini lived so long ago that actually people don't really know when he lived I [00:33:53]
people don't really know when he lived I mean he lived somewhere between about [00:33:55] mean he lived somewhere between about the 4th and 8th century before the [00:33:57] the 4th and 8th century before the Common Era [00:33:58] Common Era but really no one knows when um but you [00:34:01] but really no one knows when um but you know he lived um sort of up in part of [00:34:04] know he lived um sort of up in part of actually what's now Afghanistan um and [00:34:07] actually what's now Afghanistan um and um motivated largely by religious [00:34:11] um motivated largely by religious reasons um he set about developing a [00:34:14] reasons um he set about developing a grammar of Sanskrit and the way he [00:34:16] grammar of Sanskrit and the way he represented the syntax of Sanskrit was [00:34:19] represented the syntax of Sanskrit was using a dependency grammar um so there [00:34:21] using a dependency grammar um so there was a lot of work on grammar in Arabic [00:34:24] was a lot of work on grammar in Arabic in the first millennium they used [00:34:26] in the first millennium they used dependency grammars um in contrast um [00:34:30] dependency grammars um in contrast um the idea of sort of having context free [00:34:33] the idea of sort of having context free grammars that's really really recent so [00:34:35] grammars that's really really recent so the first work on um phrase structure [00:34:38] the first work on um phrase structure grammars dates to the 40s and then was [00:34:41] grammars dates to the 40s and then was sort of um canonicalized by the work of [00:34:44] sort of um canonicalized by the work of Chomsky in the [00:34:46] Chomsky in the 1950s yeah so um a fact for the computer [00:34:51] 1950s yeah so um a fact for the computer science part of people in the audience [00:34:53] science part of people in the audience so computer dear computer scientists if [00:34:56] so computer dear computer scientists if you know about Chomsky computer
[00:34:58] you know about Chomsky computer scientists normally know two things [00:34:59] scientists normally know two things about Chomsky one is they hate on the [00:35:02] about Chomsky one is they hate on the Chomsky hierarchy that they were forced [00:35:04] Chomsky hierarchy that they were forced to learn um in CS 103 or equivalent [00:35:07] to learn um in CS 103 or equivalent classes and the second one is he's a [00:35:10] classes and the second one is he's a very left politician um but um if I only [00:35:13] very left politician um but um if I only deal with the first one of the two now [00:35:16] deal with the first one of the two now um the Chomsky hierarchy was not [00:35:19] um the Chomsky hierarchy was not invented either to torture Elementary [00:35:22] invented either to torture Elementary computer scientists um or um to explain [00:35:26] computer scientists um or um to explain fundamental facts about formal language [00:35:28] fundamental facts about formal language Theory the Chomsky hierarchy was [00:35:30] Theory the Chomsky hierarchy was actually invented in thinking about [00:35:33] actually invented in thinking about human languages because at that time and [00:35:37] human languages because at that time and in stuff that's come more often it was [00:35:40] in stuff that's come more often it was commonly the case that um people were [00:35:44] commonly the case that um people were modeling human languages with um Regular [00:35:48] modeling human languages with um Regular finite so finite State grammar [00:35:51] finite so finite State grammar equivalent mechanisms and Chomsky wanted [00:35:54] equivalent mechanisms and Chomsky wanted to argue that that was a completely [00:35:56] to argue that that was a completely inadequate um formalism to represent um [00:36:00] inadequate um formalism to represent um the complexity of human language and so [00:36:03] the complexity of human language and so it was in the context of arguments about 
[00:36:05] it was in the context of arguments about human language was why he developed um [00:36:07] human language was why he developed um the Chomsky [00:36:09] the Chomsky hierarchy okay um yeah so anyway uh [00:36:12] hierarchy okay um yeah so anyway uh that's enough of the history of that um [00:36:15] that's enough of the history of that um here's my uh picture of part of Panini's [00:36:18] here's my uh picture of part of Panini's grammar but actually or a version of it [00:36:21] grammar but actually or a version of it actually um this is really really [00:36:24] actually um this is really really misleading and because one of the [00:36:26] misleading and because one of the astounding facts about Panini's grammar and [00:36:29] astounding facts about Panini's grammar and part of why no one knows what century he [00:36:31] part of why no one knows what century he lived in was Panini's grammar was composed [00:36:35] lived in was Panini's grammar was composed orally um so this sort of kind of blows [00:36:38] orally um so this sort of kind of blows my mind you know it's it seems you know [00:36:42] my mind you know it's it seems you know um some of um the famous things in the [00:36:45] um some of um the famous things in the West like Homer's works right the [00:36:48] West like Homer's works right the Odyssey and The Iliad right they were [00:36:50] Odyssey and The Iliad right they were originally oral works that were passed [00:36:53] originally oral works that were passed down um in oral form you know you can [00:36:57] down um in oral form you know you can that seems hard to do but you can kind [00:36:59] that seems hard to do but you can kind of believe if you did plays in high [00:37:01] of believe if you did plays in high school or something that someone could [00:37:04] school or something that someone could um memorize the Odyssey perhaps um but [00:37:07] um memorize the Odyssey perhaps um but the idea that people could memorize a [00:37:10] the idea that people could
memorize a grammar of a [00:37:12] grammar of a language passing it down for hundreds of [00:37:15] language passing it down for hundreds of years um kind of blows my mind um but [00:37:18] years um kind of blows my mind um but that's exactly what happened um yeah [00:37:21] that's exactly what happened um yeah with Panini's grammar um so you know really [00:37:25] with Panini's grammar um so you know really although this is sort of an old [00:37:26] although this is sort of an old birchbark manuscript you know that [00:37:28] birchbark manuscript you know that really it probably dates from about a [00:37:31] really it probably dates from about a millennium after Panini um composed um [00:37:34] millennium after Panini um composed um his grammar okay getting back to the um [00:37:37] his grammar okay getting back to the um modern days um yeah so um for things to [00:37:41] modern days um yeah so um for things to know yeah so I mean we don't want you to [00:37:45] know yeah so I mean we don't want you to fixate on the sort of details of [00:37:47] fixate on the sort of details of dependency grammar structure providing [00:37:49] dependency grammar structure providing you have the rough idea but just one [00:37:51] you have the rough idea but just one thing um that you can possibly be [00:37:53] thing um that you can possibly be confused about is you know there people [00:37:57] confused about is you know there people do things in different ways one way in [00:38:00] do things in different ways one way in which they don't agree is even which way [00:38:02] which they don't agree is even which way to draw the arrows so some people draw [00:38:06] to draw the arrows so some people draw arrows um from the head pointing at the [00:38:09] arrows um from the head pointing at the dependents and there are other people [00:38:11] dependents and there are other people who draw the arrows starting at the [00:38:12] who draw the arrows starting at the dependent and pointing back at the
heads. [00:38:15] So modern dependency grammar largely follows the work of Lucien Tesnière, a French linguist. He did the arrows pointing from the head to the dependent, and so that's what I'm doing today, but you'll see both. We sort of said that normally you assume that you have a tree with a single root. It's kind of common, and it works out more easily for the parsing, if you add to a sentence a sort of fake ROOT node, so you know that's going to be the starting point, and it's going to take one dependent, which is the word that's the head of the sentence, and then you're going to work down from there. Okay, so before getting more into doing dependency parsing, I just wanted to take a little detour to tell you about the importance of what happened with the rise of
annotated data in natural [00:39:25] language processing. You know, this is sort of an interesting flip-flop that's occurred: today we're going to go in one direction, and in a later class we'll go in the other direction. So in early natural language processing, people started to see, oh, human languages have structure, so what we should do is start writing rules for the structure of human languages. You know, I started writing a few context-free grammar rules for the structure of English on that early slide, and you could also write dependency grammar structure rules. So people tried to do natural language processing by having rules: grammar rules, dictionaries of parts of speech, and things like that. That gave you parses that, in retrospect, worked out pretty badly, and it worked out pretty badly for a number of reasons. One reason is that although there are these sort of very canonical, clear
[00:40:27] structures in human languages, there's a very long tail of messy stuff where all kinds of weird usages start to emerge in human languages, which sort of means it's just really hard to get coverage with hand-written grammar rules. And that's because humans use language creatively, right? So you can start thinking of some of the things that you've probably come across. I'm probably not very good at young persons' slang usages of grammar these days, but the kind of ones that you might still be familiar with: in Star Wars you have Yoda talk, where you rearrange the sentences but people still understand them, right? So that's changing the word order. And earlier on than that, there was a bit of a fad for putting 'not' at the end of sentences: 'that's a really great idea... not.' And, well, you know,
people learn to [00:41:31] understand that, but it's different to regular grammar, right? So it's really hard to write a full grammar. But the bigger reason, actually, is the problem of ambiguity I talked about, right? If you just write a grammar, well, you know, my sentence with the prepositional phrases had 13 different parses, and you didn't have much reason to choose between them. But if you had information about how often words modify other words, then you could get some statistics and start to predict in which order which things modify other things. And so people wanted to start to be able to do that prediction, which underlies probabilistic or machine learning models. To be able to do that (the earliest antecedents were in the '60s, but really starting in the late '80s and into the '90s) people decided that the way you make progress in
natural language processing, natural language [00:42:32] understanding, is to build annotated data resources. And so all through the '90s and the 2000s, the name of the game for a lot of natural language processing was people building annotated data resources and then building machine learning systems on top using those resources. Now that's kind of gone into reverse and gone away again with large language models, which we'll get to in another week or so. But here's an example: this is the Universal Dependencies treebanks, which I've actually been heavily involved with, and it's a cool resource for all kinds of purposes, because it's actually a wide cross-linguistic database where there are over 100 different languages with sentences parsed with a uniform dependency formalism. So it's actually really good for things like cross-linguistic work and psycholinguistic work. But you know, what
these are is [00:43:30] taking sentences (I think Mamar was a famous goat trainer or something) and putting a dependency structure on them. It's sort of all written there, very squished down, and human beings are producing these dependency structures, and then this is giving us data that we can learn things from, like dependency parsers. And indeed, for what you do on homework 2, this is precisely what you'll be using, data of this sort, to build a dependency parser. It's going to learn that, you know, you have goat trainers and you have famous trainers, and so it'll build up statistics and information to predict what kinds of things are likely. Yeah, so starting off building a treebank like that feels kind of like, oh, this is going to be slow, hard work, and it is actually slow, hard work, but it proved to be a very
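As an aside: Universal Dependencies treebanks of the kind described here are distributed in the CoNLL-U format, one word per line with ten tab-separated fields. Below is a minimal reading sketch that ignores comment lines, multiword tokens, and empty nodes; the toy sentence is my own illustration, not taken from UD.

```python
# Minimal sketch: read (head, dependent) arcs from one CoNLL-U sentence.
# CoNLL-U fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.

def read_arcs(conllu_sentence: str):
    arcs = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # comment line
        fields = line.split("\t")
        if "-" in fields[0] or "." in fields[0]:
            continue  # skip multiword-token ranges and empty nodes
        idx, form = int(fields[0]), fields[1]
        head, deprel = int(fields[6]), fields[7]
        arcs.append((head, idx, deprel, form))  # head index 0 means ROOT
    return arcs

# Toy sentence "I ate fish" (illustrative annotation, not from a real treebank).
example = """1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\tate\teat\tVERB\tVBD\t_\t0\troot\t_\t_
3\tfish\tfish\tNOUN\tNN\t_\t2\tobj\t_\t_"""

for head, dep, rel, form in read_arcs(example):
    print(f"{head} -> {dep} ({rel}: {form})")
```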
effective strategy, because [00:44:34] it gave wonderful reusable resources: once people had done it once, all sorts of people could use it to build parsers, part-of-speech taggers, to do psycholinguistic models and all kinds of things. You'd get the sort of distributional frequency information that's good for machine learning. It also provided one other thing that's crucial: it gave a method to evaluate systems, to say how good they are at producing parses. This may seem kind of comical to you in the modern era of machine learning, but the fact of the matter is, when people did natural language processing in the '50s, '60s, '70s, nobody had evaluation methods. The way you showed people you had a good parser is you ran the program, you said type in a sentence, look, it's worked, it's a really good parser. There was no systematic evaluation of NLP systems
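To make the evaluation idea concrete: the standard metrics for dependency parsers (not named in the lecture at this point, but standard in the field) are unlabeled and labeled attachment score, the fraction of words assigned the correct head, and the correct head plus the correct label. A minimal sketch:

```python
# Sketch of treebank-based parser evaluation: UAS is the fraction of
# words whose predicted head is correct; LAS also requires the correct
# dependency label. Heads/labels are lists indexed by word position.

def uas_las(gold_heads, gold_labels, pred_heads, pred_labels):
    n = len(gold_heads)
    head_ok = sum(g == p for g, p in zip(gold_heads, pred_heads))
    both_ok = sum(
        gh == ph and gl == pl
        for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels)
    )
    return head_ok / n, both_ok / n

# "I ate fish": gold annotation (0 = ROOT) versus a parser that gets
# every head right but mislabels the object of "ate".
gold_h, gold_l = [2, 0, 2], ["nsubj", "root", "obj"]
pred_h, pred_l = [2, 0, 2], ["nsubj", "root", "iobj"]
print(uas_las(gold_h, gold_l, pred_h, pred_l))  # UAS 1.0, LAS 2/3
```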
whatsoever. [00:45:43] So actually saying, look, here's a thousand hand-parsed sentences, let's evaluate how well your parser does on those: that was actually a revolutionary new development that happened at the end of the '80s, but especially in the '90s. Okay, so now that we have all of that knowledge, we're going to want to start building dependency parsers, and so I'm going to show a particular way of doing dependency parsing, which is the one you're going to use in the assignment. But first off, it's worth thinking for a moment: what kind of information should a dependency parser have to make decisions? These are kind of the four factors, the sort of obvious things that are useful for dependency parsing. The first one is thinking of the two words at the ends of the arrow, as to whether they
are plausible, right? [00:46:49] So for 'the discussion of the outstanding issues was completed', to have 'discussion of issues', right, that's a plausible dependency. What's the silly one? To have something like 'the' being a dependent of 'completed' makes no sense at all. So: what words there are involved. The second one is dependency distance. You can have long-distance dependencies that go a long way, but most dependencies are short-distance; you know, a lot of words are depending on their neighboring words at a very short distance, so that's a good preference to have. As well as just the distance, it's somewhat informative knowing what's in between: it's rare for dependencies to span verbs or punctuation. And then there's a final one, which is to think of the valency of heads, and that's how many arguments
they take. [00:48:01] So if you have something like the verb 'broke', well, it probably has something to the left, because there's probably who did the breaking, and it probably has something to the right, because there might be 'the cup' or something like that. But it doesn't have to be that, because it could be 'the cup broke', so you can have something to the left but nothing to the right; but you sort of have to have something to the left. And conversely, you can't have any number of things: you can't just say 'he broke the cup the saucer the dish', right? It doesn't just take lots of arguments like that. So we've got a notion of valency like that. Yeah, there's one other tricky little notion in dependency parsing, which is: normally dependencies kind of nest, like this, and nesting dependencies corresponds to a tree structure as
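The four information sources just listed (which words, distance, intervening material, valency) can be sketched as a hypothetical feature extractor for a candidate arc. This is purely illustrative, my own toy code rather than any model from the lecture or the assignment; a statistical parser would turn features like these into scores.

```python
# Hypothetical sketch: extract the four kinds of evidence for a candidate
# arc head -> dependent. A sentence is a list of (word, POS-tag) pairs.

def arc_features(sentence, head_i, dep_i):
    head_word, _ = sentence[head_i]
    dep_word, _ = sentence[dep_i]
    lo, hi = sorted((head_i, dep_i))
    between = [pos for _, pos in sentence[lo + 1:hi]]
    return {
        # 1. bilexical affinity: which word pair would be linked
        "pair": (head_word, dep_word),
        # 2. dependency distance: most dependencies are short
        "distance": abs(head_i - dep_i),
        # 3. intervening material: arcs rarely span verbs or punctuation
        "spans_verb_or_punct": any(p in ("VERB", "PUNCT") for p in between),
        # 4. direction of attachment, a crude stand-in for valency preferences
        "dep_is_left_of_head": dep_i < head_i,
    }

sent = [("discussion", "NOUN"), ("of", "ADP"), ("issues", "NOUN"),
        ("was", "AUX"), ("completed", "VERB")]
print(arc_features(sent, 0, 2))  # candidate arc: discussion -> issues
```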
you'd have in a context-free [00:49:09] grammar. [Student question, partly inaudible] Yeah, because in a sense when I read the sentence, 'which I thought that the most important discussion'... so, fair enough, I will assert that this is a sentence, and 'discussion' is the subject of the verb 'completed'. And, you know, normally for a sentence we say the main thing in the sentence is its verb, and so that's why the root is heading to 'completed', and the subject of the verb is also an important thing. But the arguments of the verb, like the subject of the verb, the object of the verb if there is one, prepositional-phrase modifiers, they're all taken as dependents of the verb. [Student question] Following up on that: is it not the verb that you start with? So, if you have a sentence with a verb like this, that is always the answer. I mean, some of the details here depend on languages, but
there are languages in which [00:50:41] you don't have to have a verb in a sentence, and you can get things like... I mean, you can do it in sort of very restricted ways in English, right? So if you just say 'easy as pie', there's no verb, and so then you're saying 'easy', the adjective, which is sort of the predicate adjective, is then the head of the sentence. [Student question] Sorry, for a question like 'what is the story', is the 'is'...? We would still look at that as... that is complicated. Some people would say it is, and some people would say it isn't, and in particular in Universal Dependencies we don't actually say that 'is' is the head of the sentence. But I don't want to get too far into this; if you want, you could look more at how things are done. But, you know, I want to fully admit that dependency grammar isn't sort of one uniquely defined theory; people
[00:51:44] have had different ideas of which things to take as the head in various circumstances, and they argue about it; linguists argue about what the right structure is to put over all sorts of sentences. But the fact that people do things different ways doesn't mean that everybody doesn't agree that there are units, there are phrases and modifiers and ambiguities and so on between them. Okay, yeah, so normally we get this sort of nesting that corresponds to what you can build with context-free grammar structure, but sometimes in human languages you get dependencies that don't nest. So you get sentences like 'I'll give a talk tomorrow on neural networks', where actually the 'on neural networks' is modifying the talk, whereas the 'tomorrow' is an argument of 'give', and so you get these crossing dependencies, which are referred to as
non-projective [00:52:48] dependencies. You also get them when you form questions: 'who did Bill buy the coffee from yesterday?' The 'who' is the object of the preposition 'from', but it's been moved out to the front, and so that again gives us non-projectivity. If you think about it, you can still say that you have a dependency tree, but it's got the words in different orders, and so one of the things that you have to cope with for full dependency parsing is dealing with this non-projectivity. But actually we're not going to deal with it in our parsers; we're only going to do projective dependency parsing. Okay, so there are various ways that people do dependency parsing: people have done it by dynamic programming, people have done it using graph algorithms (if I have enough time at the end I might mention that again), and people have done it with
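The projectivity notion can be pinned down in a few lines: a parse is projective exactly when no two arcs cross if you draw them all above the sentence. A minimal sketch, where heads[i] gives the head of word i+1 and 0 stands for the fake ROOT:

```python
# Sketch: a dependency parse is projective iff no two arcs cross when
# drawn above the sentence. heads[i] is the head of word i+1 (0 = ROOT).

def is_projective(heads):
    # Represent each arc as an interval (min, max) over word positions.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (a, b) in arcs:
        for (c, d) in arcs:
            # Two arcs cross if one endpoint of the second lies strictly
            # inside the first interval and the other strictly outside.
            if a < c < b < d:
                return False
    return True

print(is_projective([2, 0, 2]))        # "I ate fish": projective
print(is_projective([2, 0, 4, 2, 3]))  # contains crossing arcs: non-projective
```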
constraint satisfaction methods, [00:53:55] if you saw those in CS221. But the most common way in practice that's emerged has been this transition-based parsing, which is kind of interesting as well and gives a very simple machine learning mechanism, so it makes it good for assignment 2, and that's what we're going to explore here. Okay, so what we do in greedy transition-based parsing... this is where it's unfortunate that only two people in the class have done a compilers class, because a simple form of parsing that's also used in compilers classes is something called shift-reduce parsing, where you start bottom-up and you start putting little units together and build bigger constituents. But if most people haven't seen it, that's not going to be very much help, so I'm going to give you a
concrete example. [00:55:04] So the things to know: we have two data structures (well, we have more than two, I guess) for dealing with the sentence. We have a buffer, which has the words of our input sentence, and then we start building pieces of sentence structure, which we put on a stack. A little trick to know is that for the buffer the top is written to the left, and for the stack the top is written to the right. We take actions which are like shift and reduce actions, and when we take arc-building actions we build up a set of dependency arcs, which are going to be the dependency structure of our sentence. That's all incredibly abstract, so I'm going to show an example, which hopefully will give a bit of the idea. So here's an example: I want to do this very simple example of parsing the sentence 'I ate
fish'. [00:56:13] So the way I do this is: I have my stack, and I start by putting the ROOT symbol on my stack, and then I have in my buffer all the words of the sentence, and so that's the start condition I've written in very small print there. Then for each step of processing I have a choice of three operations. I can either shift, which moves the top word on the buffer onto the stack, or I can do left-arc or right-arc, and these are my two reduce operations that build a little bit of syntactic structure by saying that one word is a dependent of another word, in either a left or a right direction. So here's a sequence of operations I can take. Starting off, the first thing I can do is shift, so then I've moved 'I' onto the stack. I can decide that I want to shift again, and so then I'd take 'ate' and also move it onto the stack, and so I've now got three things
on my stack so at this point you know I can do [00:57:26] stack so at this point you know I can do other things I mean in particular a left [00:57:29] other things I mean in particular a left Arc is going to say well I can take the [00:57:32] Arc is going to say well I can take the top two things on the stack and make the [00:57:37] top two things on the stack and make the uh the thing on the top The Head and the [00:57:40] uh the thing on the top The Head and the thing one down on the stack a dependent [00:57:43] thing one down on the stack a dependent of it so if I do a left Arc operation [00:57:47] of it so if I do a left Arc operation I'm effectively saying that the I is a [00:57:49] I'm effectively saying that the I is a dependent of eight and then I pop both [00:57:52] dependent of eight and then I pop both of then I pop the dependent off the [00:57:55] of then I pop the dependent off the stack back but I add on that I've built [00:57:59] stack back but I add on that I've built um a dependency that I made I a [00:58:01] um a dependency that I made I a dependent of eight I could then do [00:58:04] dependent of eight I could then do another shift operation so I shift fish [00:58:08] another shift operation so I shift fish um from the buffer onto the stack and [00:58:11] um from the buffer onto the stack and then I can do a right Arc which says um [00:58:15] then I can do a right Arc which says um okay I'm going to have fish as a [00:58:17] okay I'm going to have fish as a dependent of eight so then fish [00:58:19] dependent of eight so then fish disappears from the stack and I add in [00:58:22] disappears from the stack and I add in this new dependency saying fishes [00:58:26] this new dependency saying fishes dependent of eight um I then do right [00:58:29] dependent of eight um I then do right Arc again um which is then saying that [00:58:34] Arc again um which is then saying that um eight is a dependent of root so I'm [00:58:37] um eight is a dependent of 
root so I'm left with just root on my stack and I've [00:58:39] left with just root on my stack and I've built a new dependent saying eight is a [00:58:41] built a new dependent saying eight is a dependent of root and at this point I've [00:58:44] dependent of root and at this point I've gone to the finishing condition my [00:58:46] gone to the finishing condition my finishing condition is that my buffer is [00:58:48] finishing condition is that my buffer is empty and my um stack contains just the [00:58:52] empty and my um stack contains just the word root um and so this gives me a [00:58:56] word root um and so this gives me a little step set of operations referred [00:59:00] little step set of operations referred to as the transitions of [00:59:02] to as the transitions of transition-based passing and by making a [00:59:04] transition-based passing and by making a sequence of these different transitions [00:59:07] sequence of these different transitions I can build sentence structure and I've [00:59:10] I can build sentence structure and I've got choices of when to shift and when to [00:59:14] got choices of when to shift and when to reduce and whether to reduce left or [00:59:17] reduce and whether to reduce left or reduce right the arc left Arc right and [00:59:20] reduce right the arc left Arc right and so by making different ones of those [00:59:22] so by making different ones of those choices I could make any structure for [00:59:25] choices I could make any structure for the sentence that I wanted to so you [00:59:28] the sentence that I wanted to so you know if I somehow thought that this [00:59:31] know if I somehow thought that this sentence should have a different [00:59:33] sentence should have a different structure and that I should be the head [00:59:36] structure and that I should be the head and eights are dependent of that and [00:59:38] and eights are dependent of that and fishes are dependent of that well I [00:59:41] fishes are dependent of 
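In code, the shift / left-arc / right-arc mechanics of the walkthrough above can be sketched as follows. This is a hedged illustration, not the assignment's actual code; the hard-coded transition sequence stands in for the classifier (or oracle) that would normally choose each action:

```python
# A minimal arc-standard sketch of the "I ate fish" walkthrough.
# Buffer top is the front of the list; stack top is the end of the list.

def parse(words, transitions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":                   # move buffer front onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":              # second-from-top is dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))  # record (head, dependent)
        elif t == "RIGHT-ARC":             # top is dependent of second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

arcs = parse(["I", "ate", "fish"],
             ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
# arcs == [("ate", "I"), ("ate", "fish"), ("ROOT", "ate")]
```

Choosing a different transition sequence would yield a different set of arcs, which is exactly the point made in the lecture: the choice of operations determines the structure.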
[00:59:43] Well, I could achieve this by making some different choices: I'd now be doing a right-arc operation, so that "ate" would become a dependent of "I" rather than the other way around. So the choices of which operations I take determine the syntactic structure, the set of dependencies that I have built, which are my set of dependencies down here. Now, the set of dependencies I built were exactly the right ones, because at each step I took the right operation.
[01:00:18] And so the essential idea of transition-based parsing, and where it came to the fore, was this. There was a particular guy, and I've got a photo of him somewhere in a bit, I thought: Joakim Nivre, a Swedish NLP person. In the early 2000s he came up with the idea that, rather than doing the kind of dynamic programming and chart parsing that people commonly used to do with parsers, these days we have machine learning, so maybe we could build a fast, efficient parser, and the way we're going to build it is by making this sequence of transitions, and it'll be the job of the machine learning to predict what is the right transition at each point in time. If you do that, then at each point you're dealing with one thing, and so the number of operations you're doing to parse a sentence is linear. This gives a linear-time parsing algorithm, whereas if you've seen context-free grammars and stuff like that in CS 103, and you want to do anything where you're fully considering the parses and structures of context-free grammars, you've then got a cubic-time algorithm, which is much less pleasant to be dealing with. So, for the simplest form of transition-based parsing, you do no search whatsoever: at each step you're just predicting the next transition.
[01:02:02] And so you're doing this sequence of transition predictions as machine-learning operations, and that sequence gives you the parse structure of the sentence. The essential result that Nivre was able to show is that machine learning is good enough that you can do this and get a very accurate parser, despite the fact that it does no search whatsoever; it's just doing predictions in this way.
[01:02:38] Okay, so how did he do it? When he did this, in 2005, that was before neural networks came to the fore, and so the way he was doing it was by using a sort of older-style, symbolic, feature-based machine learning system. He had a big classifier, which might have been a logistic regression classifier, or something else like a support vector machine, and to power that he was using indicator features. The kind of features you'd use are: the word on the top of the stack is the word "good" and its part of speech is adjective; or, the word on the top of the stack is "good" but the word that's second on the stack is the verb "had". Right, you'd get these sorts of combinations of matching functions, and they would be used as features in a machine learning system to predict the parse. But the problem is that once you started building these features that were conjunctions of multiple terms, you ended up with millions and millions of features, because you're putting particular words into features and then combining choices of multiple words. So you had to deal with millions and millions of features, and furthermore, individual features were exceedingly sparse, so that you barely ever saw them: you'd have a feature that only turned up, you know, ten times in a million sentences, because it matched a very precise configuration. So on the one hand, by making these feature conjunctions, parsing got more accurate, and indeed people produced pretty accurate parsers in those days, but the parsers had these sorts of unappealing characteristics.
[01:04:33] Yeah, so before going on further, I should just explain how we evaluate dependency parsers. To evaluate dependency parsers, we're basically assessing: are you getting the dependency arcs (the arrows) you're proposing right? So here is someone's dependency parse of "she saw the video lecture"... well, actually, sorry, that's the gold parse, okay, that's a correct parse. "She saw the video lecture": that's a correct parse, so you can write out what the different dependencies are. Word 1's head is 2, word 2's head is 0, word 3's head is 5, word 4's head is 5, word 5's head is 2.
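As a hedged sketch of how such head/label lists are scored (the relation labels below are hypothetical stand-ins, not read off the slide), attachment accuracy can be computed like this:

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Unlabeled and labeled attachment score over one sentence."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gl == pl
              for gh, gl, ph, pl in zip(gold_heads, gold_labels,
                                        pred_heads, pred_labels)) / n
    return uas, las

# "She saw the video lecture": head index 0 is the root symbol.
gold_heads  = [2, 0, 5, 5, 2]
gold_labels = ["nsubj", "root", "det", "compound", "obj"]  # hypothetical labels
pred_heads  = [2, 0, 4, 5, 2]                 # word 3 attached to the wrong head
pred_labels = ["nsubj", "root", "det", "amod", "obl"]      # two labels also wrong

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
# uas == 0.8 and las == 0.4, matching the 80% / 40% figures in the lecture
```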
[01:05:25] So these pairs of numbers represent our dependencies. Then, if someone proposes a parse of the sentence, you can literally say, okay, which of these did they get right? So they didn't get this one right, but they got the rest of them right, so their accuracy is 80%. Sometimes people just assess the arcs unlabeled, and that's referred to as unlabeled dependency accuracy. But sometimes people also want to label them, with subject, determiner, object, etc., and ask: are you also getting the labels right? In this case only two of the five labels are right, so the labeled accuracy of the dependency parse is 40%.
[01:06:17] Okay, so that was sort of what people did until the mid-2010s. And, as I sort of already started saying, the problems with indicator features were: they were sparse, so you didn't see them often; and they were incomplete, because there were some words and combinations you'd seen, and some you just didn't see in the training data, so you're missing features. But the final problem is that just computing all those symbolic features was expensive. It turns out that if you did a runtime analysis, most of the time in parsing wasn't spent making the machine-learning decisions; it was simply spent computing the features that you put into this dependency parser. So as neural nets started to show that they were successful for things, that suggested that maybe you could build a better dependency parser by using a neural, transition-based dependency parser, which would benefit from the kind of dense and compact feature-vector representations that we've already started to see. And so that's what started to be explored, and in particular, someone who was then a PhD student of mine, and was actually head TA of 224n twice in the earlier days, built a neural transition-based dependency parser and showed the success of this method.
[01:07:56] So on this slide there's Nivre's transition-based dependency parser; people had also explored other methods of dependency parsing, so these were two graph-based dependency parsers. Essentially, among the symbolic-feature machine learning methods, Nivre's parser was really fast, because it was using this linear-time, transition-based parsing idea, whereas the graph-based dependency parsers were way, way slower, about 50 times slower, but they were slightly more accurate; you can see here that they're getting a bit better numbers. So essentially what she was able to show was that you could build something that was basically as accurate as the best known graph-based dependency parsers, but fast like other transition-based parsers. Indeed, you might think, oh, now I've got real numbers and matrices and stuff, surely that should be slowing me down; the reality was that the symbolic models spent so much time on feature computation that you could actually make the parser faster at the same time by using a neural network.
[01:09:16] Okay, so how did that work? Well, we've already seen word embeddings, so it's going to exploit word embeddings. It can use word representations, and that has the advantage that even if you haven't seen particular words in particular configurations, you've seen similar words, so it can exploit what's likely in terms of word similarity. But it went a bit further than that, because why only have distributed representations of words? We also have parts of speech, and although I sort of said just noun, verb, adjective, most actual part-of-speech systems in NLP are much more fine-grained.
[01:10:01] They have different parts of speech for plural nouns versus singular nouns, so those are different symbols, but they're very similar to each other, so we might give them distributed representations too, so that they're also close to each other. And the same goes for the types of our labels for dependencies: some of them are pretty closely related as well. So all of these were being given distributed representations. And then, to represent the state of the dependency parser for predicting transitions, you had the same kind of stack and buffer, and you take the key elements of the stack and the buffer, which are essentially the first thing on the buffer (the word that you would be shifting if you're going to do a shift) and the two things at the top of the stack (the things that you're considering combining if you're doing either a left-arc or a right-arc).
[01:10:58] For those, you're going to take the distributed representations of the words and their parts of speech, and also, with a bit more complexity, of dependencies you've already constructed, if something on the stack is already involved in a dependency. We take each of those distributed representations and just concatenate them together to produce a big vector, in the same way we were concatenating together the five words in the last class for predicting whether something was a location or not, and then we feed that into our neural network. So our input layer is our concatenated distributed representations; we put that through a hidden layer, which is, like we were talking about last time, Wx + b put through a ReLU nonlinearity; and then above that we put another matrix multiplication, so we've got a second layer of the neural network, plus b2; and we take the output of that and put it through a softmax, which gives a probability distribution over whether to do a shift, a left-arc, or a right-arc operation. And the other way that this crucially gave us more power is that other people's dependency parsers were still using linear classifiers, things like support vector machines or logistic regression, whereas we had a deep neural network that gave us a nonlinear classifier, and that's why we could be more accurate than previous transition-based parsers. So this essentially showed that you could build a very accurate neural dependency parser, one that outperformed the symbolic, probabilistic parsers.
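The architecture just described (concatenated embeddings, a ReLU hidden layer, a second affine layer, and a softmax over the three transitions) can be sketched with NumPy. The dimensions and random weights below are illustrative guesses, not the parser's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 48 stack/buffer elements (words + POS tags + dependency
# labels), each with a 50-dim embedding, all concatenated into one big vector.
d_input, d_hidden, n_transitions = 48 * 50, 200, 3   # shift, left-arc, right-arc

W1, b1 = rng.normal(0, 0.01, (d_hidden, d_input)), np.zeros(d_hidden)
W2, b2 = rng.normal(0, 0.01, (n_transitions, d_hidden)), np.zeros(n_transitions)

def predict_transition(x):
    """x: concatenated distributed representations of the parser state."""
    h = np.maximum(0.0, W1 @ x + b1)       # hidden layer: ReLU(Wx + b)
    logits = W2 @ h + b2                   # second layer: W2 h + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over the 3 transitions
    return p

x = rng.normal(size=d_input)               # stand-in for real state features
p = predict_transition(x)                  # probability of shift/left-arc/right-arc
```

In a real parser, the transition with the highest probability would be executed, the stack and buffer updated, and the loop repeated until the finish condition.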
[01:13:05] It was basically as good as any other dependency parser that was known. So, back a decade or so ago, this was a big hit; people got very excited about it. The people at Google got very excited about it, because this gave a scalable way (remember, it's linear time) in which you could efficiently go off and parse the entire web. So they did some further work on taking that model and improving it: they made a deeper neural network version, with bigger vectors and better-tuned hyperparameters, and they added on beam search. I've just presented the greedy version, where you always just immediately make the best choice, but you can improve these parsers by doing some amount of search; that does help. And so they pushed the numbers up: rather than the kind of 92 UAS here, they got it to, you know, 94.6.
[01:14:11] And, I mean, you're probably all too young to remember this, but really, at the time, in 2016, Google did their kind of typical big PR splash for dependency parsing, which kind of blew my mind, since I didn't ever think that anyone was really going to be writing articles in Wired and VentureBeat and those kinds of tech blogs. But, you know, Google had it all over the place as the world's most accurate parser, and they gave it a silly name, Parsey McParseface, which really worked well for getting lots of media pickup. So that was then a very successful parser.
[01:15:01] I've still got a couple of minutes left, so let me just do the last three slides to show you another way of doing things, which is actually also a powerful parsing method that is commonly used. So, that was transition-based parsing, and that's what you'll use in the assignment.
way of doing things with dependencies and paing can be done neural is what's [01:15:26] and paing can be done neural is what's referred to as graph-based dependency [01:15:28] referred to as graph-based dependency paes and in graph-based dependency paes [01:15:32] paes and in graph-based dependency paes what you do is um for each word you sort [01:15:38] what you do is um for each word you sort of ask for each word what am I a [01:15:41] of ask for each word what am I a dependent of right so if the sentence is [01:15:45] dependent of right so if the sentence is the big cat sat each word for example [01:15:48] the big cat sat each word for example big has to be a dependent of one of the [01:15:51] big has to be a dependent of one of the other four words in this sentence [01:15:53] other four words in this sentence including this possibility of root so we [01:15:56] including this possibility of root so we ask am I dependent of that am I [01:15:58] ask am I dependent of that am I dependent of root am I dependent of cat [01:16:01] dependent of root am I dependent of cat am I dependent of sat and we want to [01:16:03] am I dependent of sat and we want to score each of those possibilities and so [01:16:06] score each of those possibilities and so hopefully we decide the most likely one [01:16:09] hopefully we decide the most likely one is the Bigg as a dependent of cat and [01:16:12] is the Bigg as a dependent of cat and then we're going to do the same for [01:16:14] then we're going to do the same for every other word so you know sat could [01:16:16] every other word so you know sat could be a dependent of any of these words and [01:16:19] be a dependent of any of these words and so we could start asking okay which of [01:16:22] so we could start asking okay which of these words is it most likely a [01:16:25] these words is it most likely a dependent [01:16:27] dependent of uh beat to sat cat to sat um sorry [01:16:31] of uh beat to sat cat to sat um sorry that's 
unreadable now but hopefully we [01:16:34] that's unreadable now but hopefully we decide um that sat most likely as the [01:16:37] decide um that sat most likely as the verb is a dependent of root so we sort [01:16:41] verb is a dependent of root so we sort of scoring the N squared possible you [01:16:44] of scoring the N squared possible you know dependencies of the sentence and [01:16:47] know dependencies of the sentence and each one is given a score and then once [01:16:50] each one is given a score and then once we've done that our job is let me go to [01:16:54] we've done that our job is let me go to this one cleaner okay we've decided the [01:16:56] this one cleaner okay we've decided the good one there and so we we're going to [01:16:58] good one there and so we we're going to do this using some of the same features [01:17:01] do this using some of the same features we talked about looking at the words at [01:17:03] we talked about looking at the words at each end looking at what occurs between [01:17:05] each end looking at what occurs between them looking at what occurs around them [01:17:08] them looking at what occurs around them um thinking about um things um and then [01:17:12] um thinking about um things um and then once we've done that the only other [01:17:14] once we've done that the only other thing that's a constraint is well we [01:17:16] thing that's a constraint is well we want the dependencies to form a tree um [01:17:20] want the dependencies to form a tree um so that we need to do um something like [01:17:23] so that we need to do um something like a minimum spanning tree algorithm to [01:17:26] a minimum spanning tree algorithm to sort of find the minimum cost tree [01:17:28] sort of find the minimum cost tree because we don't want to find a solution [01:17:31] because we don't want to find a solution where there are Cycles or the parts of [01:17:34] where there are Cycles or the parts of the sentence end up disconnected with [01:17:35] 
the sentence end up disconnected with each other um and so that's graph-based [01:17:38] each other um and so that's graph-based dependency paes and so just as in the [01:17:42] dependency paes and so just as in the older symbolic paing days where the [01:17:44] older symbolic paing days where the graph based dependency paes were more [01:17:47] graph based dependency paes were more accurate than the transition based paes [01:17:50] accurate than the transition based paes um that we then started doing some work [01:17:53] um that we then started doing some work on neural graph based depend dependency [01:17:55] on neural graph based depend dependency paring and so here's our neurog graph [01:17:57] paring and so here's our neurog graph based dependency paring um which was [01:18:00] based dependency paring um which was then a bit over a percent more accurate [01:18:04] then a bit over a percent more accurate than pzy mpaz face the world's best um [01:18:07] than pzy mpaz face the world's best um dependency paer um so um so that got us [01:18:11] dependency paer um so um so that got us to 2017 I mean obviously this is still a [01:18:14] to 2017 I mean obviously this is still a few years ago but to get further into [01:18:17] few years ago but to get further into the latest um paring Stories We then [01:18:20] the latest um paring Stories We then need to sort of get into the ER of large [01:18:22] need to sort of get into the ER of large language models which I'm not doing [01:18:24] language models which I'm not doing today um but it's this neural graph [01:18:26] today um but it's this neural graph based dependency paa um that's in um [01:18:30] based dependency paa um that's in um stanza our open- source um paing [01:18:33] stanza our open- source um paing software that's available and that you [01:18:35] software that's available and that you can see it's using this algorithm as the [01:18:37] can see it's using this algorithm as the more accurate one okay so now 
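As a concrete illustration of the scoring idea above, here is a toy NumPy sketch for "the big cat sat". The arc scores are hand-made for this example (a real graph-based parser computes all n² scores with a neural network), and the decoding shown is the naive greedy version; as the lecture notes, a real parser instead runs a maximum spanning tree algorithm such as Chu-Liu/Edmonds so the result is guaranteed to be a tree:

```python
import numpy as np

words = ["ROOT", "the", "big", "cat", "sat"]
n = len(words)

# scores[d, h] = score that word d is a dependent of head h.
# Hand-set here purely for illustration, to match the lecture's example.
scores = np.full((n, n), -np.inf)
for d in range(1, n):            # ROOT (index 0) is never a dependent
    for h in range(n):
        if h != d:
            scores[d, h] = 0.0   # baseline score for every candidate arc
scores[1, 3] = 5.0               # "the" <- "cat"
scores[2, 3] = 6.0               # "big" <- "cat"
scores[3, 4] = 7.0               # "cat" <- "sat"
scores[4, 0] = 8.0               # "sat" <- ROOT

# Greedy decoding: each word independently picks its best-scoring head.
# (A real parser runs an MST algorithm over these same n^2 scores, so the
# chosen arcs cannot contain cycles or leave the sentence disconnected.)
heads = {words[d]: words[int(np.argmax(scores[d]))] for d in range(1, n)}
print(heads)  # {'the': 'cat', 'big': 'cat', 'cat': 'sat', 'sat': 'ROOT'}
```

Greedy decoding happens to produce a tree on this toy example, but in general it can produce cycles, which is exactly why the minimum spanning tree step is needed.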
[01:18:40] Okay, so now you hopefully know everything about syntactic structure, constituency and dependency parsing, and are fully qualified to do assignment two, so good luck with that. Thanks.

================================================================================
LECTURE 005
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
Source: https://www.youtube.com/watch?v=fyc0Jzr74y4
---
Transcript

[00:00:06] Okay, let me get started for today. For today, first of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including a couple of the concepts that turn up in assignment two. Then the bulk of today is going to be moving on to introducing what language models are, and after introducing language models, we're going to introduce a new kind of neural network, which is one way to build language models: recurrent neural networks. They're an important thing to know about, and we use them in assignment three.
[00:00:52] But they're certainly not the only way to build language models; in fact, probably a lot of you already know that there's this other kind of neural network called Transformers, and we'll get on to those after we've done recurrent neural nets. Then I'll talk a bit about problems with recurrent neural networks, and, if I have time, I'll get onto the recap.

[00:01:15] Before getting into the content of the class, I thought I could just spend a minute giving you the stats of who is in CS224N. Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess. So the four big groups are the computer science undergrads, the computer science grads, the undeclared undergraduates, and the NDO grads — that's a large portion of the SCPD students, though some of them are under computer science grads. So that makes up about 60% of the audience, and if you're not in one of those four big groups, you're in the other 40%, and everybody is somewhere. There are lots of other interesting groups down here. The bright orange down here, that's where the math and physics PhDs are. And up here — interestingly, we now have more statistics grad students than there are Symbolic Systems undergrads; it didn't used to be that way around in NLP classes. And one of my favorite groups, the little magenta group down here — these are the humanities undergrads. Yay, humanities undergrads! In terms of years, it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave freshmen — are any brave freshmen here today? [Laughter] Yeah, okay, welcome.
[00:02:58] So, modern neural networks, especially language models, are enormous. This chart's sort of out of date because it only goes up to 2022, but it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate, they're clearly huge models, which have over 100 billion parameters. And so large — and then deep, in terms of very many layers — neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models, but to start off with something simpler, I did just want to key you in for a few minutes to a little bit of history.

[00:04:01] So the last time neural nets were popular was in the 80s and 90s, and that was when people worked out the backpropagation algorithm — Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at — and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer: you had the input, the hidden layer, and the output, and that's all there was. And the reason for that was that for a very, very long time, people couldn't really get things to work with more hidden layers. That only started to change in the resurgence of what often got called deep learning — but anyway, back to neural nets — which started around 2006. And this was one of the influential papers at the time, "Greedy Layer-Wise Training of Deep Networks" by Yoshua Bengio and colleagues, and right at the beginning of that paper they observed the problem:
[00:05:13] "However, until recently it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers" — Gerry Tesauro, cited there, actually worked very early on applying neural networks — "as this is a negative result, it has not been much reported in the machine learning literature." So really, although people had neural networks and backpropagation, and recurrent neural networks, which we're going to talk about today, for a very long period of time — 15 years or so — things seemed completely stuck: although in theory it seemed like deep neural networks should be promising, in practice they didn't work. And so it really then took some new developments that happened in the late 2000s decade, and then more profoundly in the 2010s decade, to actually figure out how we could have deep neural networks that actually worked — working far better than the shallow neural networks, and leading into the networks that we have today. And we're going to be starting to talk about some of those things in this class and in coming classes.

[00:06:46] And I think the tendency, when you see the things that got neural networks to work much better — the natural reaction is to sort of shrug and be underwhelmed and think, oh, is this all there is to it? This doesn't exactly seem like difficult science. And in some sense that's true: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks turned things around from a field that was sort of stuck for 15 years.
[00:07:32] The field had been going nowhere for 15 years, and nearly everyone had abandoned it because of that; then things suddenly turned around, and there was the ability to train these deeper neural networks, which behaved amazingly better as machine learning systems than the things that had preceded them and dominated in the intervening time. So that took a lot of time. So what are these things? One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. Regularization is the idea that, beyond just having a loss that we want to minimize in terms of describing the data, we want to in some other ways manipulate what parameters we learn, so that our models work better. And so normally we have some more complex loss function that does some regularization. The most common way of doing this is what's called L2 loss, where you add on a parameter-squared term at the end, and this regularization says it would be kind of good to find a model with small parameter weights — you should be finding the smallest parameter weights that will explain your data well.

[00:09:01] There's a lot you can say about regularization; these kinds of losses get talked about a lot more in other classes, like CS229 Machine Learning and the machine learning theory class, so I'm not going to say very much about it. But I do just want to put in one note that's very relevant to what's happened in recent neural networks work. So the classic view of regularization was that we needed this kind of regularization to prevent our networks from overfitting — meaning that they would do a very good job of modeling the training data, but then they would generalize badly to new data they were shown.
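To make the L2 idea concrete, here is a minimal sketch, assuming a mean-squared-error data loss as a stand-in; the function name and `lam` (the regularization strength λ) are our own labels, not anything from the lecture:

```python
import numpy as np

def l2_regularized_loss(theta, X, y, lam):
    """Data loss (mean squared error here) plus lam * sum(theta_k^2)."""
    data_loss = np.mean((X @ theta - y) ** 2)
    reg = lam * np.sum(theta ** 2)   # the added parameter-squared term
    return data_loss + reg

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
theta = np.array([0.1, -0.2, 0.3])

# The penalty depends only on the weights, so turning lambda on adds
# exactly lam * sum(theta^2) on top of the plain data loss:
plain = l2_regularized_loss(theta, X, y, lam=0.0)
penalised = l2_regularized_loss(theta, X, y, lam=0.01)
print(penalised - plain)  # ~0.0014 = 0.01 * (0.01 + 0.04 + 0.09)
```

Because the penalty grows with the squared weights, gradient descent on this combined loss is pulled toward the smallest weights that still explain the data well, which is exactly the behavior described above.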
[00:09:54] So the picture that you got shown was this: as you train on some training data, your error necessarily goes down. However, after some point, you start learning specific properties of things that happen to turn up in those training examples — you're learning things that are only good for the training examples, so they won't generalize well to the different pieces of data you see at test time. So if you have a separate validation set, or a final test set, and you traced out the error or loss on that validation or test set, after some point it would start to go up again — there's a quirk in my bad PowerPoint; it's just meant to go up. And the fact that it goes up means you have overfit your training data, and making the parameters numerically small is meant to lessen the extent to which you overfit on your training data.

[00:11:04] This is not a picture that modern neural network people believe at all. Instead, the picture has changed like this: we don't believe that overfitting exists anymore, but what we are concerned about is models that will generalize well to different data. In classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate those parameters well, and so you'd just have all of this noisy mess. But what's actually been found is that, yes, it's true you can't estimate the numbers well, but what you get is a kind of interesting averaging function from all these myriad numbers. And if you do it right, what happens is that as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only will your training loss continue to go down, very infinitesimally, but your validation loss will go down as well.

[00:12:28] And so, on huge networks these days, we train our models so that they overfit to the training data almost completely. If you train a huge network now on a training set, you can essentially train it to get zero loss — maybe it's 0.007 loss or something, but essentially zero — because you've got such rich models that you can perfectly fit, memorize, the entire training set. Now, classically, that would have been seen as a disaster, because you've overfit the training data; with modern large neural networks, it's not seen as a disaster, because, providing you've done regularization well, your model will also generalize well to different data.
[00:13:22] However, the flip side of that is that normally this kind of L2 regularization, or similar ones like L1 regularization, isn't strong enough to achieve that effect, and so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout. This is one of the things that's on the assignment, and at this point I should apologize or something, because the way dropout is presented here is sort of the original formulation, while the way dropout is presented on the assignment is the way it's now normally done in deep learning packages. So there are a couple of details that vary a bit; let me just present the main idea here and not worry too much about the details of the math.

[00:14:15] The idea of dropout is: at training time, every time you are doing a piece of training with an example, inside the middle layers of the neural network you're just going to throw away some of the inputs. Technically, the way you do this is that you have a random mask of zeros and ones that you sample each time; you take the Hadamard product of that with the data, so some of the data items go to zero, and you have a different mask each time — so for the next example, I've now masked out something different. So you're just randomly throwing away the inputs, and the effect of this is that you're training the model so that it has to be robust, work well, and make as much use of every input as it can. It can't decide to be extremely reliant on, say, component 17 of the vector, because sometimes that's just going to randomly disappear; if there are other features you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time you randomly delete things; at test time — sort of for efficiency, but also for quality of the answer — you don't delete anything. You keep all of your weights, but you rescale things to make up for the fact that you used to be dropping things.

[00:15:52] Okay, so there are several ways you can think of explaining this. One motivation that's often given is that this prevents feature co-adaptation: rather than the model being able to learn complex functions like "features 7, 8, and 11 together help me predict this", it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is that there's been a lot of work on model ensembles, where you can mix together different models and improve your results.
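Here is a minimal sketch of the mask-and-rescale mechanics just described. It implements the "inverted" variant that the lecture says modern packages (and the assignment) use — rescale by 1/(1−p) at training time so that test time is a no-op — rather than the original formulation, which rescales at test time instead; the function signature is our own:

```python
import numpy as np

def dropout(h, p_drop, train, rng):
    """Inverted dropout on a hidden-layer activation vector h."""
    if not train:
        return h                        # test time: keep everything
    # Fresh random 0/1 mask each call; the Hadamard product zeroes out
    # each unit independently with probability p_drop.
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)    # rescale so the expected value matches

rng = np.random.default_rng(42)
h = np.ones(10)
out = dropout(h, p_drop=0.5, train=True, rng=rng)
print(out)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Because a fresh mask is sampled on every call, each training example effectively trains a different thinned sub-network, which is the ensemble-of-sub-networks view discussed next.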
results if you're training with Dropout it's kind of like you're [00:16:33] with Dropout it's kind of like you're training with a huge model Ensemble [00:16:35] training with a huge model Ensemble because you're training with the [00:16:37] because you're training with the Ensemble of the power set the [00:16:39] Ensemble of the power set the exponential number of every possible [00:16:42] exponential number of every possible Dropout of features all at once and that [00:16:45] Dropout of features all at once and that gives you a a very good model um so [00:16:48] gives you a a very good model um so there are different ways of thinking [00:16:50] there are different ways of thinking about it I mean if you've seen na bays [00:16:53] about it I mean if you've seen na bays and logistic regression models before [00:16:56] and logistic regression models before you know I kind of think a nice way to [00:16:58] you know I kind of think a nice way to think of it is that it gives a sort of a [00:17:00] think of it is that it gives a sort of a middle ground between the two because [00:17:02] middle ground between the two because for naive based models you're waiting [00:17:04] for naive based models you're waiting each feature independently just based on [00:17:07] each feature independently just based on the data statistics doesn't matter what [00:17:09] the data statistics doesn't matter what other features are there in a logistic [00:17:11] other features are there in a logistic regression weights are set in the [00:17:13] regression weights are set in the context of all the other features and [00:17:17] context of all the other features and with Dropout you're somewhere in between [00:17:19] with Dropout you're somewhere in between you're seeing the weights in the context [00:17:20] you're seeing the weights in the context of some of the other features but [00:17:22] of some of the other features but different ones will disappear at [00:17:23] different ones will 
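The mask-and-rescale procedure described above can be sketched in NumPy (a minimal illustration, not the course's assignment code; `p_drop`, the drop probability, is an assumed name):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    # Training time: sample a fresh 0/1 mask and take the elementwise
    # (Hadamard) product with the activations, zeroing some of them.
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask

def dropout_test(h, p_drop=0.5):
    # Test time: keep every activation, but rescale by the keep
    # probability so the expected magnitude matches training.
    return h * (1.0 - p_drop)

h = np.ones(8)
print(dropout_train(h))  # some entries zeroed; different on each call
print(dropout_test(h))   # all entries kept, each scaled to 0.5
```

Modern libraries usually implement "inverted" dropout instead, scaling by 1/(1 − p) at training time so that test time needs no change, but the scheme above is the one described here.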
[00:17:26] But following work that was done at Stanford by Stefan Wager and others, these days people generally regard dropout as a form of feature-dependent regularization, and he shows some theoretical results as to why to think of it that way.

[00:17:43] Okay, I think we've implicitly seen this one, but vectorization is the idea: no for-loops; always use vectors, matrices, and tensors. The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices, and tensors. If you're writing for-loops in any language, but especially in Python, things run really slowly; if you can do things with vectors and matrices, even on CPU, things run at least an order of magnitude faster. And what everyone really wants to do in deep learning is run things on GPUs, or sometimes now neural processing units.
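The speed gap being described is easy to see for yourself; here is a rough, machine-dependent comparison (the timings are illustrative, not from the lecture):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# Slow path: one Python-level iteration per element.
t0 = time.perf_counter()
total = 0.0
for i in range(len(x)):
    total += x[i] * y[i]
t_loop = time.perf_counter() - t0

# Fast path: a single vectorized call into optimized native code.
t0 = time.perf_counter()
total_vec = x @ y
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.5f}s")
```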
Then you're getting two or three orders of magnitude of speedup. [00:18:38] So do always think: I should be doing things with vectors and matrices. If I'm writing a for-loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake, and I should be working out how to do things with vectors and matrices. It's the same for something like dropout: you don't want to write a for-loop that goes through all the positions and sets some of them to zero; you want to use a vector operation with your mask.

[00:19:14] Two more, I think. Parameter initialization: this one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to some random numbers, and the reason for this is
that if we just start with our matrices all zero, or some other constant, we normally have symmetry. It's sort of like starting at the saddle point in this picture: it's symmetric left and right, and forward and backward, so you don't know which way to go and you might just get stuck in one place. [00:20:15] Normally, the way to think about it is that the operations you're applying to all the elements in the matrix are the same, so rather than having a whole vector of features, if all of them start with the same value, it's as if you only have one feature and a lot of copies of it. So to get learning started and have things work well, we almost always want to set all the weights to very small random numbers.
[00:20:49] And when I say very small, we want them in a range where they don't disappear to zero if we make them a bit smaller, and they don't start blowing up into huge numbers when we multiply them by things. Doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods, with a basis in thinking through what happens once you do matrix multiplies, that people had worked out and often used. One of these was Xavier initialization, which works out what the variance of your uniform distribution should be based on the number of inputs and outputs of a layer, and things like that. I think we still use it to initialize things in assignment two.
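A minimal sketch of the Xavier-style scheme just mentioned, assuming the common uniform variant whose bound depends on a layer's fan-in and fan-out (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Bound chosen from fan-in and fan-out so that activations neither
    # shrink toward zero nor blow up across repeated matrix multiplies.
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))

W = xavier_uniform(512, 256)  # small random values, symmetric about 0
```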
But we'll see later that these details go away, because people have come up with clever methods, in particular layer normalization, which largely obviates the need to be so careful about the initialization; you still need to initialize things to something, though.

[00:21:59] Okay, then the final one, which also appears in the second assignment and which I just want to say a word about: optimizers. We talked in class about stochastic gradient descent and did the basic equations for it, and to a first approximation there's nothing wrong with stochastic gradient descent; if you fiddle around enough, you can usually get SGD to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, on having the right step size,
and often you have to have a learning-rate schedule with decreasing step sizes, and various other complications. [00:22:46] So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem necessary to get them to learn well; at any rate, they give you a lot of margin of safety, since they're much less dependent on your setting of the hyperparameters. The idea of all the most commonly used methods is that for each parameter they accumulate a measure of what the gradient has been in the past, so they have some idea of the scale of the gradient, the slope, for that particular parameter, and they use that to decide how big a step to take at each time step. The simplest such method is called AdaGrad; if you know John Duchi,
he was one of the co-inventors of it. It's simple and nice enough, but it tends to stall early, so people came up with different methods. Adam is the one that's on assignment two, and it's a really good, safe place to start. But in a way our word vectors have a special property because of their sparsity: you update them only sparsely, because particular words turn up only occasionally, so people have come up with optimizers that have special properties for things like word vectors, and the ones with a "W" at the end can sometimes be good to try. And then there's a whole family of extra ideas that people have used to improve optimizers; if you want to learn about those, you can go off and do an optimization class like convex optimization, where there are ideas like momentum and Nesterov acceleration.
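The per-parameter idea described here, accumulating a history of gradient magnitudes and scaling each parameter's step by it, is the core of AdaGrad; below is a toy sketch on a simple quadratic (this is not the assignment's Adam implementation, which adds momentum-style averaging on top):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients per parameter, then scale each
    # parameter's step by the inverse square root of its own history.
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Minimize f(w) = sum(w^2), whose gradient is 2w.
w = np.array([1.0, -2.0])
accum = np.zeros_like(w)
for _ in range(200):
    w, accum = adagrad_step(w, 2 * w, accum)
print(w)  # both entries have shrunk toward zero
```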
People also variously try all of those things, but Adam is a good name to remember if you remember nothing else.

[00:24:50] Okay, that took longer than I hoped, but I'll get on now to language models. In some sense "language model" is just two English words, but in NLP when we say language models we mean it as a technical term with a particular meaning. The idea of a language model is something that can predict what word is going to come next, or, more precisely, that puts a probability distribution over what words come next. So: "the students opened their..." What words are likely to come next?

[00:25:37] Bags, laptops, notebooks... yeah, I have some of those at least. Right, so these are kind of likely words, and if on top of those we put a probability on
each one, then we have a language model. [00:26:00] So formally, we've got a context of preceding items, and we're putting a probability distribution over the next item, which means that the sum of these estimates over the items in the vocabulary will be one. If we've defined a P like this, which predicts probabilities of next words, that is called a language model.

[00:26:26] As it says here, an alternative way you can think of a language model is as a system that assigns a probability to a piece of text: a language model can take any piece of text and give it a probability. The reason we can do that is the chain rule. Say I want to know the probability of some stretch of text; given my previous definition of a language model, easy, I can do that: the probability of x1 with a null preceding context, times the probability of x2 given x1, and so on.
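The chain-rule scoring just described can be sketched with toy numbers (the conditional probabilities here are made up purely for illustration):

```python
import math

# Hypothetical next-word probabilities p(word | context).
cond_p = {
    ((), "the"): 0.5,
    (("the",), "students"): 0.2,
    (("the", "students"), "opened"): 0.3,
    (("the", "students", "opened"), "their"): 0.4,
}

def sequence_prob(words):
    # Chain rule: P(x1..xT) = product over t of P(x_t | x_1..x_{t-1}).
    p = 1.0
    for t, w in enumerate(words):
        p *= cond_p[(tuple(words[:t]), w)]
    return p

print(sequence_prob(["the", "students", "opened", "their"]))
# 0.5 * 0.2 * 0.3 * 0.4 ≈ 0.012
```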
[00:27:10] I can do this chain-rule decomposition, and then the terms of that decomposition are precisely what the language model, as I defined it previously, provides.

[00:27:22] Okay, so language models are an essential technology for NLP: just about everywhere, from the simplest places forward, where people do things with human language and computers, people use language models. In particular, they weren't something that got invented in 2022 with ChatGPT; language models have been central to NLP at least since the 80s, and the idea of them goes back to at least the 50s. Any time you're typing on your phone and it's making suggestions of next words, regardless of whether you like those suggestions or not, those suggestions are being generated by a language model, traditionally a compact, not very good language model, so it
can run quickly and with very little memory in your keyboard application. If you go on Google and start typing some stuff, and it suggests things that could come after it to complete your query, well, again, that's being generated by a language model.

[00:28:34] So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling; this is roughly how language models were built from 1975 until, effectively, around about 2012. [00:28:58] We want to put probabilities on these sequences, and the way we're going to do it is to build what's called an n-gram language model. This means we're going to look at short word subsequences and use them to predict, where n is a variable describing how short the word
sequences are that we're going to use to predict. If we just look at probabilities of individual words, we have a unigram language model; probabilities of pairs of words, a bigram language model; probabilities of three words, trigram language models; and probabilities of more than three words get called 4-gram, 5-gram, 6-gram language models. [00:29:48] For people with a Classics education this is horrific, of course; in particular, not even the first ones are correct, because "gram" is a Greek root, so it should really have Greek numbers in front: you should have monograms and digrams. And actually, the first person who introduced the idea of n-gram models was Claude Shannon, when he was working out information theory, the same guy who did cross-entropy and all of that, and if you look at his 1951 paper, he uses
digrams. But the idea died about there, and this is what everyone says in practice; it's kind of cute, I like it, a nice practical notation. [00:30:36] So to build these models, the idea is: we're just going to count how often different n-grams appear in text and use those counts to build our probability estimates. In particular, our trick is that we make a Markov assumption, so that if we're predicting the next word based on a long context, we say: tell you what, we're not going to use all of it; we're only going to use the most recent n − 1 words. We have this big context and we throw most of it away. And if we're predicting word x_(t+1) based simply on the preceding n − 1 words, then we can make the prediction using n-grams: whatever it is, if we use n = 3, we'd have a trigram count up here, normalized by a bigram count down
here, and that would give us relative frequencies of the different continuations. [00:31:49] So we can do that simply by counting how often n-grams occur in a large amount of text and dividing through by the counts, and that gives us a relative-frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it.

[00:32:12] Okay, so suppose we're learning a 4-gram language model, and we've got a piece of text: "as the proctor started the clock, the students opened their ___". Well, to estimate things, we're going to throw away all but the preceding three words, so we're going to estimate based on "students opened their", and we're going to work out the probabilities by looking at counts of "students opened their w" and counts of "students opened their". So we might have in a corpus that "students opened their"
occurred a thousand times, and "students opened their books" occurred 400 times, so we'd say the probability estimate for "books" is simply 0.4; if "exams" occurred 100 times, the probability estimate for "exams" is 0.1.

[00:33:08] And you can sort of see that this is bad, though it's not terrible: if you're going to predict the next word in a simple way, the immediately prior words are the most helpful ones to look at. But it's clearly primitive, because if you'd known that the prior text was "as the proctor started the clock", that makes it sound likely that the word should have been "exams", whereas since you're estimating just based on "students opened their", you'd be more likely to choose "books", because it's more common. So it's a kind of crude estimate, but it's a decent enough place to start, and it's a crude estimate that could be problematic in other ways, too.
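The worked example above, as code; the counts are the hypothetical ones from the lecture:

```python
from collections import Counter

count4 = Counter({
    ("students", "opened", "their", "books"): 400,
    ("students", "opened", "their", "exams"): 100,
})
count3 = Counter({("students", "opened", "their"): 1000})

def p_next(context, word):
    # Relative-frequency estimate: count(context + word) / count(context).
    return count4[context + (word,)] / count3[context]

ctx = ("students", "opened", "their")
print(p_next(ctx, "books"))  # 0.4
print(p_next(ctx, "exams"))  # 0.1
print(p_next(ctx, "frogs"))  # 0.0 -- the unseen-continuation problem
```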
I mean, why else might we get into trouble using this probability estimate? [00:34:05] "There are a lot of n-grams." Yeah, there are a lot of words, and therefore there are a lot of n-grams; that's a problem we'll come to later. Anything else? Maybe up the back. "The word w might not even show up in the training data, so you might just have a count of zero for that." Yeah: if we're counting over any reasonable-size corpus, there are lots of words that we're just not going to have seen, that never happened to occur in the text we counted over. If you start thinking about "students opened their ___", there are lots of things you could put there: "students opened their accounts", or, if the students are doing dissections in a biology class, maybe "students opened their frogs", I don't know. There are lots of words
know that there are lots of words that in some context you know would [00:35:03] that in some context you know would actually be possible and lots of them [00:35:06] actually be possible and lots of them that we won't have seen and so it give [00:35:08] that we won't have seen and so it give them a probability estimate of zero and [00:35:11] them a probability estimate of zero and that tends to be an especially bad thing [00:35:13] that tends to be an especially bad thing to do with probabilities because once we [00:35:15] to do with probabilities because once we have a probability estimate of zero any [00:35:17] have a probability estimate of zero any computations that we do that involve [00:35:19] computations that we do that involve that will instantly go to zero so we [00:35:22] that will instantly go to zero so we have to deal with some of these problems [00:35:24] have to deal with some of these problems so for that sparity problem right yeah [00:35:28] so for that sparity problem right yeah that we could have the word never [00:35:31] that we could have the word never occurred in the numerator and so simply [00:35:35] occurred in the numerator and so simply done we get a probability estimate of [00:35:38] done we get a probability estimate of zero the way that was dealt with was [00:35:41] zero the way that was dealt with was that people just hacked the counts a [00:35:43] that people just hacked the counts a little to make it non zero so there are [00:35:45] little to make it non zero so there are lots of ways that are explored but the [00:35:47] lots of ways that are explored but the easiest way is you just sort of added a [00:35:50] easiest way is you just sort of added a little Delta like you know 0.25 to [00:35:54] little Delta like you know 0.25 to counts so things that you never saw got [00:35:56] counts so things that you never saw got a count of 0 .25 in total and things you [00:36:00] a count of 0 .25 in total and things you saw once got to count 
and things you saw once got a count of 1.25, and then there are no zeros anymore; everything is possible. [00:36:08] Then you could think there's a second problem: wait, you might never have seen "stupid students opened their" before, and so your denominator is just undefined, and you don't have any counts in the numerator either. So you need to do something different there, and the standard trick that was used was back-off: if you couldn't estimate words coming after "students opened their", you just worked out the estimates for words coming after "opened their", and if you couldn't estimate that, you just used the estimate of words coming after "their". So you used less and less context until you could get an estimate that you could use. [00:36:52] But something to note is that we've now got conflicting pressures.
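Both fixes can be sketched together. A minimal illustration with made-up toy counts, a tiny vocabulary, and delta = 0.25 (real systems used more careful schemes such as Katz back-off, which this simplifies):

```python
from collections import Counter

delta = 0.25                      # the "little delta" added to every count
vocab = ["books", "exams", "frogs", "accounts"]

# Toy counts of words observed after the context ("opened", "their")
tables = {("opened", "their"): Counter({"books": 4, "exams": 1})}

def smoothed_prob(word, counts):
    """Add-delta smoothing: every vocabulary word gets delta added to its count."""
    total = sum(counts.values()) + delta * len(vocab)
    return (counts[word] + delta) / total

def backoff_prob(word, context):
    """Back off to shorter and shorter contexts until one was actually seen."""
    while context and context not in tables:
        context = context[1:]     # drop the earliest conditioning word
    return smoothed_prob(word, tables.get(context, Counter()))

# "stupid students opened their" was never seen, so we back off to
# ("students", "opened", "their"), then to ("opened", "their"), which was seen.
p = backoff_prob("frogs", ("stupid", "students", "opened", "their"))
print(p)  # 0.25 / 6: small, but no longer zero
```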
[00:37:05] On the one hand, if you want to come up with a better estimate, you'd like to use more context, i.e., to have a larger n-gram. But on the other hand, as you use more and more conditioning words, the storage-size problem someone mentioned gets worse and worse, because the number of n-grams you have to know about goes up exponentially with the size of the context; and your sparsity problems also get way, way worse, and you're almost necessarily going to end up seeing zeros. Because of that, in practice things tended to max out at five: occasionally people used 6-grams and 7-grams, but most of the time, between the sparsity and the cost of storage, 5-grams were the largest thing people dealt with. [00:38:00] A famous resource from back in the 2000s decade that Google released was Google N-grams, which was built on a trillion-word web corpus, and it gave counts of n-grams up to n = 5, and that's where they stopped. [00:38:23] Okay, so we've sort of stated the storage problem: to do this, you need to store these counts, and the number of counts goes up exponentially in the context size. But you know what's good about n-gram language models? They're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend: all you have to do is start storing these counts for n-grams, and you can use them to predict things. At least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Well, you do have to write the software first; okay, a few minutes to write the
software, but building the model takes seconds, because there's no training of a neural network: all you do is count how often n-grams occur. [00:39:23] And once you've done that, you can run your n-gram language model to generate text; we could do text generation before ChatGPT. So if I have a trigram language model, I can start off with some words, "today the", and I can look at my stored n-grams and get a probability distribution over next words, and here they are. Note the strong patterning of these probabilities: remember, they're all derived from counts that are being normalized, so really these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context. So they're in some sense crude when you look at them more carefully. [00:40:14] But what we can do at this point is roll a die, get a random number between 0 and 1, and use it to sample from this distribution. So if we generate as our random number something like 0.35, and we go down from the top, we'd say okay, we've sampled the word "price": "today the price". Then we repeat: we condition on that, get a probability distribution for the next word, generate a random number, and use it to sample from the distribution; say we generate 0.2, and so we choose "of". We now condition on that, get a probability distribution, generate a random number, which is 0.5 or something, and so we get "gold" coming out, and we can say "today the price of gold". And we can keep on doing this and generate some text. [00:41:22] So here's some text generated from 2 million words of training data using a trigram language model.
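The whole pipeline, counting trigrams and then repeatedly rolling the die, fits in a few lines. A toy sketch with a made-up dozen-word corpus standing in for the lecture's 2 million words:

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for the lecture's 2-million-word news corpus.
corpus = ("today the price of gold rose while today the price of oil "
          "fell and today the bank intervened").split()

# Count trigrams: map each two-word context to counts of the word that follows.
trigrams = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    trigrams[(a, b)][c] += 1

def sample_next(context, rng):
    """Roll a die in [0, 1) and sample from the normalized counts."""
    counts = trigrams.get(context)
    if not counts:
        return None                        # context never seen: a dead end
    r = rng.random() * sum(counts.values())
    for word, count in counts.items():
        r -= count
        if r <= 0:
            return word
    return word                            # guard for floating-point edge cases

rng = random.Random(0)
text = ["today", "the"]
for _ in range(5):
    word = sample_next((text[-2], text[-1]), rng)
    if word is None:
        break
    text.append(word)
print(" ".join(text))
```

Every generated word continues an observed trigram, which is why the output is locally plausible but drifts globally, just like the lecture's sample.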
So, the generated text: "today the price of gold per ton while production of shoe lasts and shoe industry the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks september 3rd in primary 76 cents a share". [00:41:50] Now, okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad, right? It's sort of surprisingly grammatical. In particular, I lowercased everything, so this "imf" that should be capitalized is the IMF, the International Monetary Fund. There are big pieces of this that even make sense: "the bank intervened just after it considered and rejected an IMF demand", that's pretty much making sense as a piece of text. So it's mostly grammatical; it looks like English text. [00:42:37] But it makes no sense; it's really incoherent. So there's work to do. But you could already see that, even with these simple n-gram models, you could from a very low level kind of approach, from below, how text and human language work. And I could easily make this better, even with the n-gram language model: rather than two million words of text, if I trained on 10 million words of text, it would be better; if, rather than a trigram model, I went to a 4-gram model, it would get better; and you'd start getting better and better approximations of text. [00:43:20] And this is essentially what people did until about 2012. And really, the same story that people tell today, that scale will solve everything, is exactly the same story that people used to tell in the early 2010s.
[00:43:43] With these n-gram language models, if you weren't getting good enough results with your 10 million words of text and a trigram language model, the answer was that with 100 million words of text and a 4-gram language model you'd do better; and then with a trillion words of text and a 5-gram language model you'd do better; and gee, wouldn't it be good if we could collect 10 trillion words of text, so we could train an even better n-gram language model. Same strategy. [00:44:10] But it turns out that sometimes you can do better with better models as well as simply with scale, and so things got reinvented and started again with building neural language models. So how can we build a neural language model? Well, we've got the same task: we have a sequence of words, and we want to put a probability estimate over what word comes next.
[00:44:43] And the simplest way you could do that, which hopefully you'll all have thought of because it connects to what we did in earlier classes: look, we already had this idea that we could represent context by the concatenation of some word vectors, and we could put that into a neural network and use it to predict something. In the example I did in the last couple of classes, what we used it to predict was whether the center word is a location or not, just a binary choice. But that's not the only thing we could predict. We could have predicted lots of things with this neural network: whether the piece of text was positive or negative, whether it was written in English or Japanese. So one thing we could choose to predict is what word is going to come next after this window of text. [00:45:39] We'd have a model just like that one, except that up the top, instead of doing binary classification, we do a many-way classification over what the next word to appear in the piece of text is. And that would give us a neural language model; in particular, it gives us a fixed-window neural language model. We do the same Markov-assumption trick of throwing away the further-back context, and for the fixed window we use word embeddings, which we concatenate; we put that through a hidden layer; and then we take the output of that hidden layer, multiply it by another matrix, say, and put that through a softmax to get an output distribution. And so this gives us a fixed-window neural language model.
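As a minimal numpy sketch of that forward pass (the sizes and the untrained random weights are all made up for illustration; a real model would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, window = 50, 8, 16, 4    # vocab, embedding dim, hidden dim, window size

E = rng.normal(size=(V, d))           # word embedding matrix (one row per word)
W = rng.normal(size=(h, window * d))  # hidden-layer weights over the window
b1 = np.zeros(h)
U = rng.normal(size=(V, h))           # output-layer weights
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fixed_window_lm(word_ids):
    """Concatenate the window's embeddings, apply one hidden layer, then softmax."""
    x = np.concatenate([E[i] for i in word_ids])   # shape (window * d,)
    hidden = np.tanh(W @ x + b1)
    return softmax(U @ hidden + b2)                # distribution over the next word

p = fixed_window_lm([3, 14, 15, 9])   # e.g. ids for "the students opened their"
print(p.shape, round(float(p.sum()), 6))
```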
[00:46:49] Apart from the fact that we're now doing a classification over many, many classes, this is exactly like what we did last week, so it should look kind of familiar; it's also kind of like what you're doing for assignment two. And this is essentially the first kind of neural language model that was proposed. [00:47:06] In particular, Yoshua Bengio, really right at the beginning of the 21st century, suggested that you could do this: rather than using an n-gram language model, you could use a fixed-window neural language model. And even at that point, he and colleagues were able to get some positive results from this model. But at the time it wasn't widely noticed; it didn't really take off that much, for a combination of reasons. With only a fixed window, it was not that different from n-grams in some sense, even though it could be argued that the neural network gives better generalization than using counts. And in practice, neural nets were still hard to run without GPUs, and people felt, and I think in general this was the case, that you could get more oomph by doing the scale story and collecting your n-gram counts on hundreds of billions of words of text, rather than trying to make a neural network out of it. So it didn't especially take off at that time. [00:48:21] But in principle it seemed a nice thing: it got rid of the sparsity problem, and it got rid of the storage costs, since you no longer have to store all observed n-grams, you just have to store the parameters of your neural network. But it didn't solve all the problems we'd like to solve. In particular, we still have the problem of the Markov assumption: we're just using a small fixed context beforehand to predict from.
[00:48:53] And there are some disadvantages to enlarging that window; there's no fixed window that's ever big enough. There's another thing that, if you look technically at this model, might make you suspicious of it: when we have words in different positions, those words will be treated by completely different sub-parts of this matrix W. You might think that, for predicting that "books" comes next, the fact that this is a student is important, but it doesn't matter so much exactly where the word "student" occurs. The context could have been "the students slowly opened their", and it's still the same students; we've just got a bit different linguistic structure. Yet this W matrix would be using completely separate parameters to learn stuff about "student" here versus "student" in this position. So that seems kind of inefficient and wrong. [00:50:06] And so that suggested that we need a different kind of neural architecture, one that can process any length of input and can use the same parameters to say: hey, I saw the word "student"; that's evidence that things like "books", "exams", "homework" will be turning up, regardless of where it occurs. And so that then led to the exploration of a different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do, is everyone basically okay with what a language model is? Yeah? No questions? Okay. [00:50:51] Recurrent neural networks. So recurrent neural networks are a different family of neural networks. Effectively, in this class we see several neural network architectures. In some sense, the first architecture we saw was word2vec, which is a sort of very simple encoder-decoder architecture. The second family we saw was feed-forward networks, or fully-connected-layer classic neural networks. And the third family we're going to see is recurrent neural networks, which have different kinds, and then we'll go on to Transformer models. [00:51:41] Okay, so the idea of a recurrent neural network is that you've got one set of weights that is going to be applied through successive moments in time, i.e., successive positions in the text, and as you do that, you're going to update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it. We've got "the students opened their", and we want to predict with that. Okay, I've still got four words in my example, so I can put everything down the left side of the slide, but there could have
been 24 words: recurrent neural networks can deal with any length of context. [00:52:30] Okay, so as before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before. But now, to compute probabilities for the next word, we're going to do something different: our hidden layer is going to be recurrent. By recurrent, I mean we're going to change a hidden state at each time step as we proceed through the text from left to right. So we're going to start off with h0, the initial hidden state, which can actually just be all zeros. And then at each time step, what we're going to do is: we multiply the previous hidden state by a weight matrix, we take a word embedding and multiply it by a weight matrix, and then we sum the results of those two things, and that's going to give us a new hidden state. [00:53:33] That hidden state will then store a memory of everything that's been seen so far. So we'll do that, and then we'll continue along: we multiply the next word vector by the same weight matrix We, we multiply the previous hidden state by the same weight matrix Wh, and we add them together and get a new representation. [00:54:04] I've only said part of it; I've left out a bit: commonly there are two other things you're doing. You're adding on a bias term, because we usually separate out a bias term, and you're putting things through a nonlinearity, so I should make sure I mention that. And for recurrent neural networks, this nonlinearity has most commonly been the tanh function, so it's balanced on the positive and negative sides. And so you keep on doing that through each step.
idea [00:54:32] that through each step and so the idea is once we've got to here this H4 hidden [00:54:36] is once we've got to here this H4 hidden state is a hidden state that in some [00:54:38] state is a hidden state that in some sense has read the text up until now [00:54:41] sense has read the text up until now it's seen all of the students open there [00:54:44] it's seen all of the students open there and if the word students occurred in any [00:54:47] and if the word students occurred in any of these positions it will have been [00:54:49] of these positions it will have been multiplied by the same we Matrix and [00:54:53] multiplied by the same we Matrix and added into the hidden state so it's kind [00:54:55] added into the hidden state so it's kind of got a cleaner [00:54:56] of got a cleaner low parameter way of incorporating in [00:54:59] low parameter way of incorporating in the information that seen so now I want [00:55:02] the information that seen so now I want to predict the next word and to predict [00:55:05] to predict the next word and to predict the next word I'm then going to do based [00:55:08] the next word I'm then going to do based on the final hidden State the same thing [00:55:11] on the final hidden State the same thing I did kind of thing I did before so I'm [00:55:14] I did kind of thing I did before so I'm going to multiply that hidden state by [00:55:17] going to multiply that hidden state by matrix and add another bias and stick [00:55:19] matrix and add another bias and stick that through a soft Max and use that to [00:55:24] that through a soft Max and use that to um sample from that soft Max well the [00:55:26] um sample from that soft Max well the softmax will give me a language model of [00:55:28] softmax will give me a language model of probability over all next words and I [00:55:31] probability over all next words and I can sample from it to generate the next [00:55:36] word that make [00:55:38] word that make sense okay 
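The recurrent step and the softmax prediction just described can be sketched as follows. This is only a toy illustration, not the lecture's actual code: the vocabulary size, dimensions, and random weights are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): vocabulary of 10 words,
# 8-dimensional embeddings, 16-dimensional hidden state.
V, d, h = 10, 8, 16

E   = rng.normal(0, 0.1, (V, d))   # word embedding table
W_e = rng.normal(0, 0.1, (h, d))   # embedding -> hidden weights (We)
W_h = rng.normal(0, 0.1, (h, h))   # hidden -> hidden weights (Wh)
b1  = np.zeros(h)                  # hidden bias
U   = rng.normal(0, 0.1, (V, h))   # hidden -> vocabulary logits
b2  = np.zeros(V)                  # output bias

def rnn_step(h_prev, word_id):
    """One recurrent step: same Wh and We at every position, tanh nonlinearity."""
    e_t = E[word_id]                               # look up the word embedding
    return np.tanh(W_h @ h_prev + W_e @ e_t + b1)  # new hidden state

def next_word_distribution(h_t):
    """Softmax over the vocabulary from the final hidden state."""
    logits = U @ h_t + b2
    exp = np.exp(logits - logits.max())            # subtract max for stability
    return exp / exp.sum()

h_t = np.zeros(h)                   # h0: the initial hidden state, all zeros
for w in [3, 1, 4, 1]:              # toy ids standing in for "the students opened their"
    h_t = rnn_step(h_t, w)

p = next_word_distribution(h_t)     # probability over all V possible next words
```

The key point the code makes concrete is that the same `W_h` and `W_e` are reused at every time step; only the hidden state changes.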
[00:55:42] Okay, so for recurrent neural networks, we can now process any length of preceding context; we just put more and more stuff into our hidden state. So our computation can use information from many steps back. Our model size doesn't increase for having a long context: we have to do more computation for a long context, but our representation of that long context remains this fixed-size hidden vector h, of whatever dimension it is, so there's no exponential blowup anymore. And the same weights are applied at every time step, so there's a symmetry in how inputs are processed. But there are some catches. [00:56:42] The biggest catch in practice is that recurrent computation is slow. For the feed-forward layer, we just had our input vector; we multiply it by a matrix, multiply it by a matrix, however many times, and then at the end we're done. Whereas here we're stuck with this sequentiality: you have to compute one hidden vector at a time. In fact, this is going against what I said at the beginning of class, because essentially here you're doing a for-loop: you're going through for time equals 1 to T, generating each hidden vector in turn, and that's one of the big problems with RNNs that has led them to fall out of favor. [00:57:26] There's another problem that we'll look at more: in theory this is perfect, you're just incorporating all of the past context into your hidden vector. In practice it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words that you saw recently that dominate the hidden state.
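The sequentiality complaint above can be made concrete with a small sketch (toy sizes and random weights are my assumptions): a fully connected layer can process every position in one matrix multiply, while the recurrent layer is forced into an explicit for-loop because step t needs the hidden state from step t-1.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4                         # 6 time steps, 4-dim vectors (toy sizes)
X   = rng.normal(size=(T, d))       # an already-embedded input sequence
W_x = rng.normal(0, 0.5, (d, d))
W_h = rng.normal(0, 0.5, (d, d))

# Feed-forward / fully connected: one matmul covers all T positions at once,
# fully parallel over the sequence.
ff_out = np.tanh(X @ W_x.T)         # shape (T, d)

# Recurrent: an explicit for-loop over time; step t cannot start until
# h_{t-1} exists, so the T steps are inherently sequential.
h = np.zeros(d)
hs = []
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ X[t])
    hs.append(h)
hs = np.stack(hs)                   # shape (T, d)
```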
[00:58:00] Now, in some sense that's right, because the recent stuff is the most important stuff, freshest in your mind. It's the same with human beings: they tend to forget stuff from further back as well. But RNNs, especially in the simple form I've just explained, forget stuff from further back rather too quickly, and we'll come back to that again in Thursday's class. [00:58:27] Okay, so for training an RNN language model, the starting point is that we get a big corpus of text again, and then for each time step we compute a prediction of the probability of next words. Then there's an actual next word, and we use that as the basis of our loss. So our loss function is the cross-entropy between the predicted probability and the actual next word that we saw, which, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally you'd like to predict the actual next word with probability one, which means the negative log of one would be zero and there'd be no loss; but in practice, if you give it an estimate of 0.5, there's only a little bit of loss, and so on. So to get our overall objective function, we work out the average loss: the average negative log likelihood of predicting each word in turn. [00:59:44] Showing that as pictures: if our corpus is "the students opened their exams," we're first of all going to try to predict what comes after "the," and we will predict some words with different probabilities. Then we'll say, oh, the actual next word is "students"; okay, you gave that a probability of 0.05, say, because all you knew was that the first word was "the." There's a loss for that: the negative log probability given to "students." We then go on and generate the probability estimate over the next words, and then we say, well, the actual word is "opened," what probability estimate did you give to that? We get a negative log probability loss. We keep running this along, then we sum all of those losses and average them per word, and that's our average per-word loss, and we want to make that as small as possible. So that's our training mechanism. [01:00:53] It's important to note that for generating this loss we're not doing free generation; we're not saying to the model, go off and generate a sentence. What we're actually doing at each step is effectively saying: okay, the prefix is "the students opened," what probability distribution do you put on next words after that? Generate it with our recurrent neural network.
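The average per-word loss just described reduces to a couple of lines. The distributions and target indices below are made-up toy numbers, not anything from the lecture:

```python
import numpy as np

# Toy predicted distributions over a 5-word vocabulary at 4 time steps,
# and the index of the word that actually came next at each step.
probs = np.array([
    [0.05, 0.70, 0.10, 0.10, 0.05],   # p(next word | "the")
    [0.20, 0.10, 0.50, 0.10, 0.10],   # p(next word | "the students")
    [0.10, 0.10, 0.10, 0.60, 0.10],
    [0.25, 0.25, 0.25, 0.10, 0.15],
])
targets = np.array([1, 2, 3, 0])       # actual next words in the corpus

# Cross-entropy against a one-hot target is just the negative log
# probability assigned to the word that actually occurred; the objective
# is the average of these per-word losses.
per_word_loss = -np.log(probs[np.arange(len(targets)), targets])
loss = per_word_loss.mean()
```

Note how predicting the true word with probability 0.70 contributes only a small loss, while probability 1.0 would contribute exactly zero, just as described above.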
[01:01:21] Then we ask: for the actual next word, what probability estimate did you give to "their"? And that's our loss. But then what we do is feed "their," the right answer, into our recurrent neural network. So we always go back to the right answer, generate the probability distribution for next words, and then ask: okay, what probability did you give to the actual next word, "exams"? And then again we use the actual next word. So we do one step of generation, then we pull it back to what was actually in the text, then we ask it for guesses over the next word, and repeat forever. The fact that we don't do free generation, but pull it back to the actual piece of text each time, makes things simple, because we know what an actual author used for the next word. That process is called teacher forcing, and the most common way to train language models is using this kind of teacher forcing method. [01:02:30] I mean, it's not perfect in all respects, because we're not actually exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" from some human-generated piece of text. [01:02:51] Okay, so that's how we get losses, and after that, as before, we want to use these losses to update the parameters of the neural network. And how do we do that? Well, in principle, we just have all of the text that we've collected, which you could think of as one really long sequence: okay, we've got a billion words of text, here it is. So in theory you could just run your recurrent neural network over your billion words of text, updating the context as you go, but that would make it very difficult to train a model.
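Teacher forcing as described above can be sketched in a few lines. The `model_predict` function here is a hypothetical stand-in (it ignores its input and returns random probabilities); the point is purely the training loop's structure: score the actual next word, then feed the actual word back in, never the model's own sample.

```python
import numpy as np

rng = np.random.default_rng(2)
V = 5
corpus = [0, 3, 1, 4, 2]              # toy token ids for a training sentence

def model_predict(prefix_ids):
    """Stand-in for the RNN: returns some distribution over the next word.
    (This toy version ignores the prefix; a real model would condition on it.)"""
    logits = rng.normal(size=V)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Teacher forcing: at each step, predict from the ground-truth prefix,
# take the loss on the actual next word from the corpus, and move on
# using the actual word -- no free generation during training.
losses = []
for t in range(len(corpus) - 1):
    prefix = corpus[: t + 1]          # always the real text, never a sample
    p = model_predict(prefix)
    actual_next = corpus[t + 1]
    losses.append(-np.log(p[actual_next]))

loss = float(np.mean(losses))
```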
[01:03:42] You'd be accumulating these losses for a billion steps and you'd have to store them, and you'd have to store hidden states so you could update parameters, and it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, then run our recurrent neural network on those segments, compute a loss for each segment, and update the parameters of the recurrent neural network based on the losses that we found for that segment. [01:04:24] I describe it here as the segments being sentences or documents, which seems a linguistically nice thing; it turns out that in recent practice, when you want to scale most efficiently on GPUs, people don't bother with those linguistic niceties. They just say a segment is 100 words: just cut every 100 words.
[01:04:50] The reason that's really convenient is that you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, and do vectorized training more efficiently, and things go great for you. [01:05:06] But there are still a few more things we need to know to get things to work great for you; I'll try to get a bit more through this before today ends. We need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network, and the interesting case here is that these Wh parameters are being used everywhere through the neural network, at each stage, as are the We ones. They appear at many places in the network, so how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? [01:05:57] The answer is really simple: you can just pretend that those Wh's at each position are different, work out the partials with respect to them at each position, and then, to get the partials with respect to Wh, you sum whatever you found at the different positions. [01:06:26] So the gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears, and the reason why follows from what I talked about in lecture three. You can also think about it in terms of what you might remember from the multivariable chain rule, but the way I introduced it in lecture three is that gradients sum at outward branches. So what you can think of in a case like this is that you've got a Wh matrix which is being copied by identity to Wh1, Wh2, Wh3, Wh4, etc., at each time step. [01:07:19] Since those are identity copies, they have a partial derivative of one with respect to each other, and so we apply the multivariable chain rule to these copies: we've got an outward-branching node, and you just sum the gradients to get the total gradient for the matrix. [01:07:52] Okay, there's one other trick that's perhaps worth knowing. If you've got segments that are 100 long, a common speed-up is to say: maybe we don't actually have to run backpropagation for 100 time steps; maybe we could just run it for 20 time steps and stop, which is referred to as truncated backpropagation through time. In practice that tends to be sufficient. Note in particular that on the forward path you're still updating your hidden state using your full context, but in the backpropagation you're just cutting it short to speed up training.
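The "sum over the copies" rule can be checked numerically on a tiny scalar RNN, h_t = w·h_{t-1} + x_t with loss L = h_2 (my toy example, with the nonlinearity dropped to keep the algebra visible): treat the two uses of w as separate copies w_1, w_2, take the partial at each position, sum them, and compare against a finite-difference gradient of the shared w.

```python
# Scalar "RNN": h1 = w*h0 + x1, h2 = w*h1 + x2, loss L = h2.
w, h0 = 0.7, 0.5
x1, x2 = 1.0, -2.0

h1 = w * h0 + x1
h2 = w * h1 + x2          # L = h2

# Pretend the two uses of w are different copies and take a partial at each:
grad_pos2 = h1            # dL/dw_2: w_2 multiplies h1 directly
grad_pos1 = w * h0        # dL/dw_1: flows through h1 (dL/dh1 = w, dh1/dw_1 = h0)
grad_sum = grad_pos1 + grad_pos2

# Numerical check: perturb the *shared* w everywhere at once.
eps = 1e-6
def loss(w_):
    return w_ * (w_ * h0 + x1) + x2
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```

The two gradients agree, which is exactly the gradients-sum-at-outward-branches argument from lecture three in miniature.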
[01:08:34] Okay, so just as I did before with an n-gram language model, we can use an RNN language model to generate text, and it's pretty much the same idea, except that now, rather than just using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. So I can start with the initial hidden state, and I can use the start-of-sentence symbol. In the example I had before, I started immediately with "the," hoping that was less confusing the first time, but what you should have asked is: wait a minute, where did the "the" come from? [01:09:24] Normally what we actually do is use a special start-of-sequence symbol, this angle-bracketed <s>, and feed it in as a pseudo-word which has a word embedding. Then, based on this, we'll be generating the first words of the text: we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we do is take that word we generated and copy it down as the next input, then run the next stage of the neural network, sample from the probability distribution, get the next word, say "favorite," copy it down as the next word of the input, and keep on generating. This is referred to as a rollout: you keep rolling the dice and generating forward, producing a piece of text. [01:10:26] Normally you want to stop at some point, and the way we can do that is to have a second special symbol, the angle-bracketed </s>, which says "end of your sequence." So we can generate an end-of-sequence symbol.
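The rollout loop just described can be sketched as below. The vocabulary and the `next_word_probs` function are hypothetical stand-ins (a real RNN would compute the probabilities from its hidden state); what matters is the loop: start from `<s>`, sample, feed the sample back in, stop at `</s>`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy vocabulary with the special start and end symbols.
vocab = ["<s>", "</s>", "my", "favorite", "season", "is", "spring"]
V = len(vocab)

def next_word_probs(prefix_ids):
    """Stand-in for the trained RNN's softmax over the next word.
    (Ignores the prefix; a real model would condition on it.)"""
    logits = rng.normal(size=V)
    logits[0] = -1e9                 # never re-generate the start symbol
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Rollout: sample a word, copy it down as the next input, repeat
# until the end-of-sequence symbol (or a length cap) is produced.
ids = [vocab.index("<s>")]
for _ in range(20):                  # length cap so the demo always stops
    p = next_word_probs(ids)
    nxt = int(rng.choice(V, p=p))
    ids.append(nxt)
    if vocab[nxt] == "</s>":
        break

generated = [vocab[i] for i in ids[1:]]
```

Because the sampling is probabilistic, re-running the rollout gives different outputs, which is the point made below about getting different answers on repeated generations.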
And then we can stop. [01:10:52] Using this, we can generate pieces of text, and essentially this is exactly what's happening if you use something like ChatGPT. The model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process: generating a word at a time, treating it as input, and generating the next word, producing this sort of rollout. And it's done probabilistically, so if you do it multiple times, you can get different answers. We haven't yet gotten to ChatGPT, but we can have a little bit of fun. You can take this simple recurrent neural network that we've just built here, train it on any piece of text, and get it to generate stuff. For example, I can train it on Barack Obama's speeches. That's a small corpus; he didn't talk that much.
you know he didn't talk that much right I've only got a few hundred thousand [01:11:50] I've only got a few hundred thousand words of text it's not a huge Corpus [01:11:53] words of text it's not a huge Corpus I'll just show this and then I can [01:11:54] I'll just show this and then I can answer the question um but you know I [01:11:56] answer the question um but you know I can generate from it and I get something [01:11:59] can generate from it and I get something like the United States will step up to [01:12:01] like the United States will step up to the cost of a new challenges of the [01:12:03] the cost of a new challenges of the American people that will share the fact [01:12:05] American people that will share the fact that we created the problem they were [01:12:08] that we created the problem they were attacked and so that they have to say [01:12:09] attacked and so that they have to say that all the task of the final days of [01:12:11] that all the task of the final days of war that I will not be able to get this [01:12:14] war that I will not be able to get this done um yeah well maybe that's slightly [01:12:17] done um yeah well maybe that's slightly better than my engram language model [01:12:19] better than my engram language model still not perfect you might say but [01:12:21] still not perfect you might say but somewhat better maybe did you have a [01:12:24] somewhat better maybe did you have a question uh yeah so since we're like [01:12:28] question uh yeah so since we're like training the mod like truncated set of [01:12:30] training the mod like truncated set of the Corpus that impose some kind of like [01:12:33] the Corpus that impose some kind of like limitation on like how much we can like [01:12:36] limitation on like how much we can like produce and like still have some cency [01:12:38] produce and like still have some cency like meaning like [01:12:42] like meaning like foring um so yeah so I suggested we're [01:12:45] foring um so yeah so 
[01:12:45] So yes: I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context that we're going to use. I mean, that's a fair amount; 100 words is typically several sentences. But to the extent that you wanted to know even more about the further-back context, you wouldn't be able to, and certainly that's one of the ways in which modern large language models differ: they're using far bigger contexts than that, now thousands of words of prior context. So yes, absolutely, it's a limit on how much far-back context you can use. In some sense, even though in theory a recurrent neural network can feed in an arbitrary-length context, as soon as I say "practically, we cut it into segments," that actually means we are making a Markov assumption again: we're saying the further-back context doesn't matter.
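The chunking just described can be sketched in a few lines. This is a minimal illustration, not anything from the course materials; the 5-token segment length in the demo stands in for the 100-word chunks mentioned above:

```python
def chunk_corpus(tokens, segment_len=100):
    """Cut a token list into fixed-length training segments.

    No context crosses a segment boundary, so the model effectively
    makes a Markov assumption: anything more than segment_len tokens
    back cannot influence a prediction.
    """
    return [tokens[i:i + segment_len]
            for i in range(0, len(tokens), segment_len)]

# Tiny demo with a 5-token segment length (the lecture's setup uses 100):
corpus = ("the united states will step up to the cost "
          "of a new challenges of the american people").split()
segments = chunk_corpus(corpus, segment_len=5)
print(segments[0])   # ['the', 'united', 'states', 'will', 'step']
```

In training, each segment is fed to the RNN independently, with the hidden state reset at each boundary; that reset is exactly where the far-back context gets thrown away.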
[01:13:47] Okay, a couple more examples. Instead of Barack Obama I can feed in Harry Potter, which is actually a somewhat bigger corpus of text, and generate from that, and I get: "sorry Harry shouted panicking I'll leave those brooms in London are they no idea said nearly headless Nick casting low close by Cedric carrying the last bit of trial charms from Harry's shoulder and to answer him the common room perched upon it forearms held a shining knob from when the spider hadn't felt it seamed he reached the teams too." Well, there you are. You can do other things as well: you can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try to cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate Ranch Barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings."
[01:14:52] "Two tablespoons of Parmesan cheese, chopped; one cup of coconut milk; and three eggs, beaten. Place each pasture over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot and bodied fresh mustard orange and cheese. Combine the cheese and salt together the dough in a large skillet and the ingredients and stir in the chocolate and pepper." Yeah, it's not exactly a very consistent recipe when it comes down to it; it sort of has the language of a recipe, but that's about all. Maybe if I had scaled it up more and had a bigger corpus it would have done a bit better, but it's definitely not using the ingredients. Let's see, it's almost time today, so maybe about all I can do is one more fun example, and then after that... oh yeah, I should probably do that bit at the start next time. So, as a variant of building RNN language models: so far we've been building them over words.
[01:16:03] The token time steps over which we build it are words. But actually you can use the idea of recurrent neural networks over units of any other size, and people have used them for other things: people have used them in bioinformatics, for things like DNA, for gene sequencing or protein sequencing and anything like that. But even staying with language, instead of building them over words you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. That can sometimes be useful, because it allows us to generate things that look like words, and perhaps have the structure of English words. And similarly, there are other things that you can do.
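The character-level generation loop being described can be sketched as follows. This is a toy, with random (untrained) weights and an assumed 27-character vocabulary, so its output is gibberish; but the sampling loop, one letter at a time through a softmax over characters, is the shape of the real thing:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcdefghijklmnopqrstuvwxyz ")      # assumed toy character set
V, H = len(vocab), 32

# Untrained parameters; a real model would learn these by backpropagation.
Wh = rng.normal(0, 0.1, (H, H))
Wx = rng.normal(0, 0.1, (H, V))
Wo = rng.normal(0, 0.1, (V, H))

def step(h, char_id):
    """One RNN step over characters: new hidden state + next-char distribution."""
    x = np.zeros(V)
    x[char_id] = 1.0                             # one-hot current character
    h = np.tanh(Wh @ h + Wx @ x)
    logits = Wo @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                        # softmax over characters

def generate(n_chars=30):
    h = np.zeros(H)                              # plain zero initial hidden state
    c = vocab.index(" ")
    out = []
    for _ in range(n_chars):
        h, p = step(h, c)
        c = rng.choice(V, p=p)                   # sample the next character
        out.append(vocab[c])
    return "".join(out)

print(generate())                                # 30 characters of letter-soup
```

The hidden state starts as plain zeros here; the contextual variant discussed next replaces that zero vector with something meaningful.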
[01:17:04] So, before, when I initialized the hidden state, I said: oh, you just have an initial hidden state, and you can make it zeros if you want. Well, sometimes we're going to build a contextual RNN, where we initialize the hidden state with something else. In particular, I can initialize the hidden state with the RGB values of a color, and then generate, a character at a time, the names of paint colors. I can train a model on a paint company's catalog of color names and the RGB values of those colors, and then I can give it different paint colors and it'll come up with names for them. And it actually does an excellent job; this one worked really well. Look at this: Gasty Pink, Power Gray, Naval Tan, Bco White, Hble Gray, Home Star Brown. Now, couldn't you just imagine finding all of these in a paint catalog? I mean, some of them... there are some really good ones over here in the bottom right.
[01:18:20] This color here is "Dope," and then there's "Stoner Blue," "Stanky Bean," and "Turdly." Now, I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today, and we'll do more of the science of neural networks next time.

================================================================================
LECTURE 006
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 6 - Sequence to Sequence Models
Source: https://www.youtube.com/watch?v=Ba6Fn1-Jsfw
---
Transcript

[00:00:05] Okay, hi everyone, welcome back, all of CS224N. So for today, the plan is essentially a continuation of what we started on Tuesday. I'm going to say more about language models and more about RNNs, in particular introducing a more advanced form of recurrent neural network which was for a while very dominant, LSTMs; we'll talk about those. And then in the latter part, as something more to be done with recurrent neural networks, we'll start looking at neural machine translation.
[00:00:48] Okay, so on Tuesday, what we did was introduce language models, a system that predicts the next word, and then I introduced recurrent neural networks: a neural architecture that can take sequential input of any length, applies the same weights at each step, and can optionally produce output on each step. These are two distinct notions, though they tend to go together: a recurrent neural network can be used for other purposes, on any kind of sequence, and I'll mention a few of those later today; and language modeling is a traditional component of many NLP tasks, anything to do with generating text or estimating likelihoods of pieces of text. Indeed, in the modern instantiation of large language models, essentially everything we do in NLP is being done by language models.
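The recap above, the same weights reused at every step over an arbitrary-length sequence, with an optional output per step, comes down to one update rule: h_t = tanh(W_h h_{t-1} + W_x x_t + b). A minimal sketch (the names W_h, W_x, b are illustrative, and the matrices here are random, not trained):

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, b, h0):
    """Run an RNN over a list of input vectors xs.

    The same (Wh, Wx, b) are applied at every time step -- that is what
    lets the network handle sequential input of any length -- and one
    hidden state (from which an output could be read) is produced per step.
    """
    h, hs = h0, []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)   # h_t = tanh(Wh h_{t-1} + Wx x_t + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(1)
H, D, T = 4, 3, 5                          # hidden size, input size, sequence length
hs = rnn_forward([rng.normal(size=D) for _ in range(T)],
                 0.1 * rng.normal(size=(H, H)),
                 0.1 * rng.normal(size=(H, D)),
                 np.zeros(H), np.zeros(H))
print(len(hs), hs[0].shape)                # one hidden state per input: 5 (4,)
```

The same function runs unchanged on a sequence of any other length, which is the point of weight sharing across time.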
[00:01:51] So, a language model: one way to do it is with a recurrent neural network, but it's certainly not the only way. We also talked last time about n-gram language models, which were language models, and then starting next week we'll start to talk about Transformers, which are now the most widespread way of building language models. So, just to finish off a tiny bit that I didn't get to last time on evaluating language models: well, one way to evaluate language models is what I did in class last time, generate some text and say, "hey, doesn't this text look good?" But often we want something more rigorous than that, and the standard way to evaluate language models is to say: well, a language model scores a piece of text and says how likely it is, and our standard for text in the language is stuff produced by human beings. So we find a new piece of text, which wasn't text that the model was trained on.
[00:02:53] Right, we want some fresh evaluation data, and we show it to the language model. We can then ask the language model to predict the successive words of this text, and the better it is at doing that, the better a language model it is, because it's more accurately able to predict a human-written piece of text. The standard way that that's measured is with this measure called perplexity. For perplexity, we take the probability of a prediction from the language model and invert it, so instead of it being, you know, .002 or something, we turn it into 500 or something like that. Then we take those numbers, take the product of them at each position in the text, and find the geometric average.
[00:03:56] But in this class we've been tending to look at negative log likelihoods and the idea of cross-entropy, and what perplexity is, is just the exponential of the cross-entropy. So if you're already familiar with per-word negative log likelihoods, you just exponentiate that and you get the perplexity. Now, there's one other little trick, as to what base you use for your logarithms and exponentials. Traditionally, thinking of binary and bits, a lot of the time people used base 2 for measuring perplexity. That's kind of gone out now; a lot of the time people are now using natural logs. But if you're comparing numbers, they're going to be different depending on what base you're using for things, so you need to be aware of this.
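Concretely, the relationship just stated, perplexity as the exponentiated per-word negative log likelihood, looks like this (the per-word probabilities below are made up for illustration):

```python
import math

def perplexity(word_probs):
    """Exponential of the average per-word negative log likelihood,
    equivalently the inverse geometric mean of the model's probabilities
    for the observed words (natural logs here)."""
    avg_nll = sum(-math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_nll)

# Made-up probabilities a model assigned to the four words of some text:
print(round(perplexity([0.1, 0.02, 0.5, 0.05]), 2))   # 11.89

# A model that assigns a uniform 1/64 to every word has perplexity 64:
print(round(perplexity([1 / 64] * 10), 6))            # 64.0
```

The base caveat above bites when a log likelihood computed in base 2 is exponentiated in base e, or vice versa; done consistently, as here, the bases agree on the perplexity.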
[00:05:04] Now, from a sort of modern perspective, it kind of makes no sense why perplexity is used. The story of why perplexity was used is that, you know, in the bad old days of symbolic artificial intelligence, when all of those famous people like John McCarthy and Ed Feigenbaum were around doing logic-based systems, some people at IBM, including Fred Jelinek, started exploring probabilistic methods for speech recognition and other similar problems. And the story Fred Jelinek used to tell was that at that time, this was in the late '70s or early '80s, none of the AI people he was trying to talk to understood how to do any real math, and didn't understand any information-theory notions like cross-entropy or cross-entropy rate. So he had to come up with something simpler that they could understand, and what he came up with, by doing this exponentiation, was perplexity.
[00:06:13] You can think of a perplexity number as being equivalent to how many uniform choices you're choosing between. So if the perplexity of something is 64, that's like having a 64-sided die that you're rolling each time, and your chance of getting a one on that is your chance of guessing the right word. So that was why perplexity got introduced, but it's kind of stuck, and when you see scores for language models you generally still see perplexities; a lower perplexity is better. So here are the kinds of numbers, and where progress was made with neural language models. Before that, people used n-gram language models, and people used clever ways to smooth them, using methods I vaguely alluded to last time, like add-k smoothing and doing backoff. And people used clever methods: around the 2000s, the cleverest known method for smoothing n-gram language models was this thing called interpolated Kneser-Ney smoothing.
[00:07:22] And for a big language model using that, the perplexity was about 67, which in some sense means that you weren't very good at predicting the next word. But you know, that had actually been enormous progress: when I was a young person doing NLP, perplexities were three-figure numbers; you were commonly seeing perplexities of 150 or something like that. So progress was made. When RNNs were first introduced, people weren't really able to do better with a sort of pure RNN, but they could do better by combining an RNN with something else, such as a symbolic maximum entropy model, which I'm not going to explain, and those got numbers like that 51. But where progress really started to be made was when LSTMs started to be used as an improved RNN, which is what I'm going to come to next.
[00:08:25] So here are some LSTM models, and now you're getting numbers like 43 and 30. And for 30, you've sort of halved the perplexity, which in cross-entropy terms means you've reduced the cross-entropy by about one bit, and so you've made real progress in your language modeling. Now, by modern standards these numbers are still really high: for the best language models that we have now, you're getting perplexities in the single digits; you're getting models that are very often able to guess exactly the right word, though of course not always, because no one can predict what word is going to be said by someone next in a lot of circumstances. Okay, so to motivate LSTMs, I wanted to say a bit about how there are problems with RNNs, and why that motivated fixing things: these are the problems of vanishing and exploding gradients.
[00:09:30] So what we wanted to do was say: okay, we've tried to predict a word at position four, and often we're not going to predict it with 100% probability, so we have a loss, the negative log likelihood we give to that word, and we're going to want to backpropagate that loss through the sequence and work out our gradients, as we always do. Now, just one note about something someone asked after class last time: I sort of showed backpropagating through the whole sequence, but we're doing this at every time step, right? So we're going to backpropagate a loss from time step two, backpropagate a loss from time steps 3, 4, 5, 6, 7; we're doing it for each one, and in one of the slides last time we then discussed how we're going to sum all of those losses, or work out the average loss. But for doing this one, when we backpropagate this loss, what happens?
[00:10:30] Well, what happens is we're going to do the same kind of chain rule, where we're multiplying these partial derivatives at every time step. Here we've only got a few of them, but maybe we're going to have a sequence 30 long, and so we're going to be multiplying, each time, the partial of h_k with respect to h_{k-1}. So what kind of effect is that going to have? In particular, we might ask what happens if these are small, or what happens if these are large. Well, if they're small, the gradient will gradually get smaller and smaller and disappear as we backpropagate it along the sequence. "Yeah, so why are we taking the partial of J with respect to h? Shouldn't we take the partial of J with respect to W?" Sure, I mean, we're doing that as well, but in general we have to walk the partials along, and then, you know, we then have a W at the next step.
[00:11:49] I mean, if we're thinking of the computation graph, as we do the chain rule backwards along it, we're going through a W at each step and then arriving at another h. Right. So at this point you can do some math and think about things, and there's a couple of papers mentioned at the bottom here, which I'm actually rushing ahead and not going to go through very carefully. But the point is that if you're taking the partial of h_t with respect to h_{t-1}, and if you make a simplifying assumption and suppose there isn't a nonlinearity, suppose sigma is just the identity, then what the partial will be is the matrix W_h. And so if you keep on backpropagating along the recurrent neural network, what you end up with is powers of the matrix W_h.
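What repeated multiplication by W_h does to a gradient is easy to see numerically. A small sketch under the same simplifying assumption (identity nonlinearity), using made-up diagonal matrices whose eigenvalues sit just below and just above 1:

```python
import numpy as np

def backprop_norms(Wh, g, steps=30):
    """Norm of a gradient vector after repeated multiplication by Wh^T,
    as when backpropagating through `steps` time steps of a linear RNN."""
    norms = []
    for _ in range(steps):
        g = Wh.T @ g
        norms.append(np.linalg.norm(g))
    return norms

g = np.ones(2)
shrink = 0.9 * np.eye(2)   # all eigenvalues 0.9 -> vanishing gradient
grow   = 1.1 * np.eye(2)   # all eigenvalues 1.1 -> exploding gradient

print(backprop_norms(shrink, g)[-1])   # about 0.9**30 * ||g||: tiny
print(backprop_norms(grow, g)[-1])     # about 1.1**30 * ||g||: huge
```

Only when the relevant eigenvalues sit at almost exactly one does the gradient survive 30 steps roughly unchanged, which is the knife-edge described in the lecture.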
of what happens when you raise that matrix to higher and higher powers. Well, at that point you can represent the matrix in terms of its eigenvectors and eigenvalues, and then there are two possibilities: either all the eigenvalues are less than one, and that means the number will be getting smaller and smaller and smaller as you raise it to higher powers, or it can have eigenvalues that are larger than one, and then things will get bigger and bigger as you go further back. [00:13:40] So essentially, as you backpropagate the gradients backwards, unless things are precisely corresponding to a largest eigenvector with an eigenvalue of approximately one, you're either going to get a vanishing or an explosion, and both of those will be kind of bad. [00:14:02] So why is a vanishing gradient a problem? I mean, in a sense, you could think it's not a problem; it's
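The eigenvalue argument above can be made concrete with a tiny numerical sketch. This is my own toy illustration, not from the lecture: the matrices and the 0.9 / 1.1 eigenvalues are made-up values chosen to show the two regimes.

```python
import numpy as np

def jacobian_product_norm(W_h, k):
    """Norm of W_h^k: under the identity-nonlinearity assumption above,
    this is the size of the gradient factor after backpropagating k steps."""
    return np.linalg.norm(np.linalg.matrix_power(W_h, k))

# Eigenvalues below one: the repeated product shrinks (vanishing gradient).
W_small = 0.9 * np.eye(2)
# Eigenvalues above one: the repeated product grows (exploding gradient).
W_large = 1.1 * np.eye(2)

for k in (1, 10, 50):
    print(k, jacobian_product_norm(W_small, k), jacobian_product_norm(W_large, k))
```

With eigenvalues of 0.9 the gradient factor after 50 steps is tiny; with 1.1 it is enormous. Only eigenvalues of roughly one keep it stable, which is the dichotomy described here.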
what should be happening, because all else being equal, you know, the closest words are the most relevant ones, and so that's where you should be updating your parameters the most. And to some extent that's true, but nevertheless the vanishing gradient in this model happens much too severely, so that if you're looking at the loss from a later position and comparing it to the loss from an earlier position, and seeing how things are updating, the update is primarily being determined by the very nearby loss and not by the far-away loss; the gradient signal from far away is much, much smaller. [00:15:02] And well, that's bad, because overall, for language modeling, there are lots of cases where we want to be able to transmit signals a long distance. [00:15:13] So here's my piece of text: when she tried to print her ticket, she found that the
printer was out of toner. She went to the stationery store to buy more toner; it was very overpriced. After installing the toner into the printer, she finally printed her... [00:15:29] Yeah, so, you know, for a human being it's obvious: we can predict this with pretty much probability one, so a really low perplexity for making this decision. [00:15:37] But that depends on getting back to the tickets, which are about 20-odd words back, right? If you're just seeing "installing the toner into the printer she finally printed her", it could be anything: it could be her paper, her invitation, her novel; lots of things it could be. You're certainly not going to guess "tickets". [00:16:02] So we sort of want to have these really long-distance dependencies, but we're only going to be able to learn these long-distance dependencies if we're actually getting sufficient signal between that position and when the word
[00:16:19] "tickets" appears near the beginning, so that we can learn the fact that having "tickets" 20 words back is the good predictive thing for predicting "tickets" here. [00:16:30] And what we find is, you know, when the gradient becomes very small, the RNN doesn't learn these kinds of long-distance dependencies, and so it's unable to make these predictions well at test time. [00:16:50] I mean, this is a very rough back-of-the-envelope estimate, but what people actually found is that with the kind of simple RNN that we've introduced up until now, the amount of effective conditioning you could get was about seven tokens back; if things were further back than that, it just never [00:17:15] learned to condition on them. And so, you know, compared to when we were talking about n-grams, and I said usually the maximum people did was a 5-gram, occasionally
a bit bigger, because of the fact that there was this exponential blowout: although in theory we've now got a much better solution, in practice, because of vanishing gradients, we're only getting the equivalent of an 8-gram. So we haven't made that much progress, it feels like. [00:17:47] So there's a reverse problem which can also happen: exploding gradients. If the gradient becomes very large, because the eigenvalues of that matrix are large, well, what we're doing for the parameter update is, you know, we've got a learning rate, but essentially if the gradient is very large we're going to make a very, very large parameter update, and that can cause very bad updates. [00:18:15] Because we're sort of assuming that we're taking a step in the direction of the gradient, and, well, we might overshoot a little, but we'll be roughly in the right zone. But, you know, if we had an enormously
[00:18:27] exploded gradient, well, we could be sort of walking off anywhere, and, you know, we think we're heading to the Sierras and we end up in Iowa or something like that, right? We could just go arbitrarily far, and where we're ending up might not be making any progress whatsoever. [00:18:47] So exploding gradients are a problem; they can also cause infinities and NaNs, and they're always a problem when you're training models. [00:18:59] Now, for dealing with exploding gradients, this is the accepted wisdom; this unfortunately isn't high-falutin' math, really. What people use for exploding gradients is a crude hack: they clip gradients. But, you know, it works really well, and you really want to know about this, because clipping gradients is often essential to having neural networks not have problems. So what we
do for gradient clipping is: we work out the norm of the gradient, and if it seems too large (and that varies, but normally 5, 10, or 20, something like that, is seen as the limit of what's okay for the norm of a gradient), if the norm of your gradient is too large, you just scale it down in every direction and you apply a smaller gradient [00:19:53] update. Um, it works. [00:19:59] Yeah, so that problem is solvable. But fixing the vanishing gradient seemed a more difficult problem, right? This was the problem: that our RNNs effectively couldn't preserve information over many time steps. And, well, what seemed to be the problem there? The problem seems to be really that we've got an architecture that makes it very hard to preserve information. [00:20:28] So if we look at the hidden state from one time step to the next time step, it's completely being rewritten, right? So we're taking the previous
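The clipping rule just described is a one-liner in practice. Here is a minimal numpy sketch of clipping by global norm; the function name is my own, and the 5.0 threshold is one of the example values mentioned:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Gradient clipping as described above: if the gradient's norm exceeds
    the threshold, rescale it so its norm equals the threshold. The direction
    of the update is unchanged; only its length shrinks."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])    # norm 50: too large, gets rescaled to norm 5
g_ok = np.array([1.0, 2.0])   # norm ~2.24: small enough, passed through unchanged
```

Deep learning frameworks ship this as a utility (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), so in practice you rarely write it yourself.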
time step's hidden, um, hidden vector, we're multiplying it by a matrix, which completely changes it in general, adding in other stuff from the input. [00:20:51] So if we'd just like to say: we'd like you to carry forward information, there's useful stuff in h_{t-1}, can you just keep it around for a while: it's not actually very easy to do in this formulation, because trying to learn W matrices that mostly preserve what was there before isn't at all an obvious thing to do. [00:21:17] So the question was: could we design an RNN which had a sort of a memory where it is easy to preserve information? Yes? [00:21:25] [Student:] So, in one of the earlier slides you mentioned the exponentiation; in their analysis, they removed the nonlinearity. So does the nonlinearity prevent vanishing or exploding? [00:21:43] No, it actually doesn't. I mean, you can make an argument
that it should help, because you've got, effectively, if you've got something like tanh, you've got a flattening function, so it should help somewhat, but it doesn't solve it, even if you're using a tanh nonlinearity. [00:22:01] Well, so, I guess, sorry, it should help with exploding, though actually even that still happens; but it definitely doesn't help with the vanishing. [Student:] If you have a sigmoid, that is bounded, so you're always pushing the value between zero and one, so it's not going up or going down; it's staying between zero and one. [00:22:34] Well, I guess, would it go up? And so you have a really small value that becomes one minus a really small value, sigma times one minus sigma... [00:22:47] Okay. Um, yeah, so can we have a different architecture, so that we have a memory that you can add to? And so that led into this new kind of neural network: the LSTM. [00:23:06] So this is going back a few years, but at any rate, um, this was trying to
improve Siri suggestions, and the big breakthrough that was being described was: oh, we're now using an LSTM in the keyboard prediction, and the whole advantage of that was going to be being able to predict context further back, so you could differentiate between "the children are playing in the park" versus "the Orioles are playing in the playoffs". [00:23:34] Okay, so the big thing that was seen as very successful was these LSTMs: long short-term memories. Just to say a little bit of the history here, right, just on how to parse this name, which I think people often don't even understand. [00:24:00] So what you were wanting to do was model short-term memory, right? Because, for humans, people normally distinguish between the short-term memory of stuff that you heard recently versus things that you permanently stored away. Um, and the suggestion was:
well, in short-term memory, humans can remember stuff for quite a while, right? You know, if you're having a conversation, you can still remember the thing that the person said a few turns ago in the conversation, and bring it back up: "oh, didn't you say they took last weekend off?" or something, right. And, well, the problem was that the simple RNNs' short-term memory was only about seven tokens, and so we'd like to make it better than that: we wanted long short-term memory, and that's where this name came about. [00:24:53] And so this was a type of recurrent neural network that was proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem. I mean, there's actually a second relevant piece of work that came a few years later: that first paper is the one that everybody cites, but there's then a second paper by Gers and Schmidhuber in 2000, which actually
[00:25:19] introduces a crucial part of the LSTM as we've used it in the 21st century that wasn't in the original paper. [00:25:28] And, you know, it's sort of an interesting story, all of this. So Jürgen Schmidhuber and his students did a lot of really crucial foundational work in neural networks in these years, the late years of the '90s, when just about everybody else had given up on neural networks. [00:26:00] So unlike these days, where doing pioneering work in neural networks is a really good way to get yourself hugely compensated jobs at Google, Meta, or OpenAI, it really wasn't back in those days. So, you know, if you ask what happened to these students, Hochreiter and Gers: both of them are still in academia, but Gers seems to have given up on AI and neural networks altogether and does stuff in the area of multimedia, um,
and SE Haw riter um is [00:26:39] multimedia um and SE Haw riter um is still in machine learning but you know [00:26:42] still in machine learning but you know for quite a long time he sort of [00:26:44] for quite a long time he sort of basically gave up on doing more General [00:26:46] basically gave up on doing more General neural network stuff and went into [00:26:48] neural network stuff and went into bioinformatics so if you look at his [00:26:50] bioinformatics so if you look at his Publications from about 2 [00:26:52] Publications from about 2 2015 um they were all in bioinformatics [00:26:55] 2015 um they were all in bioinformatics and most of them weren't using neural [00:26:57] and most of them weren't using neural networks at all um though um kind of [00:27:00] networks at all um though um kind of nicely I mean he's actually gone back [00:27:02] nicely I mean he's actually gone back into new networks more recently and is [00:27:05] into new networks more recently and is publishing a new networks again um yeah [00:27:08] publishing a new networks again um yeah so um really not much attention was paid [00:27:11] so um really not much attention was paid to this work at the time and so it only [00:27:14] to this work at the time and so it only sort of really kind of gradually seeped [00:27:17] sort of really kind of gradually seeped out further um so um Schmid hu had a [00:27:21] out further um so um Schmid hu had a later student in the mid 2000s decade [00:27:24] later student in the mid 2000s decade Alex Graves um and Alex Graves um did [00:27:29] Alex Graves um and Alex Graves um did more stuff with lstms and for people [00:27:32] more stuff with lstms and for people who've seen um speech recognition where [00:27:35] who've seen um speech recognition where people commonly do CTC loss and decoding [00:27:38] people commonly do CTC loss and decoding Alex Graves invented that but um most [00:27:42] Alex Graves invented that but um most crucially um Alex 
Graves then went to Toronto to be a postdoc for Geoff Hinton, and that brought more attention to the fact that LSTMs were a good model. And then Geoff Hinton went to Google in 2013, and that was then, sort of, the use of LSTMs at Google in the 2014-to-2016 period was when they really hit the world and became, for a while, the completely dominant framework people used for neural [00:28:17] networks. In the world of, I guess, startups, this is what you call being too early, for the first people. Um, yeah. [00:28:29] Okay, long short-term memories: back to the science. So, let's see, there's a slide here that talks about long short-term memories, but maybe I'll just skip straight ahead and start to show the pictures. [00:28:46] So we've still got a sequence of inputs x_t, and the difference now is, inside our neural network, we're going
to have two hidden things: one that's still called the hidden state, and the other one that's referred to as the cell state. [00:29:03] And so what we're going to do is, we're going to modulate how these things get updated by introducing the idea of gates. And gates are calculated things: vectors whose values are probabilities between zero and one, things that we're going to use to turn things on or shut them off in a probabilistic way. So we're going to control the movement of information by having gating. [00:29:37] And so we're going to calculate three gating vectors. These vectors are the same length as our hidden states, and the way we calculate these gating vectors is with an equation that looks basically exactly the same as what we were using for recurrent neural networks, apart from that the sigma there is definitely going to be the logistic function, which goes between zero and one, so we get
[00:30:02] probabilities. And the three gates we're going to calculate: there's a forget gate, which is going to say how much do we remember of the previous time's hidden state. I think the forget gate was actually wrongly named; I think it makes more sense to think of it as a remember gate, because it's actually calculating how much you're remembering. [00:30:25] Okay, then we've got an input gate, and the input gate is going to say how much are you going to pay attention to the next input, the next x_t, and put it into your hidden state. And then you have an output gate, and the output gate is going to control how much of what's in the cell, which is your primary memory, are you going to transfer over to the hidden state of the network. [00:30:54] Okay, so once we have those gates, what we're then going to do is have these equations, which are how we're going to update things. So the
[00:31:07] first thing we're going to do is work out a potential new cell content. So the new cell content is going to be calculated using exactly the same kind of equation we saw last time for recurrent networks: we're going to have these two matrices, the cell's W and the cell's U, and we're going to multiply one by the last time's hidden state and the other by the new input, add on a bias, and that's a potential update to the cell. [00:31:47] But then, how we're actually going to update the cell is by making use of our gates. So we're going to say: the new cell's content is going to be the old cell's content, Hadamard-producted with the forget gate, so that's how much to remember of the previous cell's content, plus this calculated update, Hadamard-producted with the input gate: how much to pay attention to this new potential update that we've, um, invented.
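Putting the gates and the cell update together, one full LSTM step can be sketched in numpy. This is a toy illustration under my own assumptions: the parameter names, sizes, and random initialization are made up, and the last line also computes the hidden state from the output gate and a tanh of the cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: three logistic gates, a candidate cell content,
    a gated cell update, and a hidden state read out through tanh."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])        # forget ("remember") gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])  # candidate cell content
    c_t = f * c_prev + i * c_tilde  # Hadamard products: keep some old cell, write some new
    h_t = o * np.tanh(c_t)          # expose part of the cell as the new hidden state
    return h_t, c_t

# Toy sizes and random parameters, just to run one step.
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
p = {}
for g in "fioc":
    p["W" + g] = rng.normal(size=(d_h, d_h))
    p["U" + g] = rng.normal(size=(d_h, d_x))
    p["b" + g] = np.zeros(d_h)
h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), p)
```

Because every gate is a sigmoid output in (0, 1) and the readout goes through tanh, each component of h_t stays strictly inside (-1, 1), while the cell c_t can accumulate additively over time.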
And then for calculating the new hidden state, that's going to be the Hadamard product between the output gate and our c_t having been put through a tanh. And one idea here: we're thinking about how much to keep on remembering what we've had in the past, but for thinking about only sending some information to the hidden state, a way to start thinking about that is that the hidden state of a recurrent neural network is doing multiple duty. On the one hand, we're going to feed it into the output to predict the next token; but another thing it does is store information about the past that might come in useful later, that we'd like to have carried through the sequence. And so really only some of what's in the hidden state do we want to be
using to predict the current word. Some of it isn't relevant to predicting the current word, but would be good stuff to know for the future. So if the previous words were "sat in", for predicting the next word we basically just need to know we're in a "sat in" context, where "the" or "a" will come next. But if earlier on the sentence had been saying "the King of Prussia", somewhere in the hidden state we want to be keeping the information that there's a King of Prussia, because that might be relevant for predicting future words. And so it makes sense that we only want some of what's in our memory being used to predict the next word in the current context. So the cell is our long short-term memory, and then, as we move over to the hidden state, those are the things that are going to be relevant for generation. Yeah, I've sort of said that. Okay: all these are vectors of the same
[00:34:29] length n. Yeah, so all of these things, both the gates and the new values for the cell and hidden state, are all vectors of length n. And part of what actually makes things convenient when you're running one of these is that, up to this point, all of these quantities have exactly the same shape, so you can put them all together into one big matrix and do the computations for all four of them as a single big matrix multiply if you want. Question: if you did not have the tanh activation in the hidden state update, couldn't the output gate have been expressed by the other gates? If this bit wasn't here, then you would not need an output gate, because f_t would have been able to account for it in some sense. My question is how much does having it... Right, well, no: to the extent that you want to mask out part of what's
in the cell, so that it's not visible when you're generating the next token, isn't it still useful to have an output gate? You don't want h_t equal to c_t; you want some of the contents of c_t to be masked out, so that you're not seeing it when generating the output. Wouldn't that masking have been accounted for by f_t? No, because you want to keep it in c_t: there's information you want to keep in c_t for the future, but that you don't want visible when generating the current next word. Yeah. In some sense, the bit I have the hardest time explaining is why it's necessarily better to have a tanh here. You can sort of argue that the cell can just stay unbounded real numbers, and then this gets it back into a shape that stays between minus one and one, which is good for the hidden state. But it's a little bit... I guess they did it
that way; it seemed to work well. Okay, here's another way of looking at it, which may or may not be more helpful: as a picture. So at each time step we've got, as before, an input and a hidden state, and then we're going to calculate an output from that hidden state, but we've got this more complex computational unit. And these pictures of the more complex computational unit were diagrams made by Chris Olah, who's someone who now works at Anthropic. And if you blow that up, this is showing the computation: you're feeding along recurrently the c cell as the primary recurrent unit, but you've also got h carried along, because h is being used to calculate stuff for the next time step, and then a new h is being generated. And so you're computing the forget gate; you're forgetting some of the cell
content. You're computing an input gate; you're using that to compute a potential new cell content; you write some of that into the cell, depending on the input gate. Then you compute an output gate, and some of the cell will go into the computation of h, depending on the output gate. And then, just like for the previous recurrent neural network, for working out what the predicted next word is, you work out an output layer by taking the h, doing another matrix multiply plus b_2, and then using a softmax on that to actually predict the next word. Okay, so this all seems very complex... do you have a question? Yeah: how are we deciding the threshold? I imagine just some sort of threshold around the probability of what we're remembering and what we're forgetting. Well, you know, what we're getting is more than a threshold,
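The answer here can be made concrete: there is no single scalar threshold, because each gate is a whole vector with one value per cell dimension. A toy sketch (the 22 dimensions and the hard 0/1 gate values are purely illustrative; real gates are soft sigmoid outputs):

```python
# No global threshold: the forget gate is a whole vector, one value per
# dimension of the cell, so the network can keep some dimensions intact
# and drop others. Hard 0/1 values here are for illustration only.
cell = [float(d) for d in range(1, 23)]       # a toy 22-dimensional cell state
forget = [1.0] * 17 + [0.0] * 5               # "keep dims 1-17, drop 18-22"
kept = [f * c for f, c in zip(forget, cell)]  # Hadamard product with the gate
```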
right, because we're actually calculating a whole vector of forgetting and remembering. So it can choose to say: okay, dimensions 1 to 17, keep all of that, and throw away dimensions 18 to 22; or, really, probabilistically to different extents. And so it's unspecified; it's up to the model what it learns. But we're hoping it will learn that certain kinds of information are useful to keep carrying forward for at least a while. But then we can use both the contents of the hidden state and the cell... sorry, the next input, to decide to throw away certain information. So we might think there are certain cues: for example, if it sees the word "next", it might think, okay, change of topic, now would be a good time to forget more stuff and reset. But it's learning which dimensions of this vector to hold on to, in an
unconstrained way, whatever's useful to do a better job at language modeling. Okay. So this all looks like a very complex and convoluted design, and quite honestly, when teaching this around 2016 and 2017, when this was the best kind of neural network we had for language modeling, we literally spent hours of class time going through LSTMs and variants of LSTMs with different properties, because there are different ways you can do the gating: you can have fewer gates or more gates and do different things. And it seemed the most important thing to know. In 2024 it's probably not the most important thing to know, but LSTMs are a thing to be aware of, and we are going to use them for assignment three. But you can just ask PyTorch for an LSTM and it'll give you one that does all of this stuff.
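As the lecture says, in practice you can just ask PyTorch for an LSTM rather than wiring the gates yourself. A minimal usage sketch (the sizes here are arbitrary; note that PyTorch stores the gate weights stacked into one matrix with 4 × hidden_size rows, which is exactly the "one big matrix" trick mentioned earlier):

```python
import torch

# nn.LSTM implements all the gate equations internally.
lstm = torch.nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)        # batch of 2 sequences, 5 time steps, dim 4
out, (h_n, c_n) = lstm(x)       # out holds the hidden state at every step

print(out.shape)                # torch.Size([2, 5, 3])
print(h_n.shape, c_n.shape)     # final hidden and cell states: [1, 2, 3] each
# The four gates' input weights live in one stacked matrix (i, f, g, o chunks):
print(lstm.weight_ih_l0.shape)  # torch.Size([12, 4]) == (4 * hidden, input)
```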
But there is one thing that I really want to focus on: what is the good thing that an LSTM achieves? And really, the secret of why you get this fundamentally different behavior in an LSTM is that you have that plus sign right there. For the simple recurrent neural network, at each time step the next hidden state was the result of multiplicative stuff, and therefore it was very hard just to preserve information. Whereas the essence of the LSTM is to say: well, look, you've got this past memory of stuff you've already seen, and what we want to do is add some new information to it, which fundamentally seems kind of right for human memories, that they're basically additive. And, as I said earlier, it was actually the second paper that introduced a crucial part of the LSTM: the first version of the LSTM didn't have the
forget gate, so it was a purely additive mechanism: you were deciding what to add to your memory as you went along. But that proved to be not quite perfect, because if you keep adding more and more stuff over a long sequence, that tends to become dysfunctional after a certain point, and so the big improvement was to add this forget gate, so that some of it went away. But nevertheless, having things basically additive fixes the problem of gradient flow: you no longer have vanishing gradients, and it makes it something that seems much more memory-like; you're adding to the things that you know. Okay. So the LSTM architecture allows you to preserve information over many time steps in the cell: if you set the forget gate to one and the input gate to zero, you're just linearly passing along in the cell, indefinitely, the same information.
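That limiting case is easy to check numerically. In the sketch below, the additive cell update with forget gate 1 and input gate 0 carries a value unchanged for 100 steps, while a purely multiplicative recurrence (a recurrent weight of 0.5, chosen just for illustration) drives the same value toward zero:

```python
# If the forget gate is 1 and the input gate is 0, the additive update
# c_t = f * c_{t-1} + i * c_tilde_t passes the cell along unchanged, while
# a purely multiplicative recurrence with weight < 1 vanishes.
c = 3.0                        # a value stored in the LSTM cell
h = 3.0                        # the same value in a simple-RNN hidden state
for _ in range(100):
    c = 1.0 * c + 0.0 * 0.7    # f = 1, i = 0: the candidate (0.7) is ignored
    h = 0.5 * h                # repeated multiplication by the recurrent weight
```

The surviving `c` versus the vanished `h` is the whole argument for the additive cell path in one loop.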
Okay, it's not the only way that you can do long-distance information flow, and we're going to look increasingly, in future lectures, at other ways you can do long-distance information flow. And just to give a bit of a peek at those now, and to think about other architectures... but is there a question? Yes: since you're mentioning that those plus signs help with vanishing gradients, does it help with exploding gradients at all, does it make it worse, or is there no difference? No, it also helps with exploding gradients, because you're not doing this sequence of multiplies all the time; you have this addition operator. So, one thing you could wonder is whether vanishing and exploding gradients are just a recurrent neural network problem. And they're not. I mean, it occurs earlier and worse when you've got long sequences, but if you start building a
[00:44:45] very deep neural network, surely the same thing is happening. The parameters aren't the same, so it's not quite just raising one matrix to a power, but surely, depending on your matrices, you tend to have the same problem: either your gradients are disappearing or else they're exploding. And that's what people found, and that was part of the reason why, in the early days, people weren't very successful at building deep neural networks: they suffered from problems of this sort. If you had basically vanishing gradients in a deep neural network, you got very little gradient signal in the lower layers; therefore the parameters didn't really update; therefore your model didn't learn anything in the lower layers; therefore the network didn't work well. And that was part of why things were stuck in the days around the early 2000s.
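A toy numerical illustration of this vertical version of the problem: stack 50 one-unit tanh layers with a modest weight (0.8 here, chosen only for illustration) and multiply the local derivatives the way backpropagation would. Almost no gradient signal reaches the bottom:

```python
import math

# Backprop through a deep stack multiplies a local derivative at every layer.
# With tanh units and modest weights, each factor is below 1, so the gradient
# reaching the lowest layers shrinks geometrically (toy one-unit layers).
w = 0.8
x = 0.5
acts = []
for _ in range(50):               # forward pass through 50 stacked tanh layers
    x = math.tanh(w * x)
    acts.append(x)
grad = 1.0
for a in reversed(acts):          # chain rule: d tanh(z)/dz = 1 - tanh(z)^2
    grad *= w * (1.0 - a * a)
```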
Deep networks didn't work. And so there are other ways you can think about fixing that. One common way is to add more direct connections. The problem, when we went through our recurrent step, was that we had this in-between stuff of doing a matrix multiply and blah blah blah, and that caused indirectness and the possibility for things to either explode or vanish. Now, this network is drawn sort of upside down; I stole the picture from the paper, so we'll just have to deal with that: we're going downwards from here to the next layer. So rather than going through weight layers and weight layers, which will start to produce the same kinds of problems, what you can do is apply the same trick in a vertical network and say: well, look, I can also just carry the input around with an
[00:46:44] identity function and add it on here, and so then I've got this direct carrying of information. And that led to the residual network, which was what completely transformed computer vision models and made them much more learnable than pure networks that lack these residual connections. If you start heading down that path, you can think: well, why only provide these residual loops that take you one step? Maybe I could directly connect each layer to all the successive layers. And people played with that idea, and that led to the so-called DenseNet, where you have these kinds of skip connections linking to every other layer. A variant of the residual network (the ResNet), which was actually again introduced by Schmidhuber and students, was to say: well, rather than just directly adding in the input
[00:47:54] summed with the output of the neural network layer, maybe we'd be better off having gating, so that you're deciding, via gates, how much of the input to have skip around. And so that led to a variant, the Highway Net, where you've got gated residual networks. So, various ideas for doing that. I'm not going to say more about that right now; I want to skip ahead, do the rest of neural nets, and get on to machine translation. Okay. So once you have RNNs (where "RNN" includes LSTMs; normally, in practice, LSTMs), you can use them for anything else where you're doing sequences, and so there are lots of places they're used in NLP. If you want to assign words parts of speech, like nouns and verbs, that would commonly be done with a part-of-speech tagging LSTM. If you want to be assigning named entity labels, like location... right, I did this toy version where we were assigning a label to the
[00:49:03] middle of a window, but if you want to assign a label at each position, you can use an LSTM for named entity recognition. You can use an RNN as an encoder model for a whole sentence. So if we want to do sentiment classification, to see whether a piece of text is positive or negative, we can run an LSTM over it and then use this as a representation of the sentence, to work out whether it's a positive or negative piece of text. And the simplest way of doing that is to use the final hidden state, because, after all, that final hidden state is the hidden state you've gotten from having seen the entire sentence; use that, and then have a classification layer, a logistic regression, on top of it to give you positive or negative. In practice, though, people have found it's often better to use every hidden state and take some kind of mean or
element-wise max, and feed that in as the sentence encoding. You can also use RNNs for lots of other purposes where you're using them to generate text based on other information. So if you want to do speech recognition, or summarization, or machine translation, which we'll come to later, you can have an input source which you'll use to condition your network, and then you'll generate the speech recognition output or the machine translation, as we'll see later. And so we refer to those as conditional language models, because rather than just generating text starting from nothing, from a start token, we're generating it conditioned on some source of information. One other idea on what normally happens when people use these: I suggested that we could do this averaging at each position. If you think about these hidden state
[00:51:24] If you think about these hidden-state representations, the representation at "terribly" isn't only about the word "terribly": it has some information about what came before it, "the movie was terribly", but it has no information about what comes after it. You might think you'd like a representation of "terribly" that knows what came before it but also what came after it, and so people came up with the next obvious idea to deal with that, which was to build a bidirectional LSTM. You ran a forward LSTM, and then you started another LSTM, shown in that sort of greenish teal, and ran it backwards. Then you had a forwards and a backwards vector at each position, and you just concatenated them both, and then you had a two-sided context for a representation of word meaning. These networks were pretty widely used.
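A bidirectional run can be sketched with a toy recurrence. Everything here is illustrative: scalar hidden states, made-up fixed weights, and a plain tanh recurrence; a real bidirectional LSTM would use vectors, gates, and learned parameters.

```python
import math

def run_rnn(xs, w=0.5, u=0.3):
    """Toy RNN: scalar hidden state h_t = tanh(w*x_t + u*h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(xs):
    fwd = run_rnn(xs)                                  # left-to-right pass
    bwd = list(reversed(run_rnn(list(reversed(xs)))))  # right-to-left pass
    # Concatenate the forward and backward states at each position,
    # giving every position a two-sided context.
    return [(f, b) for f, b in zip(fwd, bwd)]

reps = bidirectional([1.0, -1.0, 0.5])  # one (fwd, bwd) pair per word
```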
[00:52:34] So we were running a forward RNN and a backward RNN and concatenating the states together, and those were commonly written like this, to suggest in a compact way that you're running a bidirectional RNN. These were very popular for language analysis; they weren't usable if you wanted to generate text, but people were using them in a lot of places as a representation. More recently, though, Transformer models have largely taken over from that. One more idea, which we'll see for machine translation: RNNs are sort of deep in the sense that they unroll over many time steps, but up until now they've only been shallow RNNs in the sense that we just had one hidden state. You can also make them deep by having multiple layers of hidden states, which is commonly called stacked RNNs, so you'd have several layers of RNNs built above each other.
[00:53:44] And you might wonder, does this really do anything, or are they just big vectors above the words? But precisely because you have this extra neural network layer between here and here, you get exactly the same power advantage you get elsewhere with neural networks: you can do successive layers of feature extraction, and so you get more power out of your neural network, to some extent. What people found with RNNs in those days is that having multiple layers definitely helps, but unlike what was happening in those days with other kinds of neural networks, for vision etc., people still used relatively shallow RNNs: you always got a lot of gains by having two layers rather than one, but it was commonly more iffy whether you got extra value from three or four layers.
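The stacking idea can be sketched with the same kind of toy recurrence: each layer reads the hidden states of the layer below as its inputs. The scalar states and fixed weights here are purely illustrative, not a real stacked LSTM.

```python
import math

def run_layer(inputs, w=0.5, u=0.3):
    """One toy RNN layer: h_t = tanh(w * input_t + u * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def stacked_rnn(xs, num_layers=2):
    """Stacked RNN: layer k consumes the hidden states of layer k-1."""
    seq = xs
    for _ in range(num_layers):
        seq = run_layer(seq)  # each pass adds one more layer of features
    return seq  # hidden states of the top layer, one per time step

top = stacked_rnn([1.0, -1.0, 0.5], num_layers=2)
```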
[00:54:51] So commonly people were running two- or three-layer LSTMs, and that's what people were using. But that's completely changed around in the world of Transformers, where nowadays people are building very deep Transformer networks for doing language understanding. Okay, but I should skip ahead and say a few words, before time runs out, about machine translation. Machine translation is one of the key natural language processing tasks, where we're translating sentences in one language into sentences in another language. We're starting off with a sentence in some language, here French, and what we want to do is output it in a different language, here English. Machine translation was actually where NLP started: in the early 50s there wasn't artificial intelligence yet, there wasn't a field of NLP yet, but people started to work on machine
[00:56:04] translation. The story of why people started to work on machine translation was essentially this: computers were first developed during the Second World War, and during the war computers were used for two things. One of them was calculating artillery tables, to work out what angle to put your gun at to get the shell to land in the right place; not very relevant to what we're doing. But the other thing computers were used for was code breaking. After the Second World War things moved very quickly into the Cold War, and there were concerns on both sides about keeping up with the science that was being developed on the other side, and people had the idea of, gee, maybe we could think of translation between languages as like code breaking. That thought occurred
[00:57:08] to important, relevant people and science funding agencies, and actually lots and lots of funding was poured into this idea of: can we use computers to do machine translation between languages? At the time, in the 50s, after some initial very impressive-looking cooked demos, it was basically a complete flop. There are lots of reasons why it was a complete flop. One was that people knew almost nothing about the structure of human languages; in particular, remember the Chomsky hierarchy I was mentioning the other day, and knowing about context-free languages: the Chomsky hierarchy hadn't even been invented yet, so the formal properties of languages hadn't been explored. But also, the computers that people had in the 1950s: the amount of computing power or memory or
[00:58:10] anything like that that those computers had in those days was laughable. These days the little power brick for your laptop has more computing power inside it than the big mainframe computers they were using back then. So basically people were only able to build very simple lexicons and rule-based substitution rules, nothing like the complexity of human languages, which people only gradually began to understand. But machine translation started to come alive in the 1990s and 2000s, once people started to build empirical models over lots of data, and the approach then was called statistical machine translation. When Google Translate was first introduced, it was sort of the big unveiling to the world of statistical phrase-based machine translation systems, where what you were doing was collecting a
[00:59:15] large amount of parallel data: text that has been translated from one language to another. Not for all languages, but for quite a few languages there are quite a few sources of parallel data: the European Union generates a huge amount of parallel data among European languages; there are places like Hong Kong where you get English–Chinese (well, a certain dialect of Chinese) parallel data; the UN generates a lot of parallel data. So people were getting sources of parallel data and trying to build models, and the way it was done was: based on that data, we're going to try and learn a probability model for translation, the probability of a translation given a source sentence. The way it was done at that time was breaking it down, using Bayes' rule, into two subproblems: the probability of the translation given the source is
[01:00:22] going to be the inverted probability of the source given the translation, times the probability of the translation. You could think that this makes it no simpler, because you've just reversed the order of x and y, but the reason it made things simpler, and people were able to make progress, was that the translation model was treated as a very simple model of how words tended to get translated into words in the other language. It didn't need to know anything about word order or the grammar structure of the other language; all of that was handled by this probability of y, which was a pure language model, as we've talked about before. So you could have a simple translation model which just said: if you see the word "homme" in French, you might want to translate it as "man" or "person", and put some probabilities on that.
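That division of labor can be shown with a tiny toy ranking. All numbers and vocabulary here are invented for illustration (the "homme" example is borrowed from above); note that the word-level translation model deliberately scores "man the" and "the man" the same, so only the language model supplies the preference for the fluent order.

```python
# Toy noisy-channel ranking: pick the translation y maximizing
# P(x | y) * P(y). All probabilities are made up for illustration.

# P(source | translation): a crude word-level translation model that
# knows nothing about word order, so both orders score the same.
translation_model = {
    ("l'homme", "the man"): 0.6,
    ("l'homme", "man the"): 0.6,
}

# P(translation): the language model supplies the fluency preference.
language_model = {"the man": 0.05, "man the": 0.0001}

def decode(source, candidates):
    """Return the candidate translation with the highest P(x|y) * P(y)."""
    return max(candidates,
               key=lambda y: translation_model[(source, y)] * language_model[y])

best = decode("l'homme", ["the man", "man the"])
```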
[01:01:22] And then most of the cleverness was in the language model, which was telling you what would be a good sentence in the target language. Okay, and that was important because translations get pretty complicated: you not only have to know how to translate words, and those translations of words vary in context, but you also get a lot of reordering of words in sentences. I'm not going to be able to spend a lot of time on this, but for a while here was my favorite example machine translation sentence. It's actually a translated sentence: the original comes from the book Guns, Germs, and Steel, if you're familiar with that, the book by Jared Diamond. This book was translated into Chinese, so here's a sentence from the book in Chinese. I guess in the 2000s decade I
[01:02:34] was involved in building statistical machine translation systems, and there was an MT evaluation that we did where our system did terribly on this sentence; I tried it out on Google Translate, and it also did terribly on this sentence. What the sentence should say is: "In 1519, 600 Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two-thirds of their soldiers in the initial clash." Here's what Google Translate said in 2009: "1519 600 Spaniards landed in Mexico millions of people to conquer the Aztec empire the first two-thirds of soldiers against their loss." Now, it's partly bad because the word choices in the translation aren't very good, but it's especially bad because it's just not able to capture and use the modification relationships of the sentence. So you
[01:03:43] know, here's the part of the Chinese that's saying "the Aztec Empire", and over there in orange is the "few million people", and in Chinese there's this explicit little character, 的 (de), which is saying that the stuff in orange modifies the stuff in green, which is what it should be in the correct translation, "Aztec Empire with a population of a few million". But Google Translate completely fails on that, and suddenly it's the millions of people who are going to be conquering the Aztec Empire. That's in some ways the worst thing happening here, though the "1519 600" isn't exactly a very good translation, and "the first two-thirds of soldiers against their loss" isn't very good either. So for a while I used to update this and see what happened: in 2013 it almost seemed like progress had been made, but
[01:04:49] by 2015 it had gone downhill, back to how it was before, so it just seemed like they got lucky in 2013 rather than the systems working any better. And indeed this seemed to be the problem: although some kind of progress had been made in machine translation, these systems just never really worked all that great. That led to this amazing breakthrough in 2014, where we moved to neural machine translation, and neural machine translation was much better. So what did we do in neural machine translation? We built a neural machine translation system as a single end-to-end neural network, and that's been a powerful idea in neural network systems in general, including in NLP: if we can just have a single big system and put a loss function at the end of it, then we can backpropagate errors right back down
[01:05:58] through the system. That means we're aligning all of our learning with the final task we want to do, and that's been very effective, whereas earlier models couldn't do that. We built it with a sequence-to-sequence model. That sounds like our LSTMs, but it means we're going to have two of them: one to encode the source sentence and one to produce the target sentence. That's what we're building. For the source sentence, here it says RNN, but let's just think LSTM, because that's what we're going to use in practice; it's much better. We're going to chunk through the source, encoding what we've read using an RNN. This RNN isn't going to output anything; we're just building up a hidden state that knows what's in the source sentence. So again, an encoding of the source sentence, and we're going to use that
[01:06:59] final hidden state to condition the decoder RNN, which is then going to generate the translation. The decoder RNN is also an LSTM, but it's an LSTM with different parameters: we're learning one LSTM with source-encoding parameters, and then for the other language we're learning a different LSTM whose parameters all know about the target language. So we give it a start token and say: feed in what you've encoded from the encoder RNN as your starting point; that'll count as the previous hidden state you're feeding into your LSTM. Then we generate the first word of the translation, and we copy that translated word down, using this as a generative model as I did last time, and we translate through: "he hit me with a pie". Okay, so does that model sort of make sense? Yeah? Okay.
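The runtime loop just walked through can be sketched as control flow. The encoder and decoder here are stubs (a canned next-word table standing in for a trained LSTM), and the French source words are only placeholders; the point is the shape: encode once, then repeatedly feed each generated word back in until an end token.

```python
def encode(source_words):
    """Stub encoder: pretend the final hidden state is the source itself."""
    return tuple(source_words)

NEXT_WORD = {  # canned decoder behavior standing in for a trained LSTM
    "<s>": "he", "he": "hit", "hit": "me",
    "me": "with", "with": "a", "a": "pie", "pie": "</s>",
}

def decoder_step(hidden, prev_word):
    """Stub decoder step: returns (new hidden state, next word)."""
    return hidden, NEXT_WORD[prev_word]

def translate(source_words, max_len=20):
    hidden = encode(source_words)   # condition on the encoded source
    word, output = "<s>", []
    for _ in range(max_len):
        hidden, word = decoder_step(hidden, word)
        if word == "</s>":          # stop at the end token
            break
        output.append(word)         # feed this word back in next step
    return output

translation = translate(["il", "a", "m'entarté"])  # placeholder source
```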
[01:08:09] So, there's a note, sorry; what I was going to say, yeah, the little pink note here: what I was showing you is the picture of using it at runtime. At runtime we're going to encode the source and then generate the words of the translation. At training time we're going to have parallel text, sentences with their translations. We run the same architecture, but as before, for the decoder network we try to predict each word and then ask: what probability did you assign to the actual next word? That gives us a loss, and we'll be calculating the losses at each position, working out the average loss, working out the gradients, backpropagating them through the entire network, both the decoder RNN and the encoder RNN, and updating all the parameters of our model.
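The per-position losses described here can be sketched numerically. The per-step probabilities below are invented; in a real system they would come from the decoder's softmax over the vocabulary, and the gradients of this average loss would be backpropagated through both networks.

```python
import math

def sequence_loss(step_probs):
    """Average negative log-probability the model assigned to each
    actual next word, one probability per target position."""
    return sum(-math.log(p) for p in step_probs) / len(step_probs)

# e.g. the decoder was fairly confident at most target positions:
loss = sequence_loss([0.9, 0.7, 0.8, 0.95])
```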
[01:09:13] That's the sense in which it's being trained end to end. Okay, so this is the general notion of an encoder-decoder model, which is a very general thing we use in all kinds of places: we have one network that encodes something, producing a representation which will then feed into another network that we'll use to decode something. Even when we go on to do other things, like use Transformers rather than LSTMs, we're still commonly going to use these kinds of encoder-decoder models, because if we want to do not only machine translation but other tasks, like summarization or text-to-speech or other things like that, we're going to be in this space of using encoder-decoder networks. Yeah? "What is the difference between this encoder-decoder model and just using a deeper neural network with more layers?" Um, well, a lot is sequenced, right, so it
has never been very Su you're meaning [01:10:26] it has never been very Su you're meaning like why don't you just build on top of [01:10:28] like why don't you just build on top of the source right um people have tried [01:10:32] the source right um people have tried that occasionally it's never been very [01:10:35] that occasionally it's never been very successful and I think part of the [01:10:37] successful and I think part of the reason is all of what I was trying to [01:10:39] reason is all of what I was trying to show before about all all of the word [01:10:41] show before about all all of the word order changes around a lot between [01:10:44] order changes around a lot between languages and if you're sort of um just [01:10:46] languages and if you're sort of um just trying to build stuff on top of the [01:10:49] trying to build stuff on top of the source sentence it's very hard to cope [01:10:52] source sentence it's very hard to cope with that in particular it's not even [01:10:55] with that in particular it's not even the case that the length stays the same [01:10:57] the case that the length stays the same right one of the big ways um in which [01:11:00] right one of the big ways um in which languages vary is what little words that [01:11:03] languages vary is what little words that they have right so that in English [01:11:05] they have right so that in English you're putting in a lot of these [01:11:06] you're putting in a lot of these auxiliary verbs and articles whereas [01:11:09] auxiliary verbs and articles whereas it's in Chinese you don't have any of [01:11:11] it's in Chinese you don't have any of those and so you're neither needing to [01:11:14] those and so you're neither needing to depending on Direction add a lot of [01:11:16] depending on Direction add a lot of words or subtract a lot of words which [01:11:18] words or subtract a lot of words which is very hard to do if you're sort of [01:11:20] is very hard to do if you're sort of building 
on top of the source of [01:11:22] building on top of the source of it ah is it quick uh yeah so left side [01:11:27] it ah is it quick uh yeah so left side is that b directional or just like a [01:11:29] is that b directional or just like a like the encoder um yeah so you you [01:11:33] like the encoder um yeah so you you totally think and it could be that the [01:11:37] totally think and it could be that the encoder is bidirectional and that might [01:11:40] encoder is bidirectional and that might be better um for the for the famous [01:11:43] be better um for the for the famous original instantiation of this that was [01:11:45] original instantiation of this that was done at Google they actually didn't make [01:11:47] done at Google they actually didn't make it bir directional so it was simply [01:11:49] it bir directional so it was simply taking the final hidden state but that's [01:11:52] taking the final hidden state but that's absolutely an alternative that you could [01:11:54] absolutely an alternative that you could do okay okay [01:11:58] um yeah so I sort of said it was um okay [01:12:03] um yeah so I sort of said it was um okay usable for lots of things okay um yeah [01:12:06] usable for lots of things okay um yeah so this is our um conditional language [01:12:09] so this is our um conditional language model um this so we're now kind of [01:12:12] model um this so we're now kind of directly calculating the probability of [01:12:15] directly calculating the probability of Y given X right that the decoder model [01:12:18] Y given X right that the decoder model is generating um uh language expression [01:12:22] is generating um uh language expression as a language model directly conditioned [01:12:25] as a language model directly conditioned on X um and so we train it with a big [01:12:29] on X um and so we train it with a big parallel Corpus um and that's the only [01:12:32] parallel Corpus um and that's the only case I'm going to talk about today 
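The encoder-decoder pattern described above can be sketched as a tiny schematic. The stand-in "networks" here are arbitrary toy functions, not LSTMs; the point is only the shape: encode the input down to a representation, then decode everything from that representation.

```python
class EncoderDecoder:
    """Generic encoder-decoder: one network produces a representation,
    a second network consumes it to produce the output."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def __call__(self, source):
        representation = self.encoder(source)   # e.g. a final LSTM hidden state
        return self.decoder(representation)     # e.g. generate target words

# Toy stand-ins: the encoder summarizes the input as a single number,
# the decoder expands that summary back into a sequence.
encode = lambda tokens: sum(tokens)
decode = lambda rep: [rep, rep + 1, rep + 2]

model = EncoderDecoder(encode, decode)
result = model([1, 2, 3])   # the decoder only ever sees the representation, 6
```

The same shape covers machine translation, summarization, and text-to-speech: only what the encoder reads and what the decoder emits change.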
[01:12:36] Recently there's been some interesting work on unsupervised machine translation, meaning that you've got only a little bit of information about how the languages relate and don't really have a lot of parallel text, but I'm not going to cover that today. So for training, we have paired sentences, we work out our losses on the predictions at each position, and then we work out our average loss and backpropagate it through a single system end to end, as described.

[01:13:15] In practice, when people built big machine translation systems, this was one of the places where it absolutely gave value to have multi-layer stacked LSTMs, and so typically people were building a model something like this: a multi-layer LSTM that's being used to encode and decode.
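The stacking itself can be sketched independently of the cell: each layer's output sequence becomes the next layer's input sequence. In this toy sketch the recurrent cell is a plain tanh RNN standing in for an LSTM cell, with made-up random weights; a real MT system would use LSTM cells and learned parameters.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_layer(xs, W_x, W_h):
    """Run one recurrent layer over a sequence; return the hidden state
    at every step. (Stand-in cell: tanh RNN, not a real LSTM.)"""
    h = [0.0] * len(W_h)
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]   # new state from input + old state
        outputs.append(h)
    return outputs

def stacked_rnn(xs, layers):
    """Multi-layer (stacked) RNN: layer k's outputs feed layer k+1."""
    seq = xs
    for W_x, W_h in layers:
        seq = rnn_layer(seq, W_x, W_h)
    return seq   # hidden states of the top layer at each position

random.seed(0)
d = 4
rand_mat = lambda: [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
layers = [(rand_mat(), rand_mat()) for _ in range(3)]        # three stacked layers
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(5)]  # 5-step input
top = stacked_rnn(xs, layers)   # top[-1] is what would seed the decoder
```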
[01:13:45] In my two minutes remaining, I just want to quickly say that building these neural machine translation systems was really the first big success of deep learning for natural language processing. In a sense, it depends on how you define what counts as language: if you look at the history of the renaissance of deep learning, the first place where deep learning was highly successful was speech recognition systems, the second place was object recognition in vision, and the third was building machine translation systems.

[01:14:32] So, Google had a big statistical machine translation system, and it was only in 2014 that people first built this sort of LSTM deep learning machine translation system, [01:14:53] but it was just so obviously good that in only two years it was deployed as the live system being used at Google. And it wasn't only Google: the new neural machine translation was just so much better than what had come before that within a couple of years absolutely everybody, both US companies and Chinese companies, Microsoft, Facebook, Tencent, Baidu, was using neural machine translation, because the systems were just much better. This was an amazing success, because statistical machine translation systems like the Google one were something that had been worked on for about a decade: hundreds of people had worked on them, there were millions of lines of code, and lots of hacks built in for particular languages and language pairs. Yet a simple, small neural machine translation system was able to work much better.

[01:16:00] There was an article published in the New York Times when it went live, which you can find at that link. It's a somewhat praising piece, where you could be a little more critical, but basically it talks about how the difference in quality was so obvious that everyone immediately noticed, even before Google had announced it: wow, machine translation's gotten so much better.

[01:16:31] Okay, so that's basically today. For today, we've learned that LSTMs are powerful: if you're doing something with a recurrent neural network, you probably want to use an LSTM. You should know about the idea of clipping your gradients. Bidirectional LSTMs are good when you've got an encoder, but you can't use them to generate new text. And encoder-decoder neural machine translation systems were great new technology that advanced the field.
Thank you.

================================================================================
LECTURE 007
================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 7 - Attention, Final Projects and LLM Intro
Source: https://www.youtube.com/watch?v=J7ruSOIzhrE

---

Transcript

[00:00:05] Okay, welcome everyone; we're into week four now. For today, what I want to do is, first of all, a couple more bits on machine translation, especially talking a little bit about evaluating machine translation. Then I want to spend a while on attention. Attention is a very fundamental concept of neural networks, which was originally developed in the context of machine translation; it's also then a very central concept when we're talking about Transformers, which we start talking about on Thursday.

[00:00:49] Okay, so getting straight into it: this is the picture that we saw towards the end of last time.
[00:00:59] This is how we were building a machine translation system: we're feeding a source sentence into a multi-layer LSTM and then flipping to turn the model into a decoder, with different parameters, which generates one word at a time to produce the translated sentence. So here I've got a German sentence, and it's produced an English translation that looks a pretty good one. But we're going to want a way of deciding whether we're producing good translations or not, and so we need some way to evaluate machine translation. Now, this is a complex area, because if you start poking around in the literature, people have proposed literally hundreds of different measures that could be used to evaluate machine translation systems; I've even written a couple of papers on it myself, so I've contributed to the problem.
[00:02:09] But by far the most common measure that you see to this day was essentially the first measure proposed for automatically evaluating machine translation, which is the BLEU measure. BLEU was said to stand for "bilingual evaluation understudy", though that went along with the fact that it was proposed by IBM ("Big Blue"), probably not a coincidence. Up until that point, the only way people had really used for evaluating translations was getting human beings to look at them and say how good a translation is, and that's still a gold-standard measure that is widely used, because many of the automatic measures have various kinds of biases and problems that keep human evaluation useful. But on the other hand, a lot of the time we'd like to iterate quickly on evaluations.
[00:03:13] We'd like to use evaluations in training loops and things like that. And the IBM people, with the BLEU paper, suggested: well, maybe we can come up with a halfway decent automatic method of evaluating translations. The idea they proposed was this: we're going to have one or more reference translations for a piece of text, which are human-written translations, and then we can score any automatic translation mainly on how often it has overlapping one-, two-, three-, and four-grams with one of the reference translations. The number four isn't special; you could have gone up only to three, or to five, but four was seen as a reasonable length. The more overlap you have, the better. We have a discussion of this evaluation in the assignment, so you can think about it a bit more, and I won't go through all the formulas right now, but that's most of it.
[00:04:19] And so here's a picture of how that looks. The original idea was that we should have several reference translations; then we'd get a machine translation, look at it, and try to find pieces of it in the reference translations. So we can certainly find the unigram "the"; we can't find "American" at all, but we can find "International Airport and its" in the second reference translation, so we're going to get a 4-gram match for that. Some stretches of this not-very-good translation miss entirely, but then you start to find other pieces that do overlap, and you use those to work out a score. The original idea was that you should always have multiple reference translations, so that you can sample the space of possible translations
[00:05:21] you can sample the space of possible translations and have reasonable [00:05:23] translations and have reasonable coverage in practice for what's been [00:05:25] coverage in practice for what's been done more recently it's not so uncommon [00:05:28] done more recently it's not so uncommon that people do this with one reference [00:05:30] that people do this with one reference translation and the argument then is [00:05:33] translation and the argument then is still on a kind of a probabilistic basis [00:05:36] still on a kind of a probabilistic basis the more often you have a good [00:05:37] the more often you have a good translation the more often you'll get [00:05:39] translation the more often you'll get matches and therefore your score will be [00:05:43] matches and therefore your score will be better yeah so um [00:05:47] better yeah so um why you know why did people come up with [00:05:51] why you know why did people come up with this and why is it still imperfect well [00:05:55] this and why is it still imperfect well the problem with translation is that [00:05:58] the problem with translation is that there isn't one right answer it's not [00:06:01] there isn't one right answer it's not like the kind of classification things [00:06:02] like the kind of classification things you see in machine learning where you [00:06:04] you see in machine learning where you show people a picture and the right [00:06:06] show people a picture and the right answer is to say this the class of this [00:06:10] answer is to say this the class of this object um is whatever labut or right dog [00:06:14] object um is whatever labut or right dog breeds or something right that for any [00:06:17] breeds or something right that for any sentence there are many different ways [00:06:19] sentence there are many different ways to translate it and you know translators [00:06:22] to translate it and you know translators can sit around and argue that oh this [00:06:24] can sit 
around and argue that oh this phrasing is a little bit nicer than this [00:06:25] phrasing is a little bit nicer than this phrasing blah blah blah but to a first [00:06:28] phrasing blah blah blah but to a first approximation you can translate sentence [00:06:30] approximation you can translate sentence in lots of ways um and those different [00:06:32] in lots of ways um and those different ways of translation can involve [00:06:34] ways of translation can involve different word orders so you can't [00:06:36] different word orders so you can't really sort of check the words off as [00:06:38] really sort of check the words off as you come down in the sentence and that's [00:06:41] you come down in the sentence and that's what motivated this idea of sort of [00:06:43] what motivated this idea of sort of matching engrams anywhere so you can get [00:06:46] matching engrams anywhere so you can get reasonable credit um for having the [00:06:49] reasonable credit um for having the right matches but you know nevertheless [00:06:52] right matches but you know nevertheless it's a pretty crude version of it right [00:06:57] it's a pretty crude version of it right um you know you can still get a poor [00:06:59] um you know you can still get a poor blue score for good translation just [00:07:01] blue score for good translation just because the words you chose didn't [00:07:03] because the words you chose didn't happen to match a reference translation [00:07:06] happen to match a reference translation and also you can get points for things [00:07:09] and also you can get points for things without really having a good translation [00:07:11] without really having a good translation at all right if you just have words that [00:07:13] at all right if you just have words that match even if they're having completely [00:07:16] match even if they're having completely the wrong role in the sentence you will [00:07:19] the wrong role in the sentence you will get some points but it's 
[00:07:21] But it's harder to get n-gram matches for larger n unless you're using words the right way. There's one other trick in the BLEU measure: there's a penalty for too-short system translations, because otherwise you could leave out everything difficult, only translate the easy part of the sentence, and then, for the bits you have translated, get a high score for the precision of those pieces. Okay, so you'll use BLEU when you're developing your MT systems for assignment three. So now that we have an evaluation measure, we can start looking at how well systems do on a BLEU score. BLEU scores are theoretically between 0 and 100, but you're never going to get to 100, because of the variation in how you can translate things.
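Putting the pieces just described together, a simplified single-reference BLEU can be sketched as follows. This is an illustration rather than the exact formula from the paper: it uses clipped n-gram counts, a geometric mean of 1- to 4-gram precisions, and the brevity penalty for too-short outputs, but skips the smoothing and corpus-level aggregation a real implementation would have. The example sentence is made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    1..max_n-gram precisions, times a brevity penalty for short outputs."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        p = overlap / max(sum(cand.values()), 1)
        if p == 0:
            return 0.0   # (real BLEU smooths zero counts; we just bail out)
        log_precisions.append(math.log(p))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the airport security was tightened after the incident".split()
perfect = bleu(ref, ref)                                   # identical output
short = bleu("the airport security was tightened".split(), ref)  # correct but short
```

A perfect match scores 1.0, while the truncated candidate has perfect n-gram precisions but is pulled down by the brevity penalty, which is exactly the "leave out everything difficult" loophole the penalty closes.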
translations [00:08:28] start to get to the 20s uh translations uh you can sort of understand what the [00:08:30] uh you can sort of understand what the source document was about um once you [00:08:33] source document was about um once you get into the 30s and 40s the [00:08:35] get into the 30s and 40s the translations are getting much much [00:08:37] translations are getting much much better [00:08:39] better um yeah so um statistical phrase-based [00:08:42] um yeah so um statistical phrase-based translation was pioneered um by IBM in [00:08:47] translation was pioneered um by IBM in the late 90s actually and was sort of [00:08:49] the late 90s actually and was sort of redeveloped in the 2000's decade and was [00:08:52] redeveloped in the 2000's decade and was what Google launched as Google Translate [00:08:54] what Google launched as Google Translate in the 2000th decade and it continued to [00:08:57] in the 2000th decade and it continued to be worked on for sort of the following [00:09:00] be worked on for sort of the following decade but there was basically a strong [00:09:03] decade but there was basically a strong sense um that progress in Translation [00:09:07] sense um that progress in Translation had doing statistical phrase based [00:09:09] had doing statistical phrase based systems had basically stalled that it [00:09:12] systems had basically stalled that it got a little bit better each year as [00:09:15] got a little bit better each year as people could build traditional Ingram [00:09:17] people could build traditional Ingram language models with more data every [00:09:19] language models with more data every year and things like that but the [00:09:21] year and things like that but the numbers were barely going upwards um so [00:09:25] numbers were barely going upwards um so in the years from about 2005 [00:09:29] in the years from about 2005 to 15 or maybe 14 the dominant idea in [00:09:34] to 15 or maybe 14 the dominant idea in the machine 
[00:09:37] in the machine translation community was that the way we were going to get better machine translation was doing syntax-based machine translation: if we actually knew the structure of sentences, and we'd parsed them, then we'd know what the role of words was in sentences, and then we'd be able to translate much better. This was particularly invoked by looking at languages where translation worked terribly. In those days, translation worked sort of okay for languages like French to English or Spanish to English, which are reasonably similar European languages, but the results were way worse for Chinese to English or German to English; even though English is a Germanic language, German has a very different word order to English, with verbs commonly at the end of a clause and different elements being fronted. And so people tried to work on
[00:10:41] grammar-based, syntax-based methods of statistical machine translation, and I was one of those who worked on them in the late 2000s. But the truth is, it sort of didn't really work: if the rate of progress in syntax-based machine translation had slightly more slope than phrase-based machine translation over those years, the amount of slope wasn't very much. Things were then completely thrown on their head when neural machine translation got invented, because, as I explained, the first attempts were in 2014; the first cases in which it was evaluated in bake-off evaluations were in 2015, and in 2015 it wasn't as good as the best other machine translation methods, but by 2016 it was, and it was on this much, much steeper slope of getting way, way better. This graph only goes up to 2019,
[00:11:47] but it's continued to go up, and so it's not that uncommon these days that you see BLEU numbers in the 50s and 60s for neural machine translation systems. So that's a good news story. So after this I want to go on and introduce this idea of attention, which is now a very fundamental, important idea in neural systems. It's also interesting because it's actually something novel that was invented kind of recently. Everything that we've done in neural networks up until now really had all been invented before the turn of the millennium, right? So basic feed-forward neural networks, recurrent neural networks, LSTMs, other things that we haven't yet talked about like convolutional neural networks: they were all invented last millennium. It was really a waiting game at that point until there was sufficient data and computational power for them really to
[00:12:54] show how good they were. But attention was something that actually got invented in 2014, in the origins of neural machine translation, and it proved to be a very transformative idea for making neural networks more powerful. So what motivated attention was looking at exactly this kind of machine translation problem: we're running our LSTM over the source sentence, and then we're using its hidden state as the previous hidden state that we're feeding into the generator LSTM for the target sentence. And what that means is that everything useful about this sentence has to be stuffed into that one vector. Well, that's maybe not so hard if you've got a four-word sentence, but maybe you've got a 40-word sentence out here, and it seems kind of implausible that it'd be a good idea to try to fit everything about that sentence into this one hidden state. And well,
[00:13:56] obviously there are crude solutions to this: you make the hidden state bigger and then you've got more representational space; you use a multi-layer LSTM and you've got more representational space. But it still seems a very questionable thing to do, and it's certainly not like what a human being does, right? If a human being is translating a sentence, they read the sentence and they've got some idea of its meaning, but as they start to translate, they look back at the earlier parts of the sentence and make use of that in their translation. So this doesn't seem like a very plausible model. The idea, then, should be that our neural network should be able to attend to different things in the source, so that it can get information as needed, looking back in the sentence. And so this is the idea of attention: on each step of the decoder we're going to
[00:14:57] insert direct connections to the encoder, so we can look at particular words in the sentence. So I've got a bunch of diagrams that go through what we do, and then after that I'll present the equations that go along with this. Okay, so once we're starting to translate, we've got a hidden state at the start of our generator, and then we're going to use this hidden state as our key to look back into the encoder to try and find useful stuff. So we're going to compare, in a way I'll make precise later, the hidden state with the hidden state at every position in the source sentence, and based on our comparisons we're going to work out an attention score: where should we be looking in the source sentence while generating, here, the first word of the translation? And so, based on these attention scores, we'll stick them into a softmax, as we commonly do, and we'll
[00:16:08] then get a probability distribution, or a weighting, over the different positions in the sentence. And then we will use this weighting to compute a representation based on the encoder, which is going to be a weighted average of the encoder hidden states. So in this particular case it'll be nearly entirely the representation above the first word, 'il', which means 'he' in French. So then we'll take that attention output and we'll combine it with the hidden state of our decoder, and we'll use both of them together to generate an output vector, which we stick through our softmax to generate a word as the first word of the translation, y1. And so then at that point we just repeat this over. So we then go on to generating the second word: we copy down the first word generated, start to generate the second word, and we work out attention at every
[00:17:23] position... oh, sorry, there's a little note there, a little fine point which maybe I won't deal with, but it points out that sometimes you also do things like stick the previous time step's attention output into the next step as an extra input, and we actually do that in (it should say) assignment three; that's buggy there. So there are other ways to use things, but I'll gloss over that. So we generate another word and we repeat over, and at each time step we're looking at different words in the source, and they will help us to translate the sentence. [Student question about the green vector.] Okay, so the green vector, the hidden vector of the decoder, is going to be used together with the hidden states, the hidden vectors of the encoder, one at a time, to calculate the attention scores. So the
[00:18:43] attention score at a position is going to be a function of the hidden state of the encoder at that position and the current hidden state of the decoder, and I'll explain exactly how in a moment. Thank you. Any other questions? Okay, well, so here it is in math. So we have encoder hidden states, which we're going to call h, and we have decoder hidden states, which we're going to call s, so they're something different. And at each point we're going to be at some particular time step t, so we'll be dealing with s_t. So to calculate the attention scores for generating the word for time step t, we're going to calculate an attention score for each position in the encoder. Okay, I'll discuss alternatives for this in a moment, but the very easiest way to calculate an attention score, which is shown here, is to take a dot product between the hidden state of the encoder and the current hidden
[00:20:11] state of the decoder, and so that's what we're showing here. So that will give us some dot-product score, which is just any number at all. Then the next thing we do is we stick those e_t scores into our softmax, and that gives us our probability distribution as to how much weight to put on each position in the encoder. And so then we calculate the weighted average of the encoder hidden states, which we're just doing with the obvious equation: we're taking the weighted sum of the hidden states of the encoder based on the attention weights. And then what we want to do is concatenate our attention output and the hidden state of the decoder, which gives us a double-length vector, and then we're going to feed that into producing the next word from the decoder. So typically that means we're
[00:21:19] multiplying that vector by another matrix and then putting it through a softmax to get a probability distribution over words to output, and choosing the highest-probability word. Okay, that makes sense, I hope. Yeah. Okay, so attention is great; inventing this idea was completely transformative. So the very first modern neural machine translation system was done at Google in 2014, and they used a pure but very large, very deep LSTM: it was an eight-layer-deep LSTM with a very large hidden state for the time, and they were able to get good results. But very shortly thereafter, people at the University of Montreal, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, did a second version of machine translation using attention, and with a much more modest compute budget, of the kind that you can afford in universities, they were able to get better results, because attention was their
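The decoder step just walked through can be sketched in NumPy. This is my own toy illustration, not code from the lecture or its assignments; the function name `attention_step` and all shapes are invented for the example:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(enc_states, s_t, W_out):
    """One decoder step of dot-product attention (hypothetical toy version).

    enc_states: (n, d)  encoder hidden states h_1..h_n
    s_t:        (d,)    current decoder hidden state
    W_out:      (2d, V) output projection onto a vocabulary of size V
    """
    e = enc_states @ s_t                   # scores: e_i = s_t . h_i, one per position
    alpha = softmax(e)                     # attention distribution over positions
    a_t = alpha @ enc_states               # attention output: weighted average of the h_i
    combined = np.concatenate([a_t, s_t])  # double-length vector [a_t; s_t]
    p_vocab = softmax(combined @ W_out)    # distribution over output words
    return alpha, p_vocab

# Tiny example: 3 source positions, hidden size 4, vocabulary of 5 words.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
s = rng.normal(size=4)
W = rng.normal(size=(8, 5))
alpha, p_vocab = attention_step(H, s, W)
```

Both `alpha` and `p_vocab` sum to one; a real decoder would pick the next word from `p_vocab` (greedily or with beam search) and repeat.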
[00:22:43] secret thing. So attention significantly improved NMT performance, and essentially every neural machine translation system since has used attention, like we've just seen. You know, it's more human-like, as I was indicating, because it's sort of what a human would do: you'd look back in the sentence to see what you need to translate. And it solves this bottleneck problem: you now no longer have to stuff all the information about the source sentence into one hidden state; you can have the whole of your representational space from your entire encoding and use it as you need it. It also helps with the vanishing gradient problem; this is connected to what I was saying last time when talking about residual connections. One way out of the vanishing gradient problem is to directly connect things, and this provides shortcut connections to all of the
[00:23:39] hidden states of the encoder. And another nice thing that attention does is it gives you some interpretability: by looking at where the model is attending, you can basically see what it's translating at different time steps, and that can be really useful. And so it's kind of like we can see what we're translating where, without explicitly having trained a system that does that. So for my little toy sentence here, 'he hit me with a pie': at the first position it was looking at the first word, 'il', 'he', which it translates. Then in French there's this verb 'entarter', to sort of pie somebody; I guess in English as well you can use 'pie' as a verb, right? So the 'a' is a sort of perfect-past auxiliary, so it's sort of like 'he has me pied' is what the French words are, one at a time. And so the 'hit' is
[00:24:47] already looking at the 'entarté'; then the 'me' is attending to the 'm'', which means 'me'; and then all of 'with a pie' is attending still to 'entarté', which is basically the right kind of alignment that you want for the words of a sentence. So that's pretty cool too. Okay, so up until this point I just said, oh, we could do a dot product, but in general there's more to it than that. So what we have is some values h1 to hn, and we have a query vector, and we want to work out how to do attention based on these things. So attention always involves computing some attention scores, taking the softmax to get an attention distribution, and then getting an attention output. But the part where there's variation is: how do you compute these attention scores? A number of different ways have been done for that, and I just want to go through that a little bit.
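Since the scoring step is the only part that varies, the general recipe can be pictured as one function with a pluggable score function. A hypothetical NumPy sketch (the names `attention`, `score_fn`, and the toy inputs are my own, not from the lecture):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention(values, query, score_fn):
    """Generic attention: scores -> softmax -> weighted average of the values.

    Only score_fn differs between the variants (dot-product,
    multiplicative, additive) discussed in the lecture.
    """
    scores = np.array([score_fn(query, h) for h in values])
    alpha = softmax(scores)   # attention distribution
    return alpha @ values     # attention output

# Dot-product scoring is the simplest choice of score_fn.
def dot_score(q, h):
    return q @ h

# With orthogonal values and a query strongly aligned with the first one,
# almost all of the weight lands on that first value.
H = np.eye(3)
q = np.array([10.0, 0.0, 0.0])
out = attention(H, q, dot_score)
```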
[00:26:00] So the simplest way, which I just presented, is this dot-product attention: we just take the hidden states and dot product the whole of them. And that sort of works, but it doesn't actually work great, and I discussed this a bit when talking about LSTMs last time, right? The hidden state of an LSTM is its complete memory, so it has to variously store lots of things in that memory. It's got to be storing information that'll help it output the right word; it has to be storing information about the future, about other things that you'll want to say given the kind of sentence context, grammar, and previous words you've said. You've sort of got all kinds of memory, and so it makes sense that some of it would be useful for linking up, for looking back, and some of it would be less useful. You sort of want to find the parts that
[00:27:06] are related to what you want to say immediately, not all the parts that do all of the rest of the future. So that suggested maybe you could do a more general form of attention, and so Thang Luong and me in 2015 suggested maybe we could introduce what we called bilinear attention, which I still think is a better name, but the rest of the world came to call it multiplicative attention. What we're doing is, between these two vectors, we're sticking a matrix, and we're then learning the parameters of this matrix just like everything else in our neural network. And so effectively this matrix can learn which parts of the generator hidden state you should be using to look for things in the hidden states of the encoder. In particular, it no longer requires that things have to match up dimension by dimension. You know, it could be the case that the encoder is
[00:28:19] storing information about word meaning here, and the decoder is storing information about word meaning over here, and by learning appropriate parameters in this matrix we can sort of match those together and work out the right place to pay attention. So that seemed kind of a cool approach to us. Yeah? [Student: why not go all in and even build a little neural network that takes them as input and outputs a score?] You can do that; I was going to get to that on the next slide. Actually that's in a way sort of going backwards, but I will get to it on the next slide. Before I do that, though, I will show you these other versions. So the one thing you might wonder about doing it this way is that there are a lot of parameters that you have to learn in the matrix W. There aren't that many in my example, because there are only 36, but that's because my hidden states are only
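A minimal sketch of multiplicative (bilinear) scoring, with toy shapes of my own choosing. Note that the encoder and decoder hidden sizes no longer need to match, since the learned matrix W bridges them:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Toy sizes (mine, not the lecture's): 5 source positions; encoder and
# decoder hidden sizes differ, which plain dot-product scoring cannot handle.
rng = np.random.default_rng(1)
n, d_enc, d_dec = 5, 6, 4
H = rng.normal(size=(n, d_enc))      # encoder hidden states
s = rng.normal(size=d_dec)           # current decoder hidden state
W = rng.normal(size=(d_dec, d_enc))  # learned bilinear ("multiplicative") matrix

# Bilinear scores e_i = s^T W h_i, computed for all positions at once.
e = H @ W.T @ s
alpha = softmax(e)   # attention distribution over the 5 positions
a = alpha @ H        # attention output
```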
[00:29:25] that's because my hidden states are only of length six right and if your hidden [00:29:28] of length six right and if your hidden states are of length a thousand say then [00:29:31] states are of length a thousand say then you've got a million parameters in that [00:29:34] you've got a million parameters in that um W Matrix and that seems like it might [00:29:37] um W Matrix and that seems like it might be kind of problematic and so the way to [00:29:43] be kind of problematic and so the way to get beyond that which was fairly quickly [00:29:46] get beyond that which was fairly quickly suggested thereafter is well maybe [00:29:48] suggested thereafter is well maybe rather than having that whole big Matrix [00:29:50] rather than having that whole big Matrix in the middle instead what we could do [00:29:53] in the middle instead what we could do is form it as a low rank Matrix and the [00:29:57] is form it as a low rank Matrix and the easy way to make a low rank Matrix is [00:29:59] easy way to make a low rank Matrix is you take two skinny matrices like this [00:30:02] you take two skinny matrices like this where this is the rank of these of the [00:30:04] where this is the rank of these of the pieces and multiply them together which [00:30:07] pieces and multiply them together which would give us the big Matrix that I [00:30:09] would give us the big Matrix that I showed on the last slide and so this [00:30:12] showed on the last slide and so this gives you a low parameter um version of [00:30:16] gives you a low parameter um version of um the um bilinear attention Matrix from [00:30:19] um the um bilinear attention Matrix from the last slide but at that point if you [00:30:23] the last slide but at that point if you just do a teeny bit of linear algebra [00:30:27] just do a teeny bit of linear algebra this computation [00:30:28] this computation is exactly the same as saying well what [00:30:31] is exactly the same as saying well what I'm going to do is 
[00:30:33] take each of these two vectors and project them to a lower-dimensional space using this low-rank transformation matrix, and then take the dot product in this low-dimensional space. And on Thursday, when you get to Transformers, what you will see is that this is what Transformers do: they're taking the big vector and projecting it to a low-dimensional space, and then taking dot-product attention in that low-dimensional space. Okay, back to the question. Yeah, you're totally right, and at this point I'm going in a sort of ahistorical manner, because actually the first form of attention that was proposed, in the Bahdanau et al. paper, was: hey, let's just stick a little neural net there to calculate attention scores. So we take the s and the h, we multiply them both by a
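Stepping back to the low-rank point for a second, the "teeny bit of linear algebra" can be checked numerically. In this toy sketch (sizes are my own, echoing the hidden size of a thousand), a bilinear score through W = U^T V equals a dot product of the two projected vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 1000, 64              # hidden size 1000, low rank 64
s = rng.normal(size=d)       # decoder hidden state
h = rng.normal(size=d)       # encoder hidden state
U = rng.normal(size=(k, d))  # two "skinny" matrices: W = U^T V has
V = rng.normal(size=(k, d))  # 2*k*d = 128,000 parameters, not d*d = 1,000,000

# Score through the full (but low-rank) matrix W = U^T V ...
score_full = s @ (U.T @ V) @ h
# ... equals projecting both vectors down to k dimensions and
# taking the dot product in that low-dimensional space.
score_projected = (U @ s) @ (V @ h)
```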
That goes through a tanh; we multiply the result by a vector, and we get a number. This looks just like the kind of computations we've used everywhere else in an LSTM. So there's a little neural net that's calculating the attention scores, and then they go into a softmax, as usual. In most of the literature this is called additive attention, which also seems to be a really weird name; I think saying you've got a little neural net makes more sense for that one. But anyway, this is what they proposed and used, and at this point it's a little bit complex, to be honest. So when we wrote our paper the next year, we had found that the bilinear attention worked better for us, but there was subsequent work, especially a massive exploration of neural machine translation architectures.
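A hedged sketch of that additive (Bahdanau-style) scoring, with made-up names and sizes: multiply s and each encoder state by a matrix, add, take a tanh, dot with a vector to get one score per position, then softmax as usual.

```python
import numpy as np

def additive_attention_weights(s, H, W1, W2, v):
    """Additive attention: score_i = v . tanh(W1 s + W2 h_i), softmaxed over i."""
    pre = np.tanh(s @ W1.T + H @ W2.T)   # (T, k): a little neural net per position
    scores = pre @ v                     # (T,): one number per encoder state
    e = np.exp(scores - scores.max())    # softmax, as usual
    return e / e.sum()

rng = np.random.default_rng(1)
d, k, T = 8, 5, 4                        # illustrative sizes
weights = additive_attention_weights(
    rng.standard_normal(d),              # decoder state s
    rng.standard_normal((T, d)),         # encoder states H
    rng.standard_normal((k, d)),         # W1
    rng.standard_normal((k, d)),         # W2
    rng.standard_normal(k),              # v
)
assert np.isclose(weights.sum(), 1.0) and (weights > 0).all()
```

Compared with a projected dot product, each score here costs an extra matrix multiply plus a tanh, which is part of why it is slower.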
That work argued that actually, with the right kinds of good hyperparameter optimization, this is the best kind: this is better than the bilinear attention. But this is a lot more complex and a lot slower than doing what you're doing in the upper part of the chart, so regardless of whether it's better or not, in practice what's completely won is doing this, and this is what Transformers use, and just about all other neural nets that are used these days. [00:33:30] Okay, questions on attention will be found in assignment three, so I won't say much more about this now, and we'll see more of it just next lecture. But attention is a very general technique, right? It was a great way to improve machine translation, and that was how it was first invented, but you can stick it into all kinds of neural architectures, for all kinds of purposes.
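The general recipe, a set of values plus a query giving a softmax-weighted average, can be sketched as follows. This is an illustrative toy, not code from the course, and for simplicity the values here double as the keys they are scored by.

```python
import numpy as np

def attention(query, values):
    """Dot-product attention: softmax over scores gives a weighted average of values."""
    scores = values @ query              # (T,): relevance of each value to the query
    w = np.exp(scores - scores.max())    # softmax (shifted for numerical stability)
    w /= w.sum()
    return w @ values                    # (d,): weighted average of the values

rng = np.random.default_rng(2)
T, d = 5, 6
out = attention(rng.standard_normal(d), rng.standard_normal((T, d)))
assert out.shape == (d,)
```

In general you can score against separate key vectors instead of the values themselves; that query/key/value split is exactly what Transformers do.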
And the general finding was that it always improved results. So in general, anywhere where you have a vector of values and a query vector, you can use attention to get a weighted average of the values, which finds relevant information that you can use to improve your performance. [00:34:28] Maybe I won't even try to give examples of that now, but you'll see another example of attention immediately when we do things on Thursday, where we start doing self-attention inside Transformers. [00:34:46] Yes? "Did you also try a nonlinearity?" No, we did not. I mean, it didn't seem especially necessary, I don't know, but no, we did not. [00:35:07] Okay, well, this is the end of the part on attention. Are there any other questions? Yes? "For the RNN attention stuff, is there a need for positional information, or is that not required?"
[00:35:30] So there was none, and it seemed like it wasn't very required. I mean, you could make some argument that maybe position information might have been useful, but there's also a good argument that it wasn't necessary. The recent everywhere-usage of positional information only becomes necessary when you get to a Transformer, and the reason for that is, going back to the pictures: these encoder states are being calculated with respect to the previous encoder state, right, because it's a recurrent neural network, and therefore the representation here knows something about the past. So it kind of knows what position it's in, basically, and that's giving a lot of that information. Or another way to think about it is that this final representation will give a
certain overall sense of the semantics of the sentence, and so, to the extent that you're looking backwards, the more associative matching of similar semantic content that's needed seems sufficient: you don't really need additional positional information. [00:37:02] Okay, I will go on. So that's the neural networks content for today, and for the remaining 39 minutes I want to talk about final projects, but also a bit about data, experiments, and things like that. [00:37:24] Okay, so this is a reminder on the class. We've got the four assignments, which are 48%, and then the big other part of what you need to do is the final project, which is 49%, almost completing things out except for the participation. And let me just give one note back on collaboration and the honor code. For final projects, it's quite usual that
people use all sorts of stuff that was written by other people. That's completely fine, we don't expect you to implement everything from scratch, but you must document what you're using: give references or URLs if you're using other people's code rather than writing your own. We do want to know what code you wrote yourself and what things you downloaded from PyPI. And in particular, in thinking about final projects, the question of interest for us is: what value-add did you provide? So you haven't done something great if you've downloaded a really good neural network and run it on some data and it produces really good results; that's not much value-add. If you want to have value-add in that context, you at least want to be doing something interesting about understanding why it works so well, what kind of examples it doesn't work well on, doing some thorough experimental
analysis. [00:39:00] Yeah, a couple of other points there. Okay, so for the final project for this class, there's a binary choice: you can either do our default final project, which I'll talk about more a bit later, or you can come up with your own final project, and I'll talk about that a bit too. We allow team sizes of one to three. The complicated thing that comes up... oh, actually, sorry, I should say the other point first. We generally encourage people to form teams: it means that you can do something more interesting, it's more motivational, you can make friends, whatever, so teams are good. On expectations for teams: our expectation is that a bigger team should be able to do proportionately more work, and so when we're grading things we expect to see more work from larger teams. Now, how this works out is kind
of, I will admit, a little bit complicated, because there's sort of a quality issue that's separate from the amount of work. The reality is that it's just always the case that several of the very best projects are one-person efforts, because they're just somebody who has a good idea and knows what they want to do and does it by themselves, and it is great. But there are also great multi-person projects as well. The point I'm making is: it kind of doesn't work if you're a one-person project and you attempt a huge amount of stuff and you can only get one-third of the way through it. That's not a good recipe for doing well in the final project. For any project, you really need to be completing something and showing something. But nevertheless, if you're one person and you can show
something kind of interesting, even if our reaction is "oh, this would have been much better if they'd shown it was better than this other kind of model" or "it would have been really nice if they'd run ablations to work things out", well, if you're one person we'll give you a bye and say, oh, but it was only one person. Whereas if you're a three-person team, and it seems like you obviously should have compared it to some other models and you obviously could have run it on some other datasets, then we'll feel like, well, as a three-person team they obviously should have done that, and therefore we should give them a less good score. That's how that is worked out. [00:41:57] The complication comes with other things people are doing at the same time. We allow people to do final projects that are shared with multiple classes, but the expectation is again that
you'll do more work. So if there are two of you who are using one project for both this class and CS231N, say, then it's sort of like it's a four-person project and you should be doing a lot of work for it. There are other cases: sometimes people have RAships, or they're PhD rotation students, or other things. If you're doing it for other things, we'd like you to tell us, and we expect you to be doing more work for it. [00:42:44] Okay. I'm very happy to talk to people about final projects, and I have been talking to people about final projects, but unfortunately there's only one of me, so I definitely can't talk to 500 people about final projects. So I do also encourage you to talk to all of the TAs about final projects. On the office hours page, under all of the TAs, there's some information about things that they know about, so if you know what your project is
about, you could at least try and find one of the most useful TAs, or just find a TA with a friendly face; whatever mechanism you use, talk to TAs about final projects. [00:43:25] Yeah, so the default final project. What it's going to be is this: BERT was a famous early Transformer, and we're going to be building and experimenting with a minimal BERT implementation. So if you do this, there's part of an implementation of BERT, and you're meant to finish it off, fine-tune it, and get some results on data for doing sentiment analysis. And then, basically, we want even the default final project to be an open-ended project where people can do different things, and so then there's lots of other ideas, or you can come up with your own, of ways you could extend this system and make it better, which might be with paraphrasing, contrastive learning, low-rank adaptation, something; you can do something, and that is your final project. [00:44:27] So why choose the default final project? If you haven't had much experience with research, you don't have any real idea of what you want to do for a final project, or you'd like something with clear guidance and a goal and a leaderboard (we provide a leaderboard for people doing the default final project, showing how good your performance is on the tasks we provide), then you can do the default final project. And honestly, I think for many people the best option is to do the default final project. From past performance, typically about half the students do the default final project, including some people who start off thinking "I'll do a custom final project" and then after a couple of weeks decide, huh, this makes no sense, what I was suggesting, it's not working
at all, I'm just going to abandon it and flip to the default final project. [00:45:27] Okay, but we also allow custom final projects, and there are good reasons to do custom final projects. If you have some topic or research idea that you're excited about (maybe you're even already working on it), or you want to try something different on your own, or you'd just like to have more of the experience of trying to come up with a research goal, finding the necessary data and tools, and starting from scratch, which is actually very educational if considerably harder, well then the custom final project is fine for you. [00:46:05] A restriction on topics: I think we'd already sort of signaled this on Ed. We insist for CS224N final projects that they have to substantively involve both human language and neural networks, because this is the NLP class, so we'd like people to know and learn something
about human language. I'm totally aware of the fact that you can use these same models for bioinformatics sequences, or music, or radar, whatever, but we'd like you to do something with human language for this class. That doesn't mean it has to be only about human language: people have done things like visual language models, or music and language, so it can have a combination of modalities, but it has to substantively, not completely trivially, involve human language. If you've got any questions about that, ask. And it also has to substantively involve neural networks, though again it doesn't have to be wholly about neural networks. If you've got some idea, thinking "oh, I think I could show using kernel machines that they work just as well as having multi-layer neural networks" or something like that, that's of course fine to do as
as well um [00:47:31] well um gamesmanship um yeah the default final [00:47:34] gamesmanship um yeah the default final project is more guided but it's not [00:47:37] project is more guided but it's not meant to be a complete Slackers ride [00:47:40] meant to be a complete Slackers ride we're um hoping that people do the same [00:47:42] we're um hoping that people do the same amount of work for either kind of [00:47:44] amount of work for either kind of project but on the other hand it does [00:47:46] project but on the other hand it does kind of give you sort of a clearer focus [00:47:49] kind of give you sort of a clearer focus and course of things to do but it is [00:47:52] and course of things to do but it is still an open-ended project um so you [00:47:57] still an open-ended project um so you know for both um default final projects [00:48:00] know for both um default final projects and custom final projects there are [00:48:02] and custom final projects there are great projects and there are not so [00:48:05] great projects and there are not so great projects um you know if anything [00:48:08] great projects um you know if anything there's a bit more variance in the [00:48:10] there's a bit more variance in the custom final project so you know the [00:48:12] custom final project so you know the path of success is not to do some try [00:48:16] path of success is not to do some try and do something for a custom final [00:48:18] and do something for a custom final project um that just looks really weak [00:48:21] project um that just looks really weak compared to people's default final [00:48:24] compared to people's default final projects um okay um you can get good [00:48:28] projects um okay um you can get good grades either way um we get give best [00:48:31] grades either way um we get give best project Awards um to both kinds of [00:48:34] project Awards um to both kinds of projects so yeah it's really not that [00:48:36] projects so yeah it's really not 
that there's some secret one you have to pick. [00:48:38] Computing. Yeah, so, to be honest, with the confessions right at the beginning: we're actually in a less good position for computing than we've been in recent years, and it's all OpenAI's fault... no, that's part of it. Up until and including last year, we had invariably managed to get very generous cloud computing giveaways from one or other cloud computing provider, which really provided a lot of computing support. But there's the great GPU shortage on at the moment, due to the great success of large language models, and it turns out that cloud compute providers just aren't being as generous as they used to be. And gee, I guess the AWS rep was pointing out that my course was their single largest grant of free GPUs last year, so it's getting harder to do. So really
people will have to patch things together more in many cases, and so we'll be relying on the ingenuity of students to be able to find free and cheap stuff. [00:50:01] So, Google is giving $50 of credit per person on GCP, which can be used for assignments 3 and 4 and the final project. On all the clouds, if you haven't used a cloud with an account before, you can usually get some free starter credits, which can be a useful thing. Then there are the Jupyter notebooks in the cloud. The most used one is Google Colab, which allows limited GPU use; it often tends to get tighter later in the quarter, so you might find it a good investment to not have a couple of lattes and pay 10 bucks a month to get Colab Pro, which gives you much better access to GPUs. But there are alternatives to that which you might also want to look at. So AWS provides a
notebook environment, SageMaker Studio Lab. [00:51:01] And Kaggle, also owned by Google, separately provides Kaggle notebooks, which actually commonly give you better GPU access than Google Colab provides, even though they're otherwise not as nice: Kaggle notebooks are just bare-bones Jupyter notebooks, whereas Colab has some fancier UI stuff grafted on. [00:51:26] Other possibilities: Modal is a low-priced GPU provider and allows a certain amount of free GPU usage a month, so that could be handy, and there are other lower-cost GPU providers, like Vast AI, which could be of relevance. [00:51:47] And then the other thing that I'll say more about in a minute is that, with the way things have changed with large language models, there are lots of projects you might want to do where you're not actually building models at all yourself, but you're wanting to
do experiments on [00:52:04] large language models, or you're wanting to do in-context learning with large language models, or other things of that sort. And then what you want is to have access to large language models, and in particular you probably want API access so you can automate things. [00:52:25] So another thing that we have been able to get, through the generosity of Together AI, is that Together AI is providing $50 of API access to large language models, which can actually be a lot. How much of a lot it is depends on how big a model you're using, [00:52:45] so something you should think about is how big a model you really need to use to show something. Because if you can run a 7 billion parameter language model on Together, you can put a huge number of tokens through it for 50 bucks, whereas if you want to run a much bigger model, then [00:53:03]
the number of tokens you can get [00:53:05] through it goes down by orders of magnitude. So that's good, and I mentioned some other ones; we've already put a whole bunch of documents up on Ed that talk about these different GPU options, so do look at those. [00:53:23] Okay, jumping ahead. The first thing you have to do is a project proposal. It's one per team, so I guess the first step is to work out who your team is. Part of the project proposal is giving us the details of your project, but there's another major part of it, which is writing a review of a key research paper for your topic. [00:53:53] For the default final project we provide some suggestions, but you can find something else if you've got another idea for how to extend the project; for your custom project you're finding your own. But what we want you to do is get some practice
at looking at a research paper: [00:54:09] understanding what it's doing, understanding what's convincing, what it didn't consider, what it failed to do. So we want you to write a two-page summary of a research paper, and the goal is for you to be thinking critically about this research paper: [00:54:28] what did it do that was exciting, versus what did it claim was exciting but was really obvious or perhaps even wrong, etc. [00:54:42] Okay, and then after that we want you to say what you're planning to do. That may be very straightforward for a default final project, but it's really important for a custom final project. [00:54:58] In particular, tell us about the literature you're going to use, if any, and the kind of models you're going to explore. But it turns out that when we're unhappy with custom final projects, the
two commonest complaints about [00:55:17] what you tell us about custom final projects are, first, that you don't make clear what data you're going to use (we're worried already if you haven't worked out, by the project proposal deadline, what data you can use for your final project), and second, that you don't tell us how you're going to evaluate your system: we want to know how you're going to measure whether you're getting any success. [00:55:44] As a new thing this year, we'd like you to include an ethical considerations paragraph outlining potential ethical challenges of your work if it were deployed in the real world, and how those might be mitigated. [00:55:57] This is something that a lot of conferences are now requiring, and a lot of grants are requiring, so we want to give you a little bit of practice on that by having you write a paragraph. [00:56:09] How much there is to talk about varies
somewhat on what you're trying to do, and [00:56:14] whether it has a lot of ethical problems or whether it's a fairly straightforward question answering system, but in all cases you might think about what the possible ethical considerations of this piece of work are. [00:56:27] Okay. The whole thing is maximum four pages. [00:56:31] Okay, so for the research paper summary, do think critically. I mean, the worst summaries are essentially people just paraphrasing what's in the abstract and introduction of the paper, and we want you to think a bit harder about this: [00:56:59] what were the novel contributions of the paper? Is it something that you could use for different kinds of problems in different ways, or was it really exploiting a trick of one data set? [00:57:12] Are there things that it seemed like they missed, or could have done differently, or that you weren't convinced were
done properly? [00:57:22] Is it similar or distinctive to other papers that are dealing with the same topic? Does it suggest perhaps something that you could try that extends beyond the paper? [00:57:32] Okay, and for grading these final project proposals, most of the points are on that paper review, so do pay attention to it. There are some points on the project plan, but really we're mainly wanting to give you formative feedback on the project plan, and comments as to how we think it's realistic or unrealistic. [00:57:59] But nevertheless we're expecting you to have an idea, to have thought through how you can investigate it, and thought through how you can evaluate it: data sets, baselines, things like that. [00:58:13] Oh yeah, I should emphasize this: do have an appropriate baseline. For anything that you're doing, you should have something you can compare it against. [00:58:24] So sometimes it's a
previous system that [00:58:26] did exactly the same thing, but if you're doing something more novel and interesting, you should be thinking of some seat-of-the-pants obvious way to do things, and proving that you can do better than it. What that is depends a lot on what your project is, but if you're building some complex new net that's going to be used to work out textual similarity between two pieces of text, [00:58:54] well, a simple way of working out textual similarity between two pieces of text is to look up the word vectors for every word in each text, average them together, and work out the dot product between those average vectors. [00:59:06] And unless your complex neural network is significantly better than that, it doesn't seem like it's a very good system. So you should always attempt to have some baselines.
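That word-vector averaging baseline can be sketched in a few lines. This is a minimal illustrative sketch, not course starter code: the four-dimensional vectors below are made up for the example, and in practice you would load pretrained vectors such as GloVe.

```python
import numpy as np

# Toy word-vector table standing in for real pretrained vectors
# (the 4-d vectors are invented for this sketch; real ones are e.g. 100-300d GloVe).
VECTORS = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.1, 0.0, 0.3]),
    "dog": np.array([0.8, 0.2, 0.1, 0.3]),
    "sat": np.array([0.0, 0.7, 0.1, 0.0]),
    "ran": np.array([0.1, 0.8, 0.2, 0.0]),
}

def avg_vector(text):
    """Look up the vector for every known word in the text and average them."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    return np.mean(vecs, axis=0)

def similarity(text_a, text_b):
    """Dot product of the averaged vectors, normalized (i.e. cosine similarity)."""
    a, b = avg_vector(text_a), avg_vector(text_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar sentences should score higher than dissimilar ones:
print(similarity("the cat sat", "the dog ran"))
print(similarity("the cat sat", "sat"))
```

Normalizing the dot product into a cosine, as here, just removes the effect of vector length; either version works as the sanity-check baseline described in the lecture.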
[00:59:23] After the project proposal we also have a project milestone, stuck in the middle, to make sure everybody is making some progress. This is just to help make sure people do get through things and keep working, so we'll have good final projects. [00:59:38] For most final projects (I'll say more about this in a minute), the crucial thing we expect for the milestone is that you've got set up and you can run something. It might just be your baseline of looking up the word vectors, but it means you've got the data and the framework and something that you can run and produce a number from. [01:00:01] And then there's the final project. We have people submit their code for the final projects, but final projects are evaluated almost entirely, unless there are some major worries or concerns, based on your project report. [01:00:20] So make sure you put time into the project report, which is essentially
a research paper, like a [01:00:31] conference paper. They can be up to eight pages, and it varies with what you're doing, but this is typically the kind of picture of what papers will look like: it'll have an abstract and introduction, it'll talk about related work, it'll present the model you're using, the data you're using, and your experiments and their results, and it'll have some insightful comments in its analysis and conclusion at the end. [01:00:57] Okay, finding research topics. For custom projects there are all kinds of things you can do. Basic philosophy of science: you're normally either starting off with "here's some problem I want to make some progress on", or "here's this cool idea for a theoretical technique, or a change in something, and I want to show it's better than other ways of doing it", and you're working from that. [01:01:26] We allow different kinds of projects. One common type of project
is [01:01:32] that you've got some task of interest and you're going to try and solve it, or make progress on it somehow: say you want to get information out of State Department documents, and you're going to see how well you can do it with neural NLP. [01:01:48] A second kind is that you've got some ideas for doing something different with neural networks, and then you're going to see how well it works. [01:02:00] Or maybe, given that there are large language models these days, you're going to see how, using large language models, you can do something interesting by in-context learning or by building a larger language model program. [01:02:12] So nearly all 224n projects are in those first three types, where at the end of the day you've got some kind of system and you've got some kind of data and you're going to evaluate it. But that's not a 100% requirement; there are different kinds of projects
you can do, and a few people do. [01:02:37] So you can do an analysis or interpretability project. You could be interested in something like: how could these Transformer models possibly understand what I say to them and give the right answers to my statements? Let me try and look inside the neural networks and see what they're computing. [01:03:00] Recently there's been a lot of work on this topic, often under titles like mechanistic interpretability, circuits, and things like that. So you can do some kind of analysis or interpretability project, or you could even do it just looking at the behavior of models on some task. [01:03:20] So you could take some linguistic task, like metaphor interpretation, and see which neural networks can interpret metaphors correctly and which can't, or which kinds they can interpret correctly or not, and do things like that. [01:03:38] Another kind is a
theoretical project. [01:03:42] Occasionally people have done things looking at the behavior of, well, what's a good example, somewhere that's in the math. So an example that was actually done a few years ago and turned into a conference paper was looking, in the estimation of word vectors, at the stability of the word vectors that were computed by different algorithms, word2vec versus GloVe, [01:04:16] and deriving results with proofs about the stability of the vectors that were calculated. So that's allowed, but we don't see many of those. [01:04:32] Here, very quickly, are sort of just some random things. A lot of past projects you can find on the 224n web page; you can just find different past years' reports, and you can look at them to get ideas as you wish. [01:04:49] So Deep Poetry was a gated LSTM, where the idea was that as well as being a language
model that generated a succession [01:05:01] of words, it had extra stuff in it to make it rhyme in a poetry-like pattern. That was kind of fun. [01:05:08] You can do a reimplementation of a paper that was done previously. This is actually kind of an old one, but I remember it well: back in the days before Transformers, DeepMind did these kinds of interesting papers on neural Turing machines and differentiable neural computers, but they didn't release implementations of them. [01:05:34] And so Carol set about writing her own implementation of a differentiable neural computer, which in a way was a little bit crazy, and a few days before the deadline she still hadn't gotten it working, so it could have been a complete disaster. But she did get it working before the deadline, and got it to run, producing some interesting results. So that was kind of cool. [01:05:59] So if it's
something interesting, it [01:06:02] doesn't have to be original; it can be a reimplementation of something interesting. [01:06:08] Okay. Sometimes such papers do get published later as interesting ones. This was a paper that was again from the early days, and it was fairly simple, but it was a novel thing that gave progress. The way we've presented these RNNs, you have word vectors at the bottom and then you compute the softmax at the top. [01:06:31] But if you think about multiplying by the output matrix and then putting that into the softmax, that output matrix is also like a set of word vectors, because you have a column for each word, you get a score for each output word, and then you're putting a softmax over that. [01:06:52] And so their idea was, well, maybe you could share those two sets of vectors, and you'd be able to get improvements from that.
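The vector-sharing idea described here (tying the input embedding matrix to the output softmax matrix) can be shown with plain numpy. This is a toy illustration with made-up dimensions and a stand-in "RNN" step, not the model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4

# One matrix E serves both roles: row lookup gives the input word vector,
# and the same matrix projects the hidden state to per-word output scores.
E = rng.normal(size=(vocab_size, d))

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(word_id, hidden_weights):
    h = np.tanh(hidden_weights @ E[word_id])  # toy hidden-state update from the input embedding
    logits = E @ h                            # tied output layer: scores against the same word vectors
    return softmax(logits)

W = rng.normal(size=(d, d))                   # toy recurrent/hidden weights
p = next_word_probs(3, W)
print(p.shape, p.sum())                       # one probability per vocabulary word
```

Because one matrix serves both roles, the model has half the embedding parameters of the untied version, and each output score is a dot product against the same word vector used at the input.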
[01:07:06] Okay, maybe I won't talk about that one. Sometimes people have worked on quantized models. That's more of a general neural network technique, but providing you show you can do useful things with it, like getting good language modeling results even with quantized vectors, we'll count that as using language. [01:07:28] So in recent times (these last two are from 2024), a lot of the time people are doing projects with pre-trained large language models, which we will be talking about in the next three lectures, and then doing things with them. So you can use lightweight parameter-efficient fine-tuning methods, you can use in-context learning methods, and things like this, [01:07:52] and I suspect that probably quite a few of you will do projects of this kind. So here's an example: lots of work has been done
on [01:08:05] producing code language models, and so these people decided to improve the generation of Fortran (maybe they're physicists, I don't know). [01:08:22] And they were able to show that they could use parameter-efficient fine-tuning to improve Code Llama for producing Fortran. Where's the natural language? Code has natural language comments in it, and the comments can be useful for explaining what you want the code to do, and so it was effectively doing translation from a human language explanation of what the code was meant to do into pieces of code. [01:08:59] Here was another one, which was doing AI-driven fashion cataloging, transforming images into textual descriptions; that again was starting off with an existing visual language model and looking at how to fine-tune it.
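Parameter-efficient fine-tuning of the kind used in projects like the Code Llama one can be illustrated with a LoRA-style low-rank update. This is a hedged sketch, not that project's code: the dimensions, rank, and scaling factor below are illustrative, and the point is only that the big pretrained weight stays frozen while two small factors are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))        # pretrained weight: kept frozen
A = rng.normal(size=(rank, d_in)) * 0.01  # small trainable factor
B = np.zeros((d_out, rank))               # starts at zero, so the adapter begins as a no-op
alpha = 16.0

def forward(x):
    # Effective weight is W + (alpha/rank) * B @ A, but we never materialize it:
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(forward(x), W @ x))     # True before any training, since B is zero

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tune {full} ({lora / full:.1%})")
```

Training then updates only A and B, so the number of trainable parameters is a few percent of a full fine-tune, which is what makes adapting a large code model feasible on modest hardware or a credit budget.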
can get kind of [01:09:25] stuff um so you know you can get kind of lots of ideas of areas and things people [01:09:27] lots of ideas of areas and things people do by looking at past papers they you're [01:09:30] do by looking at past papers they you're also welcome to have your own original [01:09:32] also welcome to have your own original ideas thinking about anything you know [01:09:34] ideas thinking about anything you know or work on in the world so for NLP [01:09:37] or work on in the world so for NLP papers there's a site called the ACL [01:09:39] papers there's a site called the ACL Anthology that's good for them for there [01:09:42] Anthology that's good for them for there are lots of papers on language that also [01:09:45] are lots of papers on language that also appear in machine learning conferences [01:09:47] appear in machine learning conferences so you can look at the neurs or I clear [01:09:50] so you can look at the neurs or I clear proceedings you can look at past 2 24n [01:09:53] proceedings you can look at past 2 24n projects and then um the archive [01:09:56] projects and then um the archive pre-print servers got tons of papers on [01:09:59] pre-print servers got tons of papers on everything including NLP and you can [01:10:01] everything including NLP and you can look there but I do actually think it's [01:10:04] look there but I do actually think it's you know some of the funnest best [01:10:06] you know some of the funnest best projects are actually people that find [01:10:07] projects are actually people that find their own problem which is an [01:10:10] their own problem which is an interesting problem in their world you [01:10:12] interesting problem in their world you know if there's anything about a cool [01:10:14] know if there's anything about a cool website that has text on it and you [01:10:17] website that has text on it and you think you could kind of get information [01:10:18] think you could kind of get information out of 
automatically by using a language [01:10:21] out of automatically by using a language model or something there's probably [01:10:22] model or something there's probably something interesting and different you [01:10:24] something interesting and different you can do there um another place to look is [01:10:27] can do there um another place to look is that there are various leaderboards for [01:10:29] that there are various leaderboards for the state-ofthe-art on different [01:10:31] the state-ofthe-art on different problems and you can start looking [01:10:33] problems and you can start looking through leaderboards for stuff and see [01:10:36] through leaderboards for stuff and see what you find there um but you know on [01:10:39] what you find there um but you know on the other hand the disadvantage of [01:10:41] the other hand the disadvantage of looking at things like leaderboards and [01:10:43] looking at things like leaderboards and past conferences is you sort of uh tend [01:10:46] past conferences is you sort of uh tend to be trying to do a bit better on a [01:10:49] to be trying to do a bit better on a problem someone else has done and that's [01:10:51] problem someone else has done and that's part of um why you know really often in [01:10:54] part of um why you know really often in research it's a clever thing to think of [01:10:57] research it's a clever thing to think of something different perhaps not too far [01:10:59] something different perhaps not too far from things that other people have done [01:11:01] from things that other people have done but somehow different so you'll be able [01:11:04] but somehow different so you'll be able to do something a bit more original and [01:11:06] to do something a bit more original and different for what you're doing [01:11:09] different for what you're doing um yeah I mean I do just want to go [01:11:13] um yeah I mean I do just want to go through this a bit quickly um [01:11:17] through this a bit quickly um 
that you know um for sort of decade that [01:11:22] that you know um for sort of decade that I've been doing natural language [01:11:24] I've been doing natural language processing with deep learning there's [01:11:26] processing with deep learning there's sort of been a sea change in what's [01:11:30] sort of been a sea change in what's possible um so in the early days of the [01:11:32] possible um so in the early days of the deep learning um Revival you know most [01:11:37] deep learning um Revival you know most of the work in people's papers were [01:11:39] of the work in people's papers were trying to find better deep learning [01:11:42] trying to find better deep learning architectures so that would be here is [01:11:44] architectures so that would be here is some question answering system I've got [01:11:47] some question answering system I've got an idea of how I could add attention in [01:11:49] an idea of how I could add attention in some new place or I could add a new um [01:11:53] some new place or I could add a new um layer into the new network and the the [01:11:55] layer into the new network and the the numbers will go up um and um there were [01:11:58] numbers will go up um and um there were lots of papers like that and it was a [01:12:00] lots of papers like that and it was a lot of fun and that's what a lot of good [01:12:04] lot of fun and that's what a lot of good CS2 224n projects did too and people [01:12:08] CS2 224n projects did too and people were often able to build systems from [01:12:10] were often able to build systems from scratch that were close to the [01:12:12] scratch that were close to the state-of-the-art um but you know in the [01:12:15] state-of-the-art um but you know in the last five years your chances of doing [01:12:18] last five years your chances of doing this have been become pretty slim [01:12:22] this have been become pretty slim frankly um you know you can if you [01:12:24] frankly um you know you can if you really got 
a good idea and it's [01:12:26] really got a good idea and it's something different than original by all [01:12:28] something different than original by all means but it's kind of hard so most work [01:12:33] means but it's kind of hard so most work these days even for people who are [01:12:35] these days even for people who are professional [01:12:36] professional researchers that you know they're making [01:12:39] researchers that you know they're making use of existing large pre-trained models [01:12:44] use of existing large pre-trained models in some way and then once you're doing [01:12:47] in some way and then once you're doing that that actually sort of fixes a lot [01:12:49] that that actually sort of fixes a lot of your architectural choices because [01:12:52] of your architectural choices because your large pre- chain neural network has [01:12:54] your large pre- chain neural network has a certain AR architecture and you kind [01:12:56] a certain AR architecture and you kind of have to live with it you know you [01:12:58] of have to live with it you know you might be able to do interesting things [01:13:00] might be able to do interesting things by adapting it with something like low [01:13:02] by adapting it with something like low rank adaptation around the side or [01:13:04] rank adaptation around the side or something but nevertheless there's sort [01:13:06] something but nevertheless there's sort of constraints on what you want to do so [01:13:09] of constraints on what you want to do so you know for just about any practical [01:13:12] you know for just about any practical project like you've got some data set [01:13:14] project like you've got some data set and you want to understand it and get [01:13:17] and you want to understand it and get facts out of it or something like that [01:13:19] facts out of it or something like that essentially the only sensible choice is [01:13:22] essentially the only sensible choice is to say I am going to use 
hugging face um [01:13:25] to say I am going to use hugging face um Transformers um which we have a tutorial [01:13:28] Transformers um which we have a tutorial on coming up ahead and I will load some [01:13:31] on coming up ahead and I will load some pre-trained model and I will be running [01:13:34] pre-trained model and I will be running it over the text and then I'll be [01:13:35] it over the text and then I'll be working out some other stuff I can do on [01:13:38] working out some other stuff I can do on a top and around that so you know [01:13:40] a top and around that so you know building your own architecture is really [01:13:42] building your own architecture is really only a sensible choice if you can do [01:13:45] only a sensible choice if you can do something in the small which is more a [01:13:48] something in the small which is more a sort of exploring architectures project [01:13:51] sort of exploring architectures project if you've kind of got an idea of hey [01:13:54] if you've kind of got an idea of hey I've got an idea for different [01:13:55] I've got an idea for different nonlinearity that I think will work [01:13:57] nonlinearity that I think will work better than using a re let me [01:13:59] better than using a re let me investigate kind of thing because then [01:14:01] investigate kind of thing because then you can do small [01:14:03] you can do small experiments um yeah maybe I won't read [01:14:07] experiments um yeah maybe I won't read out all of this list but um there are [01:14:10] out all of this list but um there are lists of sort of some of the ideas of [01:14:12] lists of sort of some of the ideas of what's more interesting um now um but [01:14:16] what's more interesting um now um but you know do be cognizant of the world [01:14:20] you know do be cognizant of the world we're in in terms of scale I mean one of [01:14:24] we're in in terms of scale I mean one of the problems we now have is that people [01:14:28] the problems we now 
have is that people have seen the latest paper being pushed by DeepMind or whoever, doing some cool graph-structured reasoning search, and they turn up and say, I want to do this for my project. [01:14:42] But a lot of the time, if you read further into the paper, you'll find that they were doing it on 32 A100s for a month, and that's not the scale of computer that you're going to have available to you in almost all circumstances. Maybe there are one or two industry students who can do that; if so, go for it. But for the vast majority of people, not likely. [01:15:10] So you do have to do something that is practical. That practicality holds for the vast majority of people in the world, and if you look around in blogs and so on, you find lots of people doing stuff in lightweight ways and describing how to do it. That's why methods like parameter-efficient fine-tuning are really popular: you can do them in lightweight ways.

[01:15:36] A question related to that, and I'll end on this. I just want to mention again that you're welcome to use GPT-4 or Gemini Pro or Claude Opus or any of these models in your project, but it then has to be API usage; you can't possibly train your own big models. Even for the models that are available open source, you can't load the big ones into the kind of GPUs you have. You can probably load a Llama 7B model, but you can't just load a Llama 70B model into your GPU. So you have to be realistic about that size. [01:16:32] But there are actually now lots of interesting things you can do with API access, like in-context learning and prompting and exploring that, or building larger language model programs around these language model components, and you're certainly encouraged to do that. [01:16:52] There are lots of other things you can do, such as analysis projects, which look at: are these models still sexist and racist, do they have a good understanding of analogies, can they interpret love letters, or whatever your topic of interest is. Lots of things you can do, and that's totally allowed. [01:17:11] But again, remember that we'll be evaluating you on what interesting stuff you did. So your project shouldn't be: I ran this stuff through GPT-4 and it produced great summaries of the documents, I am done. The question is, what did you do in addition to that to have an interesting research project? Okay, I'll stop there. Thanks a lot.

================================================================================ LECTURE 008 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
Source: https://www.youtube.com/watch?v=LWMzyfvuehA
---
Transcript

[00:00:04] Hi everyone, welcome to CS224N. We're about two minutes in, so let's get started. [00:00:12] Today we've got what I think is quite an exciting lecture topic: we're going to talk about self-attention and Transformers. These are ideas that are the foundation of most of the modern advances in natural language processing, and actually of AI systems in a broad range of fields, so it's a very fun topic. [00:00:37] Before we get into that, a couple of reminders. There are brand-new lecture notes, and I'm very excited about them; they pretty much follow along with what I'll be talking about today, but go into considerably more detail. Assignment four is due a week from today.
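Since the headline topic announced above is self-attention, here is a minimal preview sketch of single-head scaled dot-product self-attention. This is a hedged illustration, not code from the course; the use of NumPy and all names and shapes are my own assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) pairwise scores
    A = softmax(scores, axis=-1)             # each row: weights over all positions
    return A @ V                             # weighted average of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 toy word vectors of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Note that every position attends to every other position in a single matrix multiply, with no step-by-step walk along the sequence; that contrast with recurrence is what the lecture develops.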
[00:01:11] four is due a week from today um yeah so the issues with Azure [00:01:13] um yeah so the issues with Azure continue [00:01:15] continue um thankfully thankfully [00:01:18] um thankfully thankfully um [00:01:19] um our uh uh Tas especially has tested that [00:01:23] our uh uh Tas especially has tested that this works on collab and the amount of [00:01:25] this works on collab and the amount of training is such that you know uh you [00:01:28] training is such that you know uh you know a collab session will allow you to [00:01:29] know a collab session will allow you to train uh your machine translation system [00:01:32] train uh your machine translation system so if you don't have a GPU use collab [00:01:34] so if you don't have a GPU use collab we're continuing to work on getting [00:01:36] we're continuing to work on getting access to more gpus for uh assignment [00:01:40] access to more gpus for uh assignment five in the final project uh we'll [00:01:42] five in the final project uh we'll continue to update you as we're able to [00:01:44] continue to update you as we're able to um but our you know are the usual [00:01:46] um but our you know are the usual systems this year uh are no longer [00:01:49] systems this year uh are no longer holding because companies are changing [00:01:51] holding because companies are changing their minds about things okay [00:01:53] their minds about things okay um so our final project proposal uh you [00:01:55] um so our final project proposal uh you have a proposal of what you want to work [00:01:58] have a proposal of what you want to work on for uh your final project we will [00:02:02] on for uh your final project we will give you feedback on whether we think [00:02:03] give you feedback on whether we think it's a feasible idea or how to change it [00:02:05] it's a feasible idea or how to change it so this is very important because we [00:02:07] so this is very important because we want you to work on something that we 
[00:02:08] want you to work on something that we think has a good chance of success for [00:02:09] think has a good chance of success for the rest of the quarter that's going to [00:02:11] the rest of the quarter that's going to be out tonight we'll have an ad [00:02:12] be out tonight we'll have an ad announcement when it is out [00:02:15] announcement when it is out um and we want to get you feedback on [00:02:16] um and we want to get you feedback on that pretty quickly uh because you know [00:02:19] that pretty quickly uh because you know you'll be working on this after [00:02:21] you'll be working on this after assignment five is done really the major [00:02:22] assignment five is done really the major core component of the course uh after [00:02:26] core component of the course uh after that is the um is the final project [00:02:29] that is the um is the final project okay any questions [00:02:32] cool okay [00:02:34] cool okay um [00:02:35] um okay so so let's let's kind of take a [00:02:38] okay so so let's let's kind of take a look back into what we've done so far in [00:02:41] look back into what we've done so far in this course and sort of see uh what you [00:02:45] this course and sort of see uh what you know what we were doing in natural [00:02:46] know what we were doing in natural language processing what was our [00:02:47] language processing what was our strategy if you had a natural language [00:02:48] strategy if you had a natural language processing problem and you wanted to say [00:02:50] processing problem and you wanted to say take like your best effort attempt at it [00:02:53] take like your best effort attempt at it without doing anything too fancy you [00:02:54] without doing anything too fancy you would have said okay I'm going to have [00:02:56] would have said okay I'm going to have you know a bi-directional lstm uh [00:02:59] you know a bi-directional lstm uh instead of a simple RNN right I'm going [00:03:01] instead of a simple RNN 
right I'm going to use an lstm uh to encode my sentences [00:03:04] to use an lstm uh to encode my sentences I get bi-directional context and um if I [00:03:07] I get bi-directional context and um if I have an output that I'm trying to [00:03:08] have an output that I'm trying to generate right I'll have like a [00:03:09] generate right I'll have like a unidirectional lstm you know that I was [00:03:13] unidirectional lstm you know that I was going to generate one by one so you have [00:03:14] going to generate one by one so you have a translation or a parse or whatever and [00:03:17] a translation or a parse or whatever and so maybe I've encoded in a [00:03:18] so maybe I've encoded in a bi-directional LCM The Source sentence [00:03:20] bi-directional LCM The Source sentence and I'm sort of you know one by one [00:03:22] and I'm sort of you know one by one decoding out the the target with my [00:03:24] decoding out the the target with my unidirectional LCM and then uh also [00:03:27] unidirectional LCM and then uh also right I was going to use something like [00:03:29] right I was going to use something like attention to give flexible access to [00:03:32] attention to give flexible access to memory uh if I you know felt like I [00:03:35] memory uh if I you know felt like I needed to do this sort of look back and [00:03:36] needed to do this sort of look back and see where I want to translate from okay [00:03:38] see where I want to translate from okay and this was just working uh [00:03:40] and this was just working uh exceptionally well and we we motivated [00:03:42] exceptionally well and we we motivated so you know attention through wanting to [00:03:45] so you know attention through wanting to do machine translation and you have this [00:03:46] do machine translation and you have this this bottleneck where you don't want to [00:03:48] this bottleneck where you don't want to have to encode the whole sentence Source [00:03:50] have to encode the whole sentence 
Source sentence in a single vector [00:03:52] sentence in a single vector okay and in this lecture we have the [00:03:54] okay and in this lecture we have the same goal so we're going to be looking [00:03:55] same goal so we're going to be looking at a lot of the same problems that we [00:03:57] at a lot of the same problems that we did previously but we're going to use [00:03:59] did previously but we're going to use different building blocks we're going to [00:04:01] different building blocks we're going to say [00:04:02] say um you know uh if if 2014 to 2017-ish I [00:04:06] um you know uh if if 2014 to 2017-ish I was using recurrence uh through lots of [00:04:08] was using recurrence uh through lots of trial and error years later uh it was we [00:04:11] trial and error years later uh it was we had these like brand new building blocks [00:04:12] had these like brand new building blocks that we could plug in sort of you know [00:04:14] that we could plug in sort of you know uh direct replacement for lstms and [00:04:17] uh direct replacement for lstms and they're going to allow for just a huge [00:04:20] they're going to allow for just a huge range of much more successful [00:04:22] range of much more successful applications and um and so what what are [00:04:25] applications and um and so what what are the what what are the issues with the [00:04:28] the what what are the issues with the recurrent neural networks we used to use [00:04:29] recurrent neural networks we used to use and what are the new systems that we're [00:04:31] and what are the new systems that we're going to use sort of from this point [00:04:32] going to use sort of from this point moving forward [00:04:34] moving forward okay so um so one of the issues with [00:04:36] okay so um so one of the issues with with a recurrent neural network uh is [00:04:39] with a recurrent neural network uh is what we're going to call linear [00:04:40] what we're going to call linear interaction distance so as 
we know uh [00:04:43] interaction distance so as we know uh you know [00:04:44] you know rnns are unrolled left to right or right [00:04:46] rnns are unrolled left to right or right to left depending on the language and [00:04:48] to left depending on the language and the direction okay but it encodes the [00:04:50] the direction okay but it encodes the sort of notion of linear locality which [00:04:52] sort of notion of linear locality which is useful because if two words occur [00:04:54] is useful because if two words occur right next to each other sometimes [00:04:56] right next to each other sometimes they're actually quite related so tasty [00:04:58] they're actually quite related so tasty Pizza they're nearby and in the [00:05:00] Pizza they're nearby and in the recurrent neural network right you sort [00:05:02] recurrent neural network right you sort of encode you know tasty and then you [00:05:04] of encode you know tasty and then you sort of walk one step and you encode [00:05:06] sort of walk one step and you encode Pizza [00:05:08] Pizza um so nearby words do often affect each [00:05:11] um so nearby words do often affect each other's meanings [00:05:12] other's meanings um but you know you have this this [00:05:14] um but you know you have this this problem where very long distance [00:05:15] problem where very long distance dependencies can take a very long time [00:05:18] dependencies can take a very long time to interact so if I have the sentence [00:05:20] to interact so if I have the sentence the chef [00:05:21] the chef so those are those are nearby those [00:05:22] so those are those are nearby those interact with each other [00:05:24] interact with each other and then uh who and then a bunch of [00:05:28] and then uh who and then a bunch of stuff like the chef who went to the [00:05:29] stuff like the chef who went to the stores and picked up the ingredients and [00:05:32] stores and picked up the ingredients and you know loves garlic [00:05:35] 
you know loves garlic um and then was right like I actually [00:05:37] um and then was right like I actually have an RNN step right this sort of [00:05:40] have an RNN step right this sort of application of the recurrent weight [00:05:42] application of the recurrent weight Matrix and some element-wise [00:05:44] Matrix and some element-wise non-linearities once twice three times [00:05:46] non-linearities once twice three times right sort of as many times as there is [00:05:48] right sort of as many times as there is potentially the the length of the [00:05:50] potentially the the length of the sequence between chef and was right and [00:05:54] sequence between chef and was right and it's the chef who was so this is a long [00:05:55] it's the chef who was so this is a long distance dependency should feel kind of [00:05:58] distance dependency should feel kind of you know related to the stuff that we [00:05:59] you know related to the stuff that we did in dependency syntax but you know [00:06:01] did in dependency syntax but you know it's quite difficult [00:06:03] it's quite difficult uh to learn potentially that these words [00:06:07] uh to learn potentially that these words should be related so if you have sort of [00:06:10] should be related so if you have sort of a lot of steps uh between [00:06:13] a lot of steps uh between uh between words [00:06:16] uh between words um [00:06:18] you know it can be difficult to learn [00:06:20] you know it can be difficult to learn the dependencies between them you know [00:06:22] the dependencies between them you know we talked about all these gradient [00:06:23] we talked about all these gradient problems lstms do a lot better at [00:06:25] problems lstms do a lot better at modeling the gradients uh across long [00:06:28] modeling the gradients uh across long distances than simple recurrent neural [00:06:31] distances than simple recurrent neural networks but it's not perfect [00:06:33] networks but it's not perfect um 
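To make "a lot of steps between chef and was" concrete, here is a hedged toy calculation (a scalar tanh RNN with made-up numbers, not anything from the lecture). By the chain rule, the gradient of a late hidden state with respect to an early one is a product of one factor per time step, so the learning signal between two words decays with the distance between them:

```python
# Toy scalar RNN: h_t = tanh(w * h_{t-1} + x_t).
# dh_T/dh_0 is a product of T Jacobian factors w * (1 - h_t^2),
# so the signal linking two distant words shrinks with their distance.
import math

def rnn_states(xs, w=0.5, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def grad_h_last_wrt_h0(xs, w=0.5):
    hs = rnn_states(xs, w)
    g = 1.0
    for h in hs[1:]:                 # chain rule through every time step
        g *= w * (1.0 - h * h)       # d tanh(z)/dz = 1 - tanh(z)^2
    return g

print(abs(grad_h_last_wrt_h0([0.1] * 3)))   # few words in between: noticeable
print(abs(grad_h_last_wrt_h0([0.1] * 30)))  # many words in between: vanishingly small
```

An LSTM's gating changes these per-step factors so the product decays far more slowly, which is why LSTMs help with long distances but, as the lecture says, don't fully fix the problem.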
[00:06:33] um and we already know sort of that this [00:06:35] and we already know sort of that this linear linear order isn't sort of the [00:06:37] linear linear order isn't sort of the right way to think about about sentences [00:06:40] right way to think about about sentences so if I wanted to learn that it's the [00:06:43] so if I wanted to learn that it's the chef who [00:06:45] chef who uh was then you know I might have a hard [00:06:49] uh was then you know I might have a hard time doing it because the gradients have [00:06:51] time doing it because the gradients have to propagate from west to Chef and you [00:06:54] to propagate from west to Chef and you know uh really I'd like more direct [00:06:55] know uh really I'd like more direct connection between words that might be [00:06:57] connection between words that might be related in the sentence or in a document [00:07:00] related in the sentence or in a document even right if these are going to get [00:07:01] even right if these are going to get much longer [00:07:03] um so so this is this linear interaction [00:07:05] um so so this is this linear interaction distance problem we would like words [00:07:07] distance problem we would like words that might be related to be able to [00:07:09] that might be related to be able to interact with each other in the neural [00:07:10] interact with each other in the neural networks computation sort of graph uh [00:07:13] networks computation sort of graph uh more easily than uh sort of being [00:07:16] more easily than uh sort of being linearly far away [00:07:19] linearly far away um yeah so that we can learn these long [00:07:21] um yeah so that we can learn these long distance dependencies better [00:07:22] distance dependencies better and there's a related problem too that [00:07:24] and there's a related problem too that again comes back to the recurrent neural [00:07:26] again comes back to the recurrent neural networks dependence on the index on the 
[00:07:29] network's dependence on the index into the sequence, often called a dependence on time. So in a recurrent neural network, the forward and backward passes have O(sequence length) many, that means in this case roughly sequence-length many, unparallelizable operations. We know GPUs are great: they can do a lot of operations at once, as long as there's no dependency between the operations in terms of time, where you have to compute one and then compute the other. But in a recurrent neural network you can't actually compute the RNN hidden state for time step 5 before you compute the RNN hidden state for time step 4, or time step 3. [00:08:08] And so you get this graph that looks very similar: if I want to compute this hidden state, I've got some word, and I have zero operations I need to do before I can compute this state; I have one operation I can do before I can compute this state; and as my sequence length grows, here I've got three operations I need to do before I can compute the state with the number three, because I need to compute this and this and that. So there are sort of three unparallelizable operations, where I'm glomming all the matrix multiplies and so on into a single one. So: one, two, three, and of course this grows with the sequence length. [00:08:47] As the sequence length grows, I can't parallelize; I can't just have a big GPU do the matrix multiply to compute this state, because I need to compute all the previous states beforehand. So these are two related problems, both with the dependence on time.
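The sequential dependence just described can be seen directly in code: a minimal Elman-style RNN forward pass in NumPy (the weights and sizes here are made-up toy values for illustration, not anything from the lecture), where each hidden state has to wait for the previous one, so the time loop cannot be parallelized.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Minimal RNN forward pass. Each h[t] needs h[t-1],
    so the loop over time is O(n) sequential, unparallelizable steps."""
    n, d = x.shape
    h = np.zeros((n, d))
    h_prev = np.zeros(d)
    for t in range(n):                      # must run in order: t depends on t-1
        h_prev = np.tanh(W_h @ h_prev + W_x @ x[t] + b)
        h[t] = h_prev
    return h

rng = np.random.default_rng(0)
d, n = 4, 6
h = rnn_forward(rng.normal(size=(n, d)),
                rng.normal(size=(d, d)) * 0.1,   # toy recurrent weights
                rng.normal(size=(d, d)) * 0.1,   # toy input weights
                np.zeros(d))
print(h.shape)  # (6, 4): one hidden state per time step
```

By contrast, the attention computations introduced next have no such loop over time.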
A question from the audience: on the linear interaction issue, I thought that was the whole point of the attention network, and for the cells that depend more on each other during training, can't we do something like attention and sort of work our way along? So the question is: with the linear interaction distance, wasn't this sort of the point of attention, that it gets around that? Can't we use something with attention to help? It won't solve the parallelizability problem, and in fact everything we do in the rest of this lecture will be attention-based, but we'll get rid of the recurrence and just do attention, more or less. So yeah, it's a great intuition. Any other questions? [00:09:51] Okay, cool. So, if not recurrence, what about attention? [00:10:00] We're going to get deep into attention today, but just for a second: attention treats
each word's representation as a query to access and incorporate information from a set of values. Previously, we were in a decoder, decoding out a translation of a sentence, and we attended to the encoder so that we didn't have to store the entire representation of the source sentence in a single vector. Today we'll think about attention within a single sentence. [00:10:28] So I've got a sentence written out here, word 1 through word T, and in the integers in the boxes I'm writing the number of unparallelizable operations you need to do before you can compute each state. For each word, you can independently compute its embedding without doing anything else first, because the embedding just depends on the word identity. And then with attention, if I want to build an attention representation of this word by looking at all the other words in the sequence, that's sort of one big operation, and I can do them in parallel for all the words: for the attention for this word, I don't need to walk left to right like I did for an RNN. [00:11:09] Again, we'll get much deeper into this, but you should have the intuition that it solves the linear interaction problem and the non-parallelizability problem, because now, no matter how far away words are from each other, I am potentially interacting: I might just attend to you even if you're very, very far away, independent of how far away you are, and I also don't need to walk along the sequence linearly; I'm treating the whole sequence at once. So the intuition is that attention allows you to look very far away
at once, and it doesn't have this dependence on the sequence index that keeps us from parallelizing operations. The rest of the lecture will talk in great depth about attention, so maybe let's just move on. [00:11:56] So let's think more deeply about attention. One thing you might think of with attention is that it's performing a kind of fuzzy lookup in a key-value store: you have a bunch of keys, a bunch of values, and it's going to help you access them. In an actual lookup table, just like a dictionary in Python for example, it's very simple: you have a table of keys, each key maps to a value, you give it a query, the query matches exactly one of the keys, and you return the value. So I've got a bunch of keys here, my query matches this key, so I return the value. Simple, fair, easy, okay, good. [00:12:39] In attention, the query matches all keys softly; there's no exact match. You compute some similarity between the query and all of the keys, and then you weight the results. So you've got a query, you've got a bunch of keys, and the query is similar to each of the keys to a different extent. You measure that similarity, between zero and one, through a softmax, and then you average the values via those weights: you do a weighted sum of the values, weighted by the similarity between the query and the keys, and you get an output. So it really is quite a bit like a lookup table, but in this soft, mushy, vector-space sense.
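That soft lookup can be sketched in a few lines of NumPy; the keys, values, and query below are toy data chosen for illustration, not anything from the lecture.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Attention as a 'fuzzy' key-value lookup: instead of returning
    the value for the one exactly matching key, weight every value by
    the softmax of its key's similarity to the query."""
    scores = keys @ query                  # dot-product similarity to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: weights in (0,1), sum to 1
    return weights @ values                # weighted sum of all the values

keys = np.eye(3)                           # three orthogonal toy keys
values = np.array([[1., 0.], [0., 1.], [5., 5.]])
out = soft_lookup(np.array([10., 0., 0.]), keys, values)
# the query strongly matches key 0, so the output is close to values[0]
```

With a sharper (larger-magnitude) query, the softmax weights approach a hard, exact-match lookup; with a flatter query, the output blends all the values.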
So I'm really doing some kind of access into the information that's stored in the key-value store, but I'm softly looking at all of the results. [00:13:40] Okay, any questions there? Cool. So what might this look like? If I was trying to represent the sentence "I went to Stanford's CS224N and learned", and I'm trying to build a representation of "learned": I have a key for each word (this is the self-attention thing that we'll get into), a key for each word and a value for each word, I've got the query for "learned", and I've got these teal-ish bars up top, which might say how much you're going to try to access each of the words. So maybe "224N" is not that important; "CS", maybe that determines what I learned; "Stanford", right; and then "learned", maybe that's important to representing itself. So you look across at the whole sentence and build
up this soft accessing of information across the sentence in order to represent "learned" in context. [00:14:35] Okay, so that's just a toy diagram; let's get into the math. We're going to look at a sequence of words w_1:n, a sequence of words in a vocabulary, like "Zuko made his uncle tea": that's a good sequence. And for each word we're going to embed it with this embedding matrix, just like we've been doing in this class: I have this embedding matrix E that goes from the vocabulary size to the dimensionality d, so each word has a non-contextual, only dependent-on-itself word embedding x_i = E w_i. [00:15:06] Now I'm going to transform each word with one of three different weight matrices; this is often called key-query-value self-attention. So I have a matrix Q, which is in R^(d x d), so it maps x_i, a vector of dimensionality d, to another vector of dimensionality d, and that's going to be a query vector: it takes x_i and rotates it, shuffles it around, stretches it, squishes it, makes it different, and now it's a query. With a different learnable parameter K, another matrix, I'm going to come up with my keys, and with a different learnable parameter V, I'm going to come up with my values. So I'm taking each of the non-contextual word embeddings, each of these x_i's, and transforming each of them to come up with my query for that word, my key for that word, and my value for that word. Every word is doing each of these roles. [00:16:03] Next, I'm going to compute all pairs of similarities between the keys and queries. In the toy example we saw, I was computing the similarity between a single query, for the word "learned", and all of the keys for the entire sentence; in this context I'm computing all pairs of similarities between all queries and all keys, because I want to represent all of these sums. It's just a dot product between the two vectors: e_ij = q_i dotted with k_j, the query for word i dotted with the key for word j, and I get this score, which is a real value: it might be very large and negative, might be zero, might be very large and positive, and it's like: how much should I look at j in this lookup table? [00:16:49] Then I do the softmax: the actual weight with which I'm going to look at j from i is alpha_ij, the softmax of this score over all of the possible indices, so it's the affinity between i and j, normalized by the affinity between i and all of the possible j' in the sequence. [00:17:10] And then my output is just the weighted sum of values: the output for word i, so maybe i is 1 for "Zuko", and I'm representing it as the sum over all j of these weights, so over "Zuko" and "made" and "his" and "uncle" and "tea", times the value vector for that word j: I'm looking from i to j as much as alpha_ij. [00:17:39] Question: what is w_i? You can either think of it as a symbol in the vocabulary V, or, as we're doing here, as a one-hot vector of dimensionality size-of-vocab. In the matrix E you see that it's R^(d x |V|), where |V| is the size of the vocabulary, so when I do E multiplied by w_i, that's taking E, which is d by |V|, multiplying it by w_i, which is length |V|, and returning a vector of dimensionality d.
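The equations just described can be put together in a short NumPy sketch of single-head key-query-value self-attention: q_i = Q x_i, k_j = K x_j, v_j = V x_j, scores e_ij = q_i . k_j, weights alpha_ij by row-wise softmax, and outputs o_i = sum_j alpha_ij v_j. The random matrices below stand in for learned parameters, and there's no scaling or masking, since those haven't been introduced yet.

```python
import numpy as np

def self_attention(X, Q, K, V):
    """Single-head self-attention over non-contextual embeddings X (n x d)."""
    q = X @ Q.T                          # queries q_i = Q x_i
    k = X @ K.T                          # keys    k_i = K x_i
    v = X @ V.T                          # values  v_i = V x_i
    e = q @ k.T                          # all-pairs scores e_ij = q_i . k_j
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)    # row i holds softmax weights alpha_ij
    return a @ v                         # o_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
n, d = 5, 8                              # e.g. the 5 words of "Zuko made his uncle tea"
X = rng.normal(size=(n, d))              # stand-in for x_i = E w_i
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Q, K, V)
print(out.shape)  # (5, 8): one contextual vector per word
```

Every word plays all three roles here, which is exactly the "every word is doing each of these roles" point above.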
um it has like maybe like a column for every word in that that [00:18:23] column for every word in that that sentence in each column is a length V [00:18:25] sentence in each column is a length V yeah usually I guess we think of it as [00:18:27] yeah usually I guess we think of it as having a I mean if I'm putting the the [00:18:29] having a I mean if I'm putting the the sequence length index first you might [00:18:32] sequence length index first you might think of having a row for each word but [00:18:33] think of having a row for each word but but similarly yeah it's it's n which is [00:18:36] but similarly yeah it's it's n which is the sequence length and then the second [00:18:37] the sequence length and then the second dimension would be V which is the [00:18:39] dimension would be V which is the vocabulary size and then that gets [00:18:41] vocabulary size and then that gets mapped to this thing which is sequence [00:18:43] mapped to this thing which is sequence length by D [00:18:46] um why do we learn two different [00:18:47] um why do we learn two different matrices q and K when like Q transpose [00:18:51] matrices q and K when like Q transpose Qi transpose KJ is really just one [00:18:54] Qi transpose KJ is really just one Matrix in the middle between that's a [00:18:56] Matrix in the middle between that's a great question it ends up being because [00:18:58] great question it ends up being because this will end up being a low rank [00:19:00] this will end up being a low rank approximation to that Matrix so it is [00:19:02] approximation to that Matrix so it is for computational efficiency reasons [00:19:05] for computational efficiency reasons although it also I think feels kind of [00:19:07] although it also I think feels kind of nice and uh in the presentation but yeah [00:19:10] nice and uh in the presentation but yeah what we'll end up doing is having a very [00:19:12] what we'll end up doing is having a very low rank approximation to qk transpose 
[00:19:14] and so you actually do do it like this. It's a good question. [00:19:26] Another question, let me remember to repeat it: e_ii, the query of a word dotted with its own key, does that look like anything in particular, like the identity? Okay, so it's actually unclear, this question of whether you should look at yourself when representing yourself. It's going to be encoded by the matrices Q and K. If I didn't have Q and K in there, if those were the identity matrices, then this would be the dot product of a vector with itself, which is going to be high on average, since you're pointing in the same direction as yourself. But Qx_i and Kx_i might be arbitrarily different from each other: Q could be the identity, and K could map you to the negative of yourself, for example, so that you don't look at yourself. So this is all learned in practice, and the model can decide, by learning, whether you should be looking at yourself or not; that's some of the flexibility that parameterizing it as Q and K gives you that wouldn't be there if I just used the x_i's everywhere in this equation. [00:20:49] I'm going to try to move on, I'm afraid, because there's a lot to get through, but we'll keep talking about self-attention, so as more questions come up I can also potentially return to this. [00:21:01] Okay, so this is our basic building block, but there are a bunch of barriers to using it as a replacement for our LSTMs, and so what we're going to do for
this portion of the lecture is talk about the minimal components that we need in order to use self-attention as this very fundamental building block. We can't use it as it stands, as I've presented it, because there are a couple of things that we need to solve or fix. [00:21:28] One of them is that there's no notion of sequence order in self-attention. What does this mean? If I have a sentence, I'm going to move over here to the whiteboard briefly, and hopefully I'll write quite large: "Zuko made his uncle", and, let's say, "his uncle made Zuko". If I were to embed each of these words using the embedding matrix, the embedding matrix isn't dependent on the index of the word, the word index 1, 2, 3, 4, versus now "his" is over here, and "uncle". So when I compute the self-attention (and there's a lot more in the lecture notes that goes through a full example), the actual self-attention operation will give you exactly the same representations for this sequence, "Zuko made his uncle", as for this sequence, "his uncle made Zuko", and that's bad, because they're sentences that mean different things. [00:22:45] So it's this idea that self-attention is an operation on sets: you have a set of vectors that you're going to perform self-attention on, and nowhere does the exact position of the words come into play directly. [00:22:59] So we're going to encode the position of words through the keys, queries, and values that we have. Consider now representing each sequence index, from 1 to n, as a vector.
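The order-invariance point can be checked numerically: permuting the input vectors just permutes the output vectors, so the per-word representations are identical regardless of word order, which is why "Zuko made his uncle" and "his uncle made Zuko" come out the same. A small check with random data standing in for real embeddings:

```python
import numpy as np

def attn(X, Q, K, V):
    """Same key-query-value self-attention computation as before,
    repeated here only to check its sensitivity to word order."""
    e = (X @ Q.T) @ (X @ K.T).T
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ (X @ V.T)

rng = np.random.default_rng(1)
n, d = 4, 6
X = rng.normal(size=(n, d))              # stand-in word embeddings
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 3, 1])            # reorder the "words"
out, out_perm = attn(X, Q, K, V), attn(X[perm], Q, K, V)
# permuting the inputs just permutes the outputs: word order is invisible
print(np.allclose(out[perm], out_perm))  # True
```

Position vectors, introduced next, are what break this symmetry.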
Don't worry so far about how it's being made, but you can imagine representing position 1, position 2, position 3 as a vector of dimensionality d, just like we're representing our keys, queries, and values. So these are position vectors p_i. [00:23:33] If you want to incorporate the information represented by these positions into our self-attention, you can just add these p_i vectors to the inputs: if I have this embedding x_i of a word, which is the word at position i, but really just represents "the word Zuko is here", now I can say that it's the word Zuko and it's at position 5, because this vector represents position 5. [00:24:08] Okay, so how do we do this? We might only have to do it once: we can do it once, at the very input to the network.
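Adding position vectors to the input embeddings might look like the following sketch. The sinusoidal construction used here to build the p_i is one common choice; the exact constant 10000 follows the original Transformer paper rather than anything stated in the lecture.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Position vectors p_i in R^d: even dimensions use sine, odd use
    cosine, with periods that grow with the dimension index."""
    pos = np.arange(n)[:, None]                 # positions i = 0..n-1
    dim = np.arange(0, d, 2)[None, :]           # one entry per sin/cos pair
    angles = pos / (10000 ** (dim / d))         # varying periods per dimension
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

n, d = 8, 16
X = np.random.default_rng(0).normal(size=(n, d))  # stand-in embeddings x_i
P = sinusoidal_positions(n, d)
X_pos = X + P     # done once, at the input: x_i becomes x_i + p_i
```

Because every p_i is distinct, the same word at two different positions now gets two different input vectors, which is exactly what the unordered self-attention operation was missing.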
and then that's sufficient; we don't have to do it at every layer, because the network knows it from the input. [00:24:23] So one way in which people have done this is sinusoidal position representations. It looks a little bit like this: you have this vector p_i, which is in dimensionality d, and in each one of the dimensions you take the value i, modify it by some constant, and pass it to the sine or cosine function, and you get values that vary with differing periods depending on the dimension. So I've got a representation of a matrix here where d is the vertical dimension and n is the horizontal, and you can see that as I walk along, the period of the sine function goes up and down, and
each of the dimensions d has a different period, and so together you can represent a bunch of different position indices. [00:25:17] It gives this intuition that maybe the absolute position of a word isn't as important; you've got the periodicity of the sines and cosines, and maybe that allows you to extrapolate to longer sequences, but in practice that doesn't work. Still, this is an early notion that's sometimes used even now for representing position in Transformers and self-attention networks in general. [00:25:42] So that's one idea. You might think it's a little bit complicated, a little bit unintuitive. Here's something that feels a little bit more deep learning: we just say, I've got a maximum sequence length of n, and I'm going to learn a matrix of dimensionality d by n.
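The sinusoidal construction described a moment ago can be written out concretely. This is a minimal NumPy sketch, not code from the lecture; the function name is invented here, and the 10000 base and sine/cosine interleaving follow the standard Transformer convention.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Return an (n, d) matrix whose row i is the position vector p_i.

    Even dimensions hold sin(i / 10000^(k/d)) and odd dimensions the
    matching cosine, so each pair of dimensions oscillates with a
    different period, as in the matrix picture on the slide.
    """
    i = np.arange(n)[:, None]              # positions 0..n-1, shape (n, 1)
    k = np.arange(0, d, 2)[None, :]        # even dimension indices, (1, d/2)
    angles = i / (10000.0 ** (k / d))      # (n, d/2) table of angles
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# Incorporating position is then just addition to the word embeddings:
# X = embeddings + sinusoidal_positions(X.shape[0], X.shape[1])
```

Because the values come from fixed sines and cosines, nothing here is learned; the hope mentioned in the lecture is that the periodic structure, rather than absolute indices, might transfer to longer sequences.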
That matrix is going to represent my positions, and I'm going to learn it as a parameter, just like I learn every other parameter. And what do the entries mean? I have no idea, but it represents position. [00:26:14] So you just add this matrix to the x_i's, your input embeddings, and it learns to fit the data; whatever index-based linear representation of position you want, you can learn it. The con is that you now definitely can't represent anything longer than n words: no sequence longer than n can be handled, because you only learned a matrix with that many positions. So in practice you'll get a model error if you pass a self-attention model something longer than length n; it will just crash and say, "I can't do this."
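The learned alternative is even shorter to sketch. The class below is a hypothetical illustration (the names are mine, and the matrix is just randomly initialized since the training loop is omitted): the position matrix is an ordinary parameter of shape (n_max, d) added to the input embeddings, and, as noted in the lecture, anything longer than n_max simply cannot be represented.

```python
import numpy as np

class LearnedPositions:
    """Learned absolute position embeddings: an (n_max, d) matrix
    trained like any other parameter."""

    def __init__(self, n_max, d, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(scale=0.02, size=(n_max, d))  # a parameter

    def add_to(self, X):
        """Add position vectors to an (n, d) stack of word embeddings."""
        n = X.shape[0]
        if n > self.P.shape[0]:
            # the crash described in the lecture: there is simply no
            # learned vector for positions past n_max
            raise ValueError(f"sequence length {n} > maximum {self.P.shape[0]}")
        return X + self.P[:n]
```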
And so this is what most systems nowadays use. There are also more flexible representations of position, including a couple in the lecture notes you might want to look at, such as encoding the relative linear position of words (whether words come before or after each other, but not their absolute position). There are also representations that hearken back to our dependency syntax, the idea being that maybe words that are close in the dependency parse tree should be the things that are close in the self-attention operation. [00:27:24] Okay, questions? [00:27:28] [Student] In practice, do we typically just make n large enough that we don't run into the issue of having an input longer than it? [00:27:38] So the question is: in practice, do we just make n long enough that we never have to look at a text longer than n? No, in practice it's
actually quite a problem, even today, even in the biggest language models. "Can I fit this prompt into ChatGPT?" or whatever is the kind of thing you might see on Twitter; these continue to be issues, and part of it is because the self-attention operation (we'll get into this later in the lecture) has quadratic complexity in the sequence length, so you're going to spend an n-squared memory budget in order to make sequence lengths longer. So in practice, on a large model, n might be, say, 4,000 or so. You can fit four thousand words, which feels like a lot, but it's not going to fit a novel, and it's not going to fit a Wikipedia page. There are models that do longer sequences, for sure, and again we'll talk a bit about that, but no, this actually is an issue.
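The n-squared cost in that answer is easy to put numbers on. This is my own back-of-envelope arithmetic, assuming one float32 score per pair of positions and ignoring everything else the model stores:

```python
# Each pair of positions gets one attention score, so the score matrix
# alone is n x n. At the n = 4000 mentioned in the lecture, in float32:
n = 4000
bytes_per_score = 4                     # float32
mib = n * n * bytes_per_score / 2**20   # mebibytes
# roughly 61 MiB for a single attention matrix, per head, per layer;
# doubling n to 8000 quadruples this to roughly 244 MiB
```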
[00:28:48] [Student] How do you know that this P matrix you've learned is representing position, as opposed to anything else? [00:28:56] The reason is that the only thing that correlates with it is position. So I'm adding this P matrix to my X matrix, the word embeddings; I'm adding them together. The words that show up at each index will vary depending on what word actually showed up there in each example, but the P matrix never differs: it's always exactly the same at every index, and so position is the only thing in the data that it correlates with. You're learning it implicitly: this vector at index one is always at index one, for every example, for every gradient update, and nothing else co-occurs like that. [00:29:31] What you end up learning, I don't know; it's unclear. But it definitely allows you to know that this word is at
this index, yeah. [00:29:41] [Student question, mostly inaudible] [00:29:56] Okay, so the question is: when this is quadratic in the sequence, is that a sequence of words? Yeah, think of it as a sequence of words. Sometimes there will be pieces that are smaller than words, which we'll go into in the next lecture, but yeah, think of this as a sequence of words, and not necessarily just a sentence; maybe an entire paragraph, or an entire document, or something like that. And yeah, the attention is words to words. [00:30:23] Okay, cool, I'm going to move on. [00:30:26] So we have another problem: based on the presentation of self-attention that we've done, there are really no non-linearities for the deep learning magic; we're just computing weighted averages of stuff. So if I apply self-attention, and then apply self-attention
again, and then again and again and again (you should look at the next lecture notes if you're interested in this; it's actually quite cool), what you end up doing is just re-averaging value vectors together. You're computing averages of value vectors, and it ends up looking like one big self-attention. [00:31:04] But there's an easy fix for this if you want the traditional deep learning magic: you can just add a feed-forward network to post-process each output vector. So I've got a word here that's the output of self-attention, and I'm going to pass it through what in this case I'm calling a multi-layer perceptron, an MLP. Its output is a vector in R^d, and it takes as input a vector in R^d, and you do the usual multi-layer perceptron thing, where you take
the output, multiply it by a matrix, pass it through a non-linearity, and multiply it by another matrix. [00:31:36] Okay, so what this looks like in self-attention: I've got this sentence, "the chef who the food", and I've got my embeddings for it. I pass it through this whole big self-attention block, which looks at the whole sequence and incorporates context and all that, and then I pass each one individually through a feed-forward layer. So this embedding that's the output of the self-attention for the word "the" is passed independently through a multi-layer perceptron here, and you can think of that as combining together, or processing, the result of attention. [00:32:11] There are a number of reasons why we do this. One of them is that you can actually stack a ton of computation into these feed-forward networks very efficiently.
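That per-word MLP can be sketched directly. This is an illustrative NumPy version (the names are mine, biases are included, and ReLU is chosen as the non-linearity); the key property is that each row, meaning each position, is processed independently.

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    """Post-process self-attention outputs H (shape (n, d)) with the
    same two-layer MLP at every position:

        output_i = relu(h_i W1 + b1) W2 + b2

    Written as matrix products over rows, so no information moves
    between positions in this step."""
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2
```

Because the same (W1, b1, W2, b2) apply at every position, stacking lots of computation here stays fully parallel across the sequence, which is part of what makes it GPU-friendly.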
They're very parallelizable, very good for GPUs. And this is what's done in practice: you do self-attention, and then you pass it through this position-wise feed-forward layer, where every word is processed independently by the feed-forward network. [00:32:39] Okay, so that adds our classical deep learning non-linearities to self-attention, and that's an easy fix for the no-non-linearities problem. And then we have one last issue before we have our final minimal self-attention building block with which we can replace RNNs, which is this: in all of these examples of self-attention that I've been writing out, you can look at the entire sequence. But in practice, for some tasks such as machine translation or language modeling, whenever you want to define
a probability distribution over a sequence, you can't cheat and look at the future. [00:33:20] So at every time step I could define the set of keys and queries and values to include only past words, but this is inefficient (bear with me): it's inefficient because you can't parallelize it so well. So instead we compute the entire n-by-n matrix, just like I showed in the slide discussing self-attention, and then mask out words in the future. So this score e_ij (and I computed e_ij for all n-by-n pairs of words) is equal to whatever it was before if the word you're looking at, at index j, is at an index less than or equal to where you are, index i; and it's equal to negative infinity, roughly, otherwise, if it's in the future. And when you softmax the e_ij, negative infinity gets mapped to zero, so now my attention is weighted zero:
my weighted average puts zero weight on the future, so I can't look at it. [00:34:18] What does this look like? In order to encode these words, "the chef who", and maybe the start symbol there, I could look at all pairs of words, and then I just gray out (negative-infinity out) the words I can't look at. So encoding the start symbol, I can just look at the start symbol. When encoding "the", I can look at the start symbol and "the". Encoding "chef", I can look at start, "the", "chef", but I can't look at "who". And so with this representation of "chef", which is only looking at start, "the", "chef", I can define a probability distribution using this vector that allows me to predict "who", without having cheated by already looking ahead and seeing that "who" is the next word. [00:35:09] Questions?
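The masking recipe just described (compute all n-by-n scores, set future scores to negative infinity, then softmax) can be sketched as follows; this is a minimal single-head version with unscaled dot-product scores, assuming Q, K, V have already been computed.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked self-attention: e_ij = q_i . k_j when j <= i, and -inf
    when j > i, so the softmax puts exactly zero weight on the future.

    Returns the (n, d) outputs and the (n, n) attention weights."""
    n = Q.shape[0]
    E = Q @ K.T                                         # all n*n scores e_ij
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # entries with j > i
    E = np.where(future, -np.inf, E)                    # -inf on the future
    E -= E.max(axis=1, keepdims=True)                   # numerically stable softmax
    A = np.exp(E)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V, A
```

The diagonal is kept (j <= i includes j = i), so each word can always attend to itself, matching the start-symbol example above.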
[00:35:11] [Student] You say this is for using it in decoders; do we do this for both the encoding layer and the decoding layer, or for the encoding layer are we allowing ourselves to look forward? [00:35:23] The question is: it says here that we're using this in a decoder; do we also use it in the encoder? This is the distinction between something like a bidirectional LSTM and a unidirectional LSTM. Wherever you don't need this constraint, you probably don't use it. So if you're running an encoder on the source sentence of your machine translation problem, you probably don't do this masking, because it's probably good to let everything look at everything. And whenever you do need it, because you have this autoregressive factorization (probability of word one, probability of word two given word one, word three given words two and one), then you would use it. So traditionally, yes: in decoders you will use it; in encoders you will not. [00:36:02] Yes? [00:36:04] [Student] My question is somewhat
philosophical: don't humans actually generate sentences by having some notion of the probability of future words before they choose the words that they're currently speaking or writing? [00:36:27] Good question. So the question is: isn't looking ahead a little bit, predicting or getting an idea of the words you might say in the future, sort of how humans generate language, instead of this strict constraint of not seeing into the future? Okay, so: trying to plan ahead to see what I should do is definitely an interesting idea. But when I am training the network, if I'm teaching it to predict the next word and I give it the answer, it's not going to learn anything useful. So in practice, when I'm generating
text, maybe it would be a good idea to make some guesses far into the future, or to have a high-level plan or something. But in training the network, I can't encode that intuition about how humans generate sequences of language by just giving it the answer about the future directly, because then it's just too easy; there's nothing to learn. There might be interesting ideas about giving the network a hint as to what kind of thing could come next, for example, but that's out of scope for this. Yeah. [00:37:31] [Student] Question up here: I understand why we want to mask the future for things like language models, but how does it apply to machine translation? Why would we use it there? [00:37:40] Yeah. So in machine translation, I'm going to come over to this board and hopefully get a better marker. Nice. In machine translation, I
have a sentence like "I like pizza", and I want to be able to translate it: "j'aime la pizza". Nice. [00:38:09] So when I'm looking at "I like pizza", I get this as the input, and I want self-attention without masking, because I want "I" to look at "like", and "I" to look at "pizza", and "like" to look at "pizza"; I want it all. And then when I'm generating this side, if my tokens are, say, "j'aime", "la", "pizza", then in encoding this word I want to be able to look only at myself (we'll talk about encoder-decoder architectures later in the lecture), at none of the future, and at all of the source. And so what I'm talking about right now in this masking case is masking out, with negative infinity, all of these future words, so that the attention score
from this word to everything else in the future is set to negative infinity. Does that answer your question? Great. [00:39:09] Okay, let's move ahead. So that was our last big building-block issue with self-attention. This is what I would call (and this is my personal opinion) a minimal self-attention building block: you have self-attention, the basis of the method, which is here in red; you have the inputs to the sequence here, and you embed them with that embedding matrix E, and then you add position embeddings. These three arrows represent using the key, the value, and the query; that's stylized there, and this is often how you see these diagrams. And so you pass it to self-attention, with the position representation, so that specifies
the sequence order, because otherwise you'd have no idea what order the words showed up in. You have the non-linearities in the teal feed-forward network there, to provide that squashing and deep-learning expressivity. And then you have masking, in order to have parallelizable operations that don't look at the future.

[00:40:15] So this is our minimal architecture, and then up at the top, maybe you repeat this self-attention and feed-forward pair many times: self-attention, feed-forward, self-attention, feed-forward. That's what I'm calling a block. And then maybe at the end of it you predict something; we haven't really talked about that, but you have these representations and then you predict the next word, or you predict the sentiment, or you
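The minimal block just described (embed the tokens, add position embeddings, run masked self-attention, then a feed-forward network) can be sketched end to end in numpy. This is a toy illustration, not the course's code: the dimensions, the random weights, and the choice of ReLU as the non-linearity are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, vocab = 5, 8, 50             # toy sequence length, model dim, vocab size

E = rng.normal(size=(vocab, d))    # embedding matrix
pos = rng.normal(size=(n, d))      # position embeddings, one per position
Q = rng.normal(size=(d, d))        # query matrix
K = rng.normal(size=(d, d))        # key matrix
V = rng.normal(size=(d, d))        # value matrix
W1 = rng.normal(size=(d, 4 * d))   # feed-forward weights
W2 = rng.normal(size=(4 * d, d))

ids = np.array([3, 14, 15, 9, 2])  # a toy token-id sequence
X = E[ids] + pos                   # embed the sequence, then add positions

# masked self-attention: scores to future positions are set to -inf,
# so after the softmax they get exactly zero weight
scores = (X @ Q) @ (X @ K).T                  # n x n attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf
A = softmax(scores, axis=-1)                  # attention weights, rows sum to 1
attn_out = A @ (X @ V)                        # weighted average of value vectors

# position-wise feed-forward network (ReLU chosen for illustration)
H = np.maximum(attn_out @ W1, 0.0) @ W2       # n x d block output
```

In a real model you would repeat the attention plus feed-forward pair several times and put a prediction layer (next word, sentiment, and so on) on top of H.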
predict whatever. So this is a self-attention architecture. We're going to move on to the Transformer next, so if there are any questions: yeah?

[00:40:52] The other way around: we use masking for decoders, where I want to decode out a sequence and I have an informational constraint, namely that to represent this word properly I cannot have information from the future.

[00:41:15] Great, so now let's talk about the Transformer. What I've pitched to you is what I call a minimal self-attention architecture, and I quite like pitching it that way, but really no one uses the architecture that was just up on the previous slide. It doesn't work quite as well as it could, and there are a bunch of important details that go into the Transformer that we'll talk about now. But what I would hope, though, to
have you take away is that the Transformer architecture, as I'll present it now, is not necessarily the end point of our search for better and better ways of representing language, even though it's now ubiquitous and has been for a couple of years. So think about these problems of using self-attention, and maybe ways of fixing some of the issues with Transformers.

[00:42:08] Okay, so a Transformer decoder is how we'll build systems like language models. It's like our self-attention-only minimal decoder architecture, but it's got a couple of extra components, some of which I've grayed out here, that we'll go over one by one. The first that's actually different is that we'll replace our self-attention-with-masking with masked multi-head self-attention. This ends up being
crucial; it's probably the most important distinction between the Transformer and the minimal architecture I've presented.

[00:42:43] So let's come back to our toy example of attention, where we've been trying to represent the word "learned" in the context of the sequence "I went to Stanford CS224N and learned". I was giving these teal bars to say that maybe, intuitively, you look at various things to build up your representation of "learned". But really there are varying ways in which I want to look back at the sequence, to see varying aspects of the information I want to incorporate into my representation. So maybe I want to look at "Stanford CS224N" because those are entities: you learn different stuff at Stanford CS224N than you do at other courses or other universities or
whatever, and so maybe I want to look there for that reason. And in another sense I actually want to look at the word "learned" itself, and at "I went and learned": maybe syntactically relevant words. There are very different reasons for which I might want to look at different things in the sequence, and trying to average it all out with a single operation of self-attention ends up being somewhat too difficult, in a way that we'll make precise in assignment five, where we'll do a little bit more math. Okay, any questions about this intuition?

[00:44:13] Yeah, so each head should be an application of attention just as I've presented it: independently define the keys, define the queries, define the values. I'll define it more precisely here, but think of
it as: I do attention once, and then I do it again with different parameters, able to look at different things, and so on.

[00:44:38] So the question is: if we have two separate sets of weights trying to learn, say, to do this and to do that, how do we ensure that they learn different things? We do not ensure it; we hope that they learn different things, and in practice they do, although not perfectly. It ends up being the case that you have some redundancy, and you can cut some of the heads out, but that's out of scope here. Just as we hope that different dimensions in our feed-forward layers will learn different things because of the lack of symmetry, we hope that the heads will start to specialize, and once they start, they'll specialize even more.

[00:45:16] All right, so in order to discuss
multi-head self-attention, we really need to talk about the matrices, and how we're going to implement this efficiently on GPUs, so let's look at the sequence-stacked form of attention. We've been talking about each word individually as a vector of dimensionality d, but really we're going to work on these as big stacked matrices. I take all of my word embeddings x_1 to x_n and stack them together, and now I have a big matrix X in R^{n×d}.

[00:45:49] Now, with my matrices K, Q, and V, I can just multiply them on that side of X. X is in R^{n×d} and K is in R^{d×d}, so (n×d) times (d×d) gives you n×d again. So I can compute one big matrix multiply on my whole sequence, multiplying each of the words by my key, query, and value matrices very
efficiently. This is the vectorization idea: I don't want to for-loop over the sequence; I represent the sequence as a big matrix and do one big matrix multiply.

[00:46:27] Then the output is defined by this somewhat inscrutable bit of math, which I'm going to go over visually. First we take the key-query dot products in one matrix: we've got XQ, which is in R^{n×d}, and (XK)^T, which is in R^{d×n}, so (n×d) times (d×n). This computes all of the e_ij scores for self-attention: all pairs of attention scores, computed in one big matrix multiply. Next I apply the softmax over the second n dimension to get my normalized scores, and then I multiply by XV. That is an n×n matrix multiplied by
an n×d matrix, and what do I get? This is just doing the weighted average: one big weighted-average computation on the whole matrix, giving my whole self-attention output in R^{n×d}. So I've restated the self-attention operations identically, but computed in terms of matrices, so you can do this efficiently on a GPU.

[00:47:50] Okay, multi-head attention. It's going to be important to compute this in terms of matrices too, as we'll see, and it's going to give us the ability to look in multiple places at once, for different reasons. Self-attention looks where the dot product is high: x_i times the query matrix against the key matrix. But maybe we want to look in different places for different reasons, so we actually define multiple query, key, and value matrices. I'm going to have a bunch of heads; I'm
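Written out as code, the whole single-head computation, output = softmax(XQ (XK)^T) XV, is three matrix multiplies plus a softmax over the second n dimension. A minimal sketch with toy sizes (no masking or scaling shown, and all weights random):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d = 5, 8                        # toy sequence length and model dimensionality
X = rng.normal(size=(n, d))        # stacked word vectors, R^{n x d}
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))

E_scores = (X @ Q) @ (X @ K).T     # all pairwise e_ij scores, n x n
A = softmax(E_scores, axis=-1)     # normalize over the second n dimension
out = A @ (X @ V)                  # weighted average of value vectors, n x d
```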
going to have h self-attention heads, and for each head I'm going to define an independent query, key, and value matrix, whose shape maps from the model dimensionality d down to d/h. Each of these is a projection down to a lower-dimensional space; this is for computational efficiency. Then I apply self-attention independently for each head, so this equation is identical to the one we saw for single-head self-attention, except we've got the head index l everywhere. I've got this lower-dimensional mapping, and my lower-dimensional value vector, so each head's output is in R^{d/h}. But really you're doing exactly the same kind of operation, just h different times, and then you
combine the outputs. So I've looked in different places with the different key, query, and value matrices, I get each head's output, and then I concatenate them together. Each one has dimensionality d/h, so I concatenate them and then mix them together with a final linear transformation. Each head gets to look at different things and construct its value vectors differently, and then I combine the results all together at once.

[00:49:49] Let's go through this visually, because it's at least helpful for me. It's actually not more costly to do this than to compute a single head of self-attention, and we'll see that through the pictures. In single-head self-attention we computed XQ, and in multi-head self-attention we'll also compute XQ the same way. XQ is in R^{n×d},
reshape it into our n [00:50:23] and then we can reshape it into our n that's sequence length times the number [00:50:25] that's sequence length times the number of heads times the model dimensionality [00:50:29] of heads times the model dimensionality over the number of heads so I've just [00:50:30] over the number of heads so I've just reshaped it to say now I've got you know [00:50:33] reshaped it to say now I've got you know a big three axis tensor the first axis [00:50:36] a big three axis tensor the first axis is the sequence length the second one is [00:50:38] is the sequence length the second one is the number of heads the third is this [00:50:40] the number of heads the third is this reduced model dimensionality [00:50:41] reduced model dimensionality and that costs nothing right and do the [00:50:44] and that costs nothing right and do the same thing for x and V and then I [00:50:46] same thing for x and V and then I transpose so that I've got the head axis [00:50:49] transpose so that I've got the head axis as the first axis and now I can compute [00:50:52] as the first axis and now I can compute all my other operations with the head [00:50:54] all my other operations with the head axis kind of like a batch [00:50:57] axis kind of like a batch so what does this look like in uh in [00:51:00] so what does this look like in uh in practice like instead of having one big [00:51:03] practice like instead of having one big xq Matrix that's Model dimensionality D [00:51:06] xq Matrix that's Model dimensionality D I've got like in this case three x Cube [00:51:10] I've got like in this case three x Cube matrices of Model dimensionality D by 3 [00:51:12] matrices of Model dimensionality D by 3 D by three D by three same thing with [00:51:14] D by three D by three same thing with the key Matrix here [00:51:16] the key Matrix here so everything looks almost identical [00:51:18] so everything looks almost identical it's just a reshaping of the tensors and [00:51:21] 
it's just a reshaping of the tensors, and now, right at the output of this, I've got three sets of attention scores, just by doing this reshape. The cost is that each of my attention heads has only a d/h-dimensional vector to work with instead of a d-dimensional one. So I get these three sets of pairwise scores, I compute the softmax independently for each of the three, I have three value matrices as well, each of them lower-dimensional, and finally I get my three different output vectors and a final linear transformation to mix them together into one output. In summary, this allows you to do exactly what I gave in the toy example: I can have each of these heads look at different parts of a sequence for different reasons.

[00:52:17] So this is at a given block, right; all of these attention heads are
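The reshape-and-transpose trick can be sketched directly; here `split_heads` and the output projection `Wo` are illustrative names for this sketch, not notation from the slides, and the weights are random toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d, h = 5, 12, 3                  # sequence length, model dim, number of heads
X = rng.normal(size=(n, d))
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
Wo = rng.normal(size=(d, d))        # final linear transformation mixing the heads

def split_heads(M):
    # (n, d) -> (n, h, d/h) -> (h, n, d/h): the head axis acts like a batch
    return M.reshape(n, h, d // h).transpose(1, 0, 2)

xq, xk, xv = split_heads(X @ Q), split_heads(X @ K), split_heads(X @ V)

scores = xq @ xk.transpose(0, 2, 1)  # h separate (n x n) score matrices at once
A = softmax(scores, axis=-1)         # softmax independently per head
heads = A @ xv                       # (h, n, d/h) head outputs

# concatenate the heads back to (n, d), then mix with the output projection
out = heads.transpose(1, 0, 2).reshape(n, d) @ Wo
```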
for a given Transformer block, and the next block could also have three attention heads. The question is: are all of these for a given block? We'll talk about blocks again, but a block is this pair of self-attention and feed-forward network: you do self-attention then feed-forward, and that's one block; another block is another self-attention and another feed-forward. And are the parameters shared between the blocks or not? Generally they are not shared; you'll have independent parameters at every block, although there are some exceptions.

[00:52:53] Is it typically the case that you have the same number of heads at each block, or do you vary the number of heads across blocks? So the question is: do you have different numbers of heads across the different blocks, or do
you have the same number of heads across all blocks? The simplest thing is to just have it be the same everywhere, which is what people have done; I haven't yet found a good reason to vary it, but it could be interesting. It's definitely the case that after training these networks you can totally zero out, that is, remove, some of the attention heads, and I'd be curious to know whether you could remove more or fewer depending on the layer index, which might then say we should just have fewer. But again, it's not actually more expensive to have a bunch, so people tend instead to set the number of heads so that you have a reasonable number of dimensions per head, given the total model dimensionality d that you want. For example, I might want at least 64 dimensions per head, which, if d is 128, that
tells me how many heads I'm going to have, roughly. So people tend to scale the number of heads up with the model dimensionality.

[00:54:13] By slicing it into different columns, you're reducing the rank of the final matrix, right? Does that not have any effect on the results? So the question is: with these reduced XQ and XK matrices, this little sliver and this little sliver defining this whole big matrix, it's a very low-rank approximation; is that not bad? In practice, no. Again, this is the reason we limit the number of heads depending on the model dimensionality: intuitively, you want each head to have at least some number of dimensions, so 64 is sometimes done, or 128, something like that. But if you're not giving each head too much to do, and it's got a simple job, and you've got a lot of heads, it ends
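That rule of thumb is simple arithmetic; the `num_heads` helper and the 64-dimension default below are hypothetical names used only for illustration:

```python
# hedged sketch: pick the number of heads from a per-head dimension budget,
# so heads scale up with the model dimensionality
def num_heads(d_model, min_dims_per_head=64):
    # keep at least min_dims_per_head dimensions in each head
    return max(1, d_model // min_dims_per_head)

assert num_heads(128) == 2    # d = 128, 64 dims per head -> 2 heads
assert num_heads(768) == 12   # a 768-dim model gives 12 heads of 64 dims
```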
up being okay. At the very least, all we really know is that empirically it's way better to have more heads than just one.

[00:55:14] I'm wondering, have there been studies to see whether the information that one of the attention heads learns is consistent, and related to the others? So the question is: have there been studies of whether there's consistent information encoded by the attention heads? Yes, actually, there's been quite a lot of study in interpretability and analysis of these models, trying to figure out what mechanistic roles each of these heads takes on, and there are quite a few exciting results there, around some attention heads learning to pick out, say, syntactic dependencies, or maybe doing a sort of global averaging
of [00:56:02] like a sort of a global averaging of context [00:56:03] context um the question is quite nuanced though [00:56:05] um the question is quite nuanced though because in a deep Network it's unclear [00:56:07] because in a deep Network it's unclear and we should talk about this more [00:56:09] and we should talk about this more offline but it's unclear if you look at [00:56:10] offline but it's unclear if you look at a word 10 layers deep in a network what [00:56:13] a word 10 layers deep in a network what you're really looking at because it's [00:56:15] you're really looking at because it's already Incorporated context from [00:56:17] already Incorporated context from everyone else and it's a little bit [00:56:19] everyone else and it's a little bit unclear active area of research but I [00:56:21] unclear active area of research but I think I should move on uh now to uh keep [00:56:25] think I should move on uh now to uh keep discussing Transformers but yeah if you [00:56:27] discussing Transformers but yeah if you want to talk more about it I'm happy to [00:56:30] um okay so so uh another sort of uh hack [00:56:33] um okay so so uh another sort of uh hack that I'm going to toss in here I mean [00:56:35] that I'm going to toss in here I mean maybe they wouldn't call it hack but you [00:56:36] maybe they wouldn't call it hack but you know it's a nice little method to [00:56:38] know it's a nice little method to improve things it's called scaled dot [00:56:40] improve things it's called scaled dot product attention so one of the issues [00:56:43] product attention so one of the issues with this sort of key query value [00:56:45] with this sort of key query value self-attention is that when the model [00:56:46] self-attention is that when the model dimensionality becomes large the dot [00:56:49] dimensionality becomes large the dot products between vectors even random [00:56:51] products between vectors even random vectors tend to become uh large [00:56:54] 
vectors tend to become uh large and when that happens the inputs to the [00:56:57] and when that happens the inputs to the softmax function can be very large [00:56:59] softmax function can be very large making the gradients small so [00:57:01] making the gradients small so intuitively if you have two random [00:57:03] intuitively if you have two random vectors in Model dimensionality D and [00:57:05] vectors in Model dimensionality D and you just dot product them together as D [00:57:07] you just dot product them together as D grows their dot product grows an [00:57:09] grows their dot product grows an expectation to be very large and so you [00:57:12] expectation to be very large and so you know you sort of want to start out with [00:57:14] know you sort of want to start out with everyone's attention being very uniform [00:57:16] everyone's attention being very uniform very flat sort of look everywhere but if [00:57:19] very flat sort of look everywhere but if some dot products are very large then [00:57:21] some dot products are very large then you know learning will be inhibited and [00:57:23] you know learning will be inhibited and so what you end up doing is you just [00:57:25] so what you end up doing is you just sort of for each of your heads uh you [00:57:28] sort of for each of your heads uh you know you just sort of divide all the [00:57:29] know you just sort of divide all the scores by this constant that's [00:57:30] scores by this constant that's determined by the model dimensionality [00:57:32] determined by the model dimensionality so as the vectors grow very large their [00:57:36] so as the vectors grow very large their dot products don't at least at an [00:57:39] dot products don't at least at an initialization time so this is sort of [00:57:41] initialization time so this is sort of like a nice little [00:57:42] like a nice little um [00:57:43] um you know important but but maybe not uh [00:57:48] you know important but but maybe not uh like yeah 
it's it's important to know [00:57:51] like yeah it's it's important to know um and uh so that's called scale dot [00:57:54] um and uh so that's called scale dot product attention from here on out we'll [00:57:57] product attention from here on out we'll just assume that we do this you know [00:57:58] just assume that we do this you know it's quite easy to implement you just do [00:58:00] it's quite easy to implement you just do a little division in all of your uh [00:58:02] a little division in all of your uh computations [00:58:04] okay so so now in the Transformer [00:58:07] okay so so now in the Transformer decoder we've got a couple of other [00:58:08] decoder we've got a couple of other things that I have un uh faded out here [00:58:12] things that I have un uh faded out here um we have two big optimization tricks [00:58:14] um we have two big optimization tricks or optimization methods I should say [00:58:16] or optimization methods I should say really because these are quite important [00:58:17] really because these are quite important that end up being very important we've [00:58:20] that end up being very important we've got residual connections and layer [00:58:21] got residual connections and layer normalization and in Transformer [00:58:24] normalization and in Transformer diagrams that you see sort of around the [00:58:26] diagrams that you see sort of around the web they're often uh written together as [00:58:29] web they're often uh written together as this ad and Norm box and in practice in [00:58:33] this ad and Norm box and in practice in the Transformer decoder I'm going to you [00:58:35] the Transformer decoder I'm going to you know apply mask multi-head attention and [00:58:38] know apply mask multi-head attention and then do this sort of optimization add a [00:58:40] then do this sort of optimization add a norm then I'll do a feed forward [00:58:42] norm then I'll do a feed forward application and then add a norm so you [00:58:45] application 
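As a side note, the scaled dot-product attention just described can be sketched in a few lines of NumPy for a single head. This is my own illustration, not code from the lecture; variable names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (n, d_k); V: array of shape (n, d_v)."""
    d_k = Q.shape[-1]
    # Divide the scores by sqrt(d_k) so they don't grow with the
    # model dimensionality at initialization.
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 4, 64
Q, K, V = rng.normal(size=(3, n, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64)
```

The only change from plain dot-product attention is that one division, which keeps the initial softmax inputs small and the gradients healthy.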
[00:58:48] You know, this is quite important, so let's go over these two individual components. The first is residual connections. I think we've talked about residual connections before, right? Well, it's worth doing it again, because it's really a good trick to help models train better. [00:59:04] So just to recap: you have a layer, layer i minus one, and you pass it through a thing, maybe it's self-attention, maybe it's a feed-forward network, and now you've got layer i. I'm going to add the result of layer i to its input, so now I'm just going to compute the layer and add in the input to the layer, so that I only have to learn the residual from the previous layer. So I've got this sort of connection here; it's often written as this, sort of like, oh, this connection.
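A tiny sketch of that recap, where the sublayer is a stand-in linear-plus-ReLU with small random weights (none of this is the lecture's actual code):

```python
import numpy as np

def sublayer(x, W, b):
    # Stand-in for self-attention or a feed-forward network.
    return np.maximum(0, x @ W + b)

def residual_block(x, W, b):
    # layer_i = x + sublayer(x): the block only has to learn the
    # *residual*, and gradients flow through the identity path untouched.
    return x + sublayer(x, W, b)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W = rng.normal(size=(d, d)) * 0.01   # small weights at initialization
b = np.zeros(d)
y = residual_block(x, W, b)
# With small weights the whole block stays close to the identity,
# so this prints a small number:
print(np.abs(y - x).max())
```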
[00:59:37] Okay, right, it goes around, and you should think that the gradient is just really great through the residual connection. If I've got vanishing or exploding gradients through this layer, well, I can at least learn everything behind it, because I've got this residual connection where the gradient is one, because it's the identity. [01:00:00] This is really nice, and it also means that, at least at initialization, everything looks a little bit like the identity function, right? Because if the contribution of the layer is somewhat small, because all of your weights are small, and I have the addition from the input, maybe the whole thing looks a little bit like the identity, which might be a good place to start.

[01:00:20] And there are really nice visualizations; I just love this visualization. So this is your loss landscape, right? You're doing gradient descent, and you're trying to traverse the mountains of the loss landscape; this is the parameter space, and down is better in your loss function. And it's really hard, so you get stuck in some local optima and you can't find your way out. And then, with residual connections, I mean, come on, you just sort of walk down. I mean, it's not actually, I guess, really how it works all the time, but I really love this, it's great.

[01:00:58] Okay, so we've seen residual connections; we should move on to layer normalization. Layer norm is another thing to help your model train faster, and, you know, the intuitions around layer normalization and the empiricism of it working very well maybe aren't perfectly, let's say, connected. But you
should imagine, I suppose, that we want to, say... you know, this variation within each layer: things can get very big, things can get very small, and that's not actually informative, because of variations between maybe the gradients, or, you know, I've got sort of weird things going on in my layers that I can't totally control. I haven't been able to make everything behave nicely, where everything stays roughly the same norm; maybe some things explode, maybe some things shrink, and I want to cut down on uninformative variation between layers. [01:01:59] So I'm going to let x ∈ ℝ^d be an individual word vector in the model, so this is at a single index, one vector, and what I'm going to try to do is just normalize it, in the sense that it's got a bunch of variation, and I'm going to cut that out: I'm going to normalize it to zero mean and unit standard deviation. [01:02:20] So I'm going to estimate the mean across all of the dimensions in the vector: for j equals one to the model dimensionality d, I've got this one big word vector and I sum up all the values, with the division by d here, right, that's the mean µ. And I'm going to have my estimate of the standard deviation σ; again, these should say estimates, this is my simple estimate of the standard deviation of the values within this one vector. [01:02:53] And then, possibly, I can have learned parameters to try to scale back out, multiplicatively and additively; that's optional. We're going to compute this standardization: take my vector x, subtract out the mean, and divide by the standard deviation plus this epsilon constant; [01:03:15] if there's not a lot of variation, I don't want things to explode, so I'm going to have this epsilon there that's close to zero. So this part here, (x − µ)/(σ + ε), is saying: take all the variation and normalize it to zero mean and unit standard deviation. [01:03:32] And then maybe I want to scale it, stretch it back out, and then maybe add an offset β that I've learned, although in practice, and I discuss this in the lecture notes, this part maybe isn't actually that important. [01:03:47] So, layer normalization: you can think of this as, when I get the output of layer normalization, it's going to look nice, look similar, to the next layer, independent of what's gone on before, because it's going to be zero mean and unit standard deviation, so maybe that makes for a better thing to learn off of for the next layer.
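Here's a small sketch of exactly that computation, applied to one word vector at a time (my own illustrative code; `gamma` and `beta` stand in for the optional learned scale and offset):

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one word vector x (shape (d,)) to zero mean and unit
    standard deviation across its d dimensions, then optionally scale
    by gamma and shift by beta (learned in practice)."""
    mu = x.mean()                    # mean over the d dimensions
    sigma = x.std()                  # std over the d dimensions
    out = (x - mu) / (sigma + eps)   # eps keeps us from dividing by ~0
    if gamma is not None:
        out = gamma * out
    if beta is not None:
        out = out + beta
    return out

x = np.array([2.0, 4.0, 6.0, 8.0])
y = layer_norm(x)
print(y.mean().round(6), y.std().round(4))  # 0.0 1.0
```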
[01:04:09] Okay, any questions for residual connections or layer norm? Yes?

[01:04:16] Yeah, it's a good question: when I subtract the scalar µ from the vector x, I broadcast µ to dimensionality d and remove µ from all d dimensions. Yeah, good point, thank you, that was unclear.

[01:04:34] Sure. [Student] Should it be divided by d, or... Sorry, can you repeat that? [Student] In the fourth bullet point, when you're calculating the mean, is it divided by d? I think it is divided by d, yeah. And this is the average deviation from the mean of all of the values, yeah.

[01:05:04] [Student question, partially inaudible, about which statistics are used for normalization.]

[01:05:11] So the question is: if I have five words in the sequence, do I normalize by aggregating the statistics, estimating µ and σ across all five words so that they share their statistics, or do it independently for each word? This is a great question, which I think is under-specified in all the papers that discuss Transformers. You do not share across the five words, which is somewhat confusing to me; each of the five words is done completely independently. You could have shared across the five words and said that your estimate of the statistics is based on all five, but you do not. [01:05:49] I can't pretend I totally understand why.

[01:05:54] [Student question about sharing statistics across the batch, for the same position.]

[01:05:58] So a similar question: if you have a batch of sequences, like in batch-based training, then for a single word, we don't share the statistics across the sequence index, but do we share across the batch? And the answer is no, you also do not share across the batch. In fact, layer normalization was invented as a replacement for batch normalization, which did just that, and the issue with batch normalization is that your forward pass then depends, in a way that you don't like, on examples that should be unrelated to your example. So yeah, you don't share statistics across the batch.

[01:06:40] Okay, cool. So now we have our full Transformer decoder, and we have our blocks. In this slightly grayed-out thing here that says repeat for a number of decoder blocks, each block consists of: I pass the input through self-attention, and then my Add & Norm, right, so I've got this residual connection here that goes around, and I've got the layer normalization there, and then a feed-forward layer, and then another Add & Norm. That set of four operations is a single block, and I apply it some number of times, the number of blocks. And that's it, that's the Transformer decoder as it is. [01:07:31] Cool, so that's a whole architecture right there.
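The four operations of one decoder block can be wired up as follows; the sublayers here are stand-ins just to show the Add & Norm composition described above, not real attention and not the lecture's code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-vector normalization, applied independently to each row (word).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_block(x, self_attention, feed_forward):
    # One block: (masked) self-attention, Add & Norm,
    # then feed-forward, then Add & Norm again.
    x = layer_norm(x + self_attention(x))   # Add & Norm
    x = layer_norm(x + feed_forward(x))     # Add & Norm
    return x

rng = np.random.default_rng(0)
n, d = 5, 16
Wf = rng.normal(size=(d, d)) * 0.1
x = rng.normal(size=(n, d))
# Dummy sublayers: a zero "attention" and a linear+ReLU "feed-forward".
out = decoder_block(x, lambda h: h * 0.0, lambda h: np.maximum(0, h @ Wf))
print(out.shape)  # (5, 16)
```

In a real decoder this block is repeated for the chosen number of layers, each with its own parameters.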
We've solved things like needing to represent position, we've solved things like not being able to look into the future, and we've solved a lot of different optimization problems. You've got a question? Yes?

[01:07:49] [Student] Is the mask applied to the multi-head attention? Yeah. [Student] With the dot-product scaling, the square root of d over h, as well? Yeah.

[01:08:03] So the question is, how do these models handle variable-length inputs? The input to the GPU forward pass is going to be a constant length, so you're going to maybe pad to a constant length, and in order to not look at the padding, you can mask out the pad tokens, just like the masking that we showed for not looking at the future: in general, you can just set all of the attention weights to zero, or the scores to negative infinity, for all of the pad tokens. [01:08:47] Yeah, exactly.
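A sketch of the pad-token masking just mentioned: set the scores toward pad positions to negative infinity, so the softmax assigns them exactly zero weight (names and shapes are my own illustration):

```python
import numpy as np

def masked_scores(scores, is_pad):
    """Set attention scores toward pad tokens to -inf, so that after
    the softmax their attention weights are exactly zero.

    scores: (n, n) raw dot-product scores; is_pad: (n,) boolean."""
    masked = scores.copy()
    masked[:, is_pad] = -np.inf
    return masked

scores = np.zeros((4, 4))
is_pad = np.array([False, False, True, True])  # last two positions are padding
weights = np.exp(masked_scores(scores, is_pad))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # [0.5 0.5 0.  0. ]
```

The causal (no-looking-at-the-future) mask works the same way, just with a triangular pattern of minus-infinities instead of whole columns.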
yeah exactly so you can you can uh set everything to this maximum length now in [01:08:52] everything to this maximum length now in practice so the question was do you set [01:08:54] practice so the question was do you set this length that you have everything be [01:08:55] this length that you have everything be be that maximum length I mean you know [01:08:58] be that maximum length I mean you know yes often although you can save [01:08:59] yes often although you can save computation by setting it to something [01:09:02] computation by setting it to something smaller and uh everything the math all [01:09:05] smaller and uh everything the math all still works out you just have to code it [01:09:07] still works out you just have to code it properly so it can handle so you set [01:09:09] properly so it can handle so you set everything instead of the N you set it [01:09:10] everything instead of the N you set it all to five if everything is shorter [01:09:12] all to five if everything is shorter than like five and you save a lot of [01:09:14] than like five and you save a lot of computation all of the self-attention [01:09:16] computation all of the self-attention operations just work [01:09:18] operations just work so yeah [01:09:22] so yeah um [01:09:25] uh there's one hidden layer in the feed [01:09:26] uh there's one hidden layer in the feed forward yeah [01:09:28] forward yeah okay I should move on got a couple more [01:09:30] okay I should move on got a couple more things and not very much time okay [01:09:33] things and not very much time okay um but I'll be here after the class as [01:09:35] um but I'll be here after the class as well so in the encoder so the [01:09:37] well so in the encoder so the Transformer encoder is almost identical [01:09:39] Transformer encoder is almost identical but again we want bi-directional context [01:09:41] but again we want bi-directional context and so we just don't do the masking [01:09:44] and so we just don't do the 
masking right so I've got in my multi-head [01:09:45] right so I've got in my multi-head attention here I've got no masking and [01:09:48] attention here I've got no masking and so it's that easy to make the model [01:09:50] so it's that easy to make the model bi-directional okay [01:09:53] bi-directional okay um so that's easy so that's called the [01:09:54] um so that's easy so that's called the Transformer encoder it's almost [01:09:56] Transformer encoder it's almost identical but no masking and then [01:09:58] identical but no masking and then finally we've got the Transformer [01:09:59] finally we've got the Transformer encoder decoder which is actually how [01:10:02] encoder decoder which is actually how the Transformer was originally presented [01:10:04] the Transformer was originally presented in this paper attention is all you need [01:10:07] in this paper attention is all you need um and this is when we want to have sort [01:10:09] um and this is when we want to have sort of a bi-directional network here's the [01:10:11] of a bi-directional network here's the encoder it takes in say my source [01:10:12] encoder it takes in say my source sentence for machine translation it's [01:10:15] sentence for machine translation it's multi-headed attention is not masked and [01:10:18] multi-headed attention is not masked and I have a decoder to decode out my [01:10:21] I have a decoder to decode out my sentence now but you'll see that this is [01:10:23] sentence now but you'll see that this is slightly more complicated I have my [01:10:25] slightly more complicated I have my masked multi-head self-attention uh just [01:10:27] masked multi-head self-attention uh just like I had before in my decoder but now [01:10:30] like I had before in my decoder but now I have an extra operation which is [01:10:33] I have an extra operation which is called cross attention where I'm going [01:10:35] called cross attention where I'm going to use my decoder vectors as my queries 
[01:10:41] to use my decoder vectors as my queries then I'll take the output of the encoder [01:10:43] then I'll take the output of the encoder as my keys and values so now for every [01:10:46] as my keys and values so now for every word in the decoder I'm looking at all [01:10:49] word in the decoder I'm looking at all the possible words in the output of all [01:10:52] the possible words in the output of all of the blocks of the encoder yes [01:10:54] of the blocks of the encoder yes yeah [01:10:57] yeah longer because I know initially it was [01:10:59] longer because I know initially it was like the keys and the values how do we [01:11:01] like the keys and the values how do we get like a key in value separated from [01:11:03] get like a key in value separated from the output because then we collapse [01:11:05] the output because then we collapse those into the single output uh so we [01:11:09] those into the single output uh so we well how sorry how will we get the keys [01:11:11] well how sorry how will we get the keys and values out like how do we because [01:11:13] and values out like how do we because when we have the output didn't we [01:11:15] when we have the output didn't we collapse like the keys and values into [01:11:17] collapse like the keys and values into like a single output so the output we [01:11:20] like a single output so the output we capture those yeah the question is how [01:11:22] capture those yeah the question is how do you get the keys and values and [01:11:23] do you get the keys and values and queries out of this sort of single [01:11:25] queries out of this sort of single collapsed output now remember the output [01:11:26] collapsed output now remember the output for each word is just this weighted [01:11:28] for each word is just this weighted average of the value vectors for the for [01:11:31] average of the value vectors for the for the previous words right and then from [01:11:33] the previous words right and then from that 
output for the next layer we apply [01:11:36] that output for the next layer we apply a new key query and value transformation [01:11:38] a new key query and value transformation to each of them for the next layer of [01:11:40] to each of them for the next layer of self-attention [01:11:42] self-attention so it's not actually that you're [01:11:45] so it's not actually that you're here [01:11:47] here yeah you apply the key Matrix the query [01:11:50] yeah you apply the key Matrix the query Matrix to the output of whatever came [01:11:52] Matrix to the output of whatever came before it yeah [01:11:53] before it yeah um and so just in a little bit of math [01:11:55] um and so just in a little bit of math right we have [01:11:57] right we have um these vectors H1 through each n I'm [01:12:00] um these vectors H1 through each n I'm going to call them that are the output [01:12:01] going to call them that are the output of the encoder right and then I've got [01:12:04] of the encoder right and then I've got vectors that are the output of the [01:12:05] vectors that are the output of the decoder [01:12:07] decoder uh so I've got these Z's I'm calling the [01:12:09] uh so I've got these Z's I'm calling the output of the decoder and then I simply [01:12:11] output of the decoder and then I simply Define my keys and my values from the [01:12:16] Define my keys and my values from the encoder vectors these H's [01:12:19] encoder vectors these H's right so I take the H's I apply a key [01:12:20] right so I take the H's I apply a key Matrix and a value Matrix and then I [01:12:24] Matrix and a value Matrix and then I Define the queries from my decoder so my [01:12:26] Define the queries from my decoder so my queries here so this is why two of the [01:12:28] queries here so this is why two of the arrows come from the encoder and one of [01:12:30] arrows come from the encoder and one of the arrows comes from the decoder I've [01:12:32] the arrows comes from the decoder I've got 
[01:12:41] Okay, so that is it. I've got a couple of minutes; I want to discuss some of the results of Transformers, and I'm happy to answer more questions about Transformers after class. [01:12:53] So, you know, really the original results of Transformers: they had this big pitch, like, oh look, you can do way more computation because of parallelization, and they got great results in machine translation. [01:13:10] You had Transformers doing quite well, although not, like, astoundingly better than existing machine translation systems, but they were significantly more efficient to train, right? Because you don't have this parallelization problem, you could compute on much more data much faster, and you could make use of faster GPUs much more. [01:13:31] Um, you know, after that there were things like document generation,
where you had the old standard of sequence-to-sequence models, the LSTMs, and eventually everything became sort of Transformers all the way down. [01:13:44] Um, Transformers also enabled this revolution in pre-training, which we'll go over next class. [01:13:51] And sort of the efficiency, the parallelizability, allows you to compute on tons and tons of data, and so after a certain point, on standard large benchmarks, everything became Transformer-based. This ability to make use of lots and lots of data, lots and lots of compute, just put Transformers head and shoulders above LSTMs in, let's say, almost every modern advancement in natural language processing. [01:14:19] Um, there are many drawbacks and variants to Transformers. You know, the clearest one that people have tried to work on quite a bit is this quadratic compute problem. So this all pairs of
interactions, right, means that our total computation for each block grows quadratically with the sequence length. [01:14:36] And in a student's question we heard that, well, as the sequence length becomes long, if I want to process, you know, a whole Wikipedia article, a whole novel, that becomes quite infeasible. And actually, you know, that's a step backwards in some sense, because for recurrent neural networks it only grew linearly with the sequence length. [01:14:55] Um, other things people have tried to work on are better position representations, because the absolute index of a word is not really, you know, the best way maybe to represent its position in a sequence. [01:15:08] Um, and just to give you an intuition of quadratic sequence length: remember that we had this big matrix multiply here that resulted in this matrix of n by n, and
computing this is, you know, a big cost; it costs a lot of memory. [01:15:24] Um, and so there's been work... oh yeah, and so, you know, if you think of the model dimensionality as, like, a thousand, although today it gets much larger, then for a short sequence of n roughly 30, if you're computing n squared times d, 30 isn't so bad. But if you had something like 50,000, then n squared becomes huge and sort of totally infeasible. So people have tried to map things down to a lower-dimensional space to get rid of the quadratic computation. [01:15:53] But in practice, I mean, as people have gone to things like GPT-3 and ChatGPT, most of the computation doesn't show up in the self-attention, so people are wondering, is it even necessary to get rid of the attention operation's quadratic constraint? It's an open area of research whether this is really necessary.
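A quick back-of-the-envelope check of those numbers: the cost of forming the n-by-n score matrix is about n² times d multiply-adds (constants, the softmax, and the value product are ignored here):

```python
# Rough multiply-add count for forming the n x n attention score matrix
# QK^T: about n^2 * d, using the lecture's example numbers.
d = 1000                      # model dimensionality, as in the lecture
for n in (30, 50_000):        # a short sequence vs. a very long one
    print(f"n = {n:>6}: n^2 * d = {n * n * d:.1e}")
# n =     30: n^2 * d = 9.0e+05
# n =  50000: n^2 * d = 2.5e+12
```

So going from a 30-token sentence to a 50,000-token document multiplies the attention cost by nearly three million, which is the intuition behind the "totally infeasible" comment above.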
And then finally, there have been a ton of modifications to the Transformer over the last, you know, four or five-ish years, and it turns out that the original Transformer, plus maybe a couple of modifications, is pretty much the best thing there is, still. [01:16:31] Um, there have been a couple of things that end up being important: changing out the non-linearities and the feed-forward network ends up being important. But it's had lasting power so far, and so I think it's right for people to come through and think about how to improve it in various ways. [01:16:48] So, um, pre-training is on Tuesday. Good luck on assignment four, and then, yeah, we'll have the project proposal documents out tonight for you to talk about.

================================================================================ LECTURE 009 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9
- Pretraining
Source: https://www.youtube.com/watch?v=DGfCRXuNA2w
---
Transcript

[00:00:05] Hello, welcome to CS224N. Today we'll be talking about pre-training, which is another exciting topic on the road to modern natural language processing. [00:00:21] Um, okay, how is everyone doing? Thumbs up, thumbs sideways, thumbs down? Wow, no response bias there, all thumbs up. Oh, sorry, nice, I like that honesty, that's good. [00:00:34] Well, um, okay, so we're now, what is this, week five? Yes, it's week five, and we have a couple... so this lecture, the Transformers lecture, and then to a lesser extent Thursday's lecture on natural language generation, will be sort of the sum of lectures for the assignments you have to do. So assignment five is coming out on Thursday, and the topics covered in this lecture, the, you know, self-attention and Transformers, and again a little bit of natural language generation,
will be tested in assignment five, and then the rest of the course will go through some really fascinating topics in sort of modern natural language processing that should be useful for your final projects and future jobs and interviews and intellectual curiosity. [00:01:25] Um, but, you know, I think that today's lecture is significantly less technical in detail than last Thursday's on self-attention and Transformers, but it should give you an idea of this sort of world of pre-training and how it helps define natural language processing today. [00:01:46] Um, so a reminder about assignment five: your project proposals also are due next Tuesday. Please do get those in, try to get them in on time, so that we can give you prompt feedback about your project proposals. [00:02:01] Um, and yeah, so let's jump into it. [00:02:06] Okay, so what we're going to start with today is, um, a bit of a
technical detail on word structure, and sort of how we model the input sequence of words that we get. [00:02:19] So, um, when we were teaching word2vec, and sort of all the methods that we've talked about so far, we assumed a finite vocabulary, right? So we had a vocabulary V that you define via... whatever, you've looked at some data, you've decided what the words are in that data. And so, you know, you have some words, like "hat" and "learn", and you have this embedding; it's in red because you've learned it properly. Actually, let's replace "hat" and "learn" with "pizza" and "tasty", those are better. [00:02:50] Um, and so that's all well and good: you see these words in your model, and you have an embedding that's been learned on your data, to sort of know what to do when you see those words. But when you see some sort of variations, maybe
you see, like, "taaaaasty", and maybe a typo, like "laern", um, or maybe novel items, where it's a word that you as a human can understand as sort of a combination; this is called derivational morphology. Like this word "Transformerify", which means, you know, take this noun and give me back a verb that means to make more like that noun: to "Transformerify" NLP might mean to, you know, make NLP more like using Transformers, and such. [00:03:38] Um, and for each of these, right, this maybe didn't show up in your training corpus, and language is always doing this, right? People are always coming up with new words, and there's new domains, and, you know, young people are always making new words, it's great. And so it's a problem for your model, though, right? Because you've defined this finite vocabulary, and there's sort of no mapping
in that vocabulary for each of these things, even though their meanings should be relatively well defined based on the data you've seen so far; it's just that the string of characters that defines them isn't quite what you've seen. [00:04:13] And so what do you do? Well, maybe you map them to this sort of universal unknown token; this is "UNK", right? So it's like, oh, I see something, I don't know what it is, I've never seen it before, I'm going to say it's always represented by the same token, UNK. [00:04:26] Um, and so that's been done in the past, and that's sort of bad, right? Because it's totally, like, losing tons of information, um, but you know, you need to map it to something. [00:04:38] And so this is like a clear problem. I mean, it's a problem in English; in many of the world's languages it's a substantially larger problem, right? So, um, you know, English has relatively simple word structure:
there's a couple of conjugations for each verb, like, you know, eat, eats, eaten, ate. [00:05:00] Um, but in a language with much more complex morphology, or word structure, you'll have a considerably more complex set of things that you could see in the world. So here is a conjugation table for a Swahili verb, and it has over 300 conjugations. [00:05:20] And if I define the vocabulary so that every unique string of characters maps to its own word, then every one of the 300 conjugations would get an independent vector under my model, which makes no sense, because the 300 conjugations obviously have a lot in common, and differ by sort of meaningful extents. So you don't want to do this: I'd have to have a huge vocabulary if I wanted all conjugations to show up, and that's a mistake for efficiency reasons and for learning reasons. [00:05:51] Any questions so far? Cool.
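To make the failure mode concrete, here is a tiny sketch of a finite-vocabulary lookup with a UNK fallback (the vocabulary and words here are made up for illustration): every unseen form, whether a rare conjugation or a typo, collapses to the same index.

```python
# Toy finite-vocabulary lookup: anything not seen in training collapses
# to a single UNK index, losing all information about the word.
vocab = {"<unk>": 0, "pizza": 1, "tasty": 2, "eat": 3, "eats": 4}

def to_ids(tokens):
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(to_ids(["pizza", "eats"]))        # [1, 4]
print(to_ids(["taaaaasty", "laern"]))   # [0, 0] -- both unseen words look identical
```

With whole-word vocabularies, each of the 300 Swahili conjugations would either need its own entry in `vocab` or fall through to index 0.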
Okay. [00:05:56] Um, and so what we end up doing is we'll look at subword structure, subword modeling. So what we're going to do is we're going to say: I'm not going to even try to define what the set of all words is; I'm going to define my vocabulary to include parts of words. [00:06:18] Where am I... oh, right. So, um, I'm going to split words into sequences of known subwords, and there's a simple sort of algorithm for this where you start with all characters, right? So if I only had a vocabulary of all characters, and maybe, like, an end-of-word symbol, then for a finite data set, no matter what word I saw in the future, as long as I had seen all possible characters, I could take the word and say: I don't know what this word is, I'm going to split it into, like, all of its individual characters. So you won't have this UNK problem; you
can sort of represent any word. [00:06:58] And then you're going to find common adjacent characters and say, okay, "a" and "b" co-occur next to each other quite a bit, so I'm going to add a new word to my vocabulary: now it's all characters plus this new word "ab", which is a subword. [00:07:13] And likewise, so now I'm going to replace the character pair with the new subword, and repeat, until you add a lot, a lot, a lot of vocabulary items through this process of what things tend to co-occur next to each other. And so what you'll end up with is a vocabulary of very commonly co-occurring substrings, by which you can build up words. [00:07:33] And this was originally developed for machine translation, but then has been used considerably in pretty much all modern language models. So now we have "hat" and "learn": in our subword vocabulary, "hat" and "learn" showed up enough that they're their own individual words.
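The merge procedure just described is essentially byte-pair encoding, and it can be sketched compactly. This is a toy version (the corpus and the "</w>" end-of-word symbol are my conventions for illustration), not the exact algorithm of any particular tokenizer:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Tiny byte-pair-encoding sketch: repeatedly merge the most
    frequent adjacent symbol pair into a new vocabulary item."""
    # Each word starts as a list of single characters plus an end-of-word mark.
    corpus = [list(w) + ["</w>"] for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged symbol.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_bpe(["hat", "hat", "hats", "learn", "learn"], num_merges=4)
print(merges)
```

At tokenization time you then apply the learned merges (roughly, greedily matching the longest known subword), which is how a frequent word ends up as one token while a rare variant gets split into pieces.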
That's sort of good, right? Simple common words show up as a word in your vocabulary, just like you'd like them to. [00:07:57] But now "taaaaasty" maybe gets split into "taa", and then, you know, in some cases this hash-hash means, like, don't add a space next, right? So "taa##", and then "aaa##", and then "sty". So I've actually taken one sort of thing that seems like a word, and in my vocabulary it's now split into three subword tokens. [00:08:19] So when I pass this to my Transformer, or to my recurrent neural network, right, the recurrent neural network would take "taa" as just a single element, do the RNN update, and then take "aaa", do the RNN update, and then "sty". So it could learn to process constructions like this, and maybe I can even add more "aaa"s in the middle, right, and have it do something similar, instead of just seeing the entire word "taaaaasty" and not knowing
what it means. [00:08:51] Is that feedback? Yeah. How loud is that feedback? We good? Okay, I think we're fixed, great. [00:09:05] Um, and so, same with "Transformerify": maybe "Transformer" is its own word, and then "ify". And so you can see that you have sort of three learned embeddings instead of one sort of useless UNK embedding. This is just wildly useful, and variants of this algorithm are used pretty much everywhere in, like, modern NLP. [00:09:26] Questions? Yes: if we have three embeddings for "taaaaasty", do we just add them together? [00:09:33] So, the question is: if we have three embeddings for "taaaaasty", do we just add them together? So when we're actually processing the sequence, I'd see something like "I learned about the taa## aaa## sty", so they'd actually be totally separate tokens. But if I wanted to then say, what's my representation of this thing,
uh, it depends on what you want to do. Sometimes you average the contextual representations of the three, or look at the last one maybe; at that point it's unclear what to do, but everything sort of works okay. [00:10:11] Next question: how do you know where to split? Yeah, so, um, you know where to split based on the algorithm that I specified earlier for learning the vocabulary. So you've learned this vocabulary by just combining commonly co-occurring adjacent strings of letters, right? So, like, "a" and "b" co-occurred a lot, so now I've got a new word that's "ab". [00:10:33] Um, and then when I'm actually walking through and tokenizing, I try to split as little as possible, so I split words into the maximal subword, the one that takes up the most characters; there are algorithms for this. Yeah, so, like, I'm like, okay, if I want to split this up, you know, there's many ways I
could split it up, and you try to find some approximation of, like, what the best way to split it into the fewest words is. Yeah. [00:11:00] The question is: do people make use of punctuation in the character set? How do people do it? Yes, absolutely. So, you know, sort of from this point on, just assume that what text is given to these models is as unprocessed as possible. You take it, you try to make it sort of clean-looking text, where you've removed, you know, HTML tags maybe, if it's from the internet, or whatever. Um, but then beyond that, you process it as little as possible, so that it reflects as well as possible what people might actually be using this for. [00:11:35] Um, so maybe earlier in the course, when we were looking at word2vec, we might have thought about, oh, we don't want word vectors of punctuation or something like that. Um, now everything is just as close as possible to what the text you'd get with
So yes — in practice punctuation is in there, and "..." might be its own word, and maybe a sequence of hyphens too, because people make big bars across tables.

[00:12:11] Does the system treat words that are really themselves a whole word any differently from words that are just pieces of a word? No — the system has no idea. They're all just indices into your embedding vocabulary matrix, so they're all treated equally.

[00:12:41] What about really long words that are relatively common — if you're building up from single characters all the way, what happens then? Yeah, the question is what happens to very long words if you're building up character pairs and larger chunks. In practice the statistics speak really well for themselves: if a long word is very common, it will end up in the vocabulary, and if it's not very common, it won't. There are other algorithms that do slightly better in various ways, but the intuition — that you figure out what the common co-occurring substrings are, almost independent of length — is the right one to have. You can actually look at the learned vocabularies of a lot of these models, and you do see some long words, just because they showed up a lot.

[00:13:36] I'm curious how it weighs frequency: say there's "if" and "ify" — or "goodbye" on your next slide. "if" could be really common, so how does it weigh the frequency of a subword against its length? It tries to split into the smallest number of pieces, but what if splitting into three pieces meant one of them was super common? Yeah — so the question is, if "Transformer" is a subword in my vocabulary, and "ify" as a three-letter tuple is also a subword, how does it choose between taking the long, maybe less common one and splitting into more subwords? It's just a choice: we take the smallest number of subwords, because sequence length tends to be more of a bottleneck than having a bunch of very common, very short subwords — sequence length is a big problem in Transformers — and this seems to be what works. That said, splitting things into multiple candidate segmentations and running the Transformer on all of them to see which one works better is something people have done.
But yeah, having fewer, bigger subwords tends to be the best idea. I'm going to start moving on, though — feel free to ask me more questions about this afterward.

[00:14:56] Okay, so let's talk about pre-training in the context of the course so far. At the very beginning of the course we gave you the quote "you shall know a word by the company it keeps." That was the thesis of the distributional hypothesis — the meaning of a word is defined by, or at least reflected by, the words it tends to co-occur with — and we implemented it via word2vec. The same person who made that quote, J. R. Firth, had a separate, earlier quote that continues this notion of meaning as defined by context, which has something
along the lines of this: since the word shows up in context when we actually use it — when we speak to each other — the meaning of the word should be defined by the contexts it actually shows up in. "The complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." So the big difference here: at word2vec training time, if I have the word "record" — r-e-c-o-r-d — I get one vector (or two, but effectively one vector for the string "record"), and it has to learn, from the contexts it shows up in, that sometimes it means "record" the verb and sometimes "record" the noun. But I only have one vector to represent it, so when I use the word embedding of "record," it has this mixture meaning of both of its senses — it doesn't get to specialize and say, oh, this part means the verb and this part means the noun.
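To make the "record" point concrete, here is a toy illustration — made-up two-dimensional vectors, and a crude neighbor-averaging function standing in for a real contextual encoder like a Transformer — of why a single static vector can't separate the two senses but even a simple context-dependent encoding can:

```python
# Toy static embedding table: one vector per word type,
# shared by all occurrences (word2vec-style).
static_emb = {
    "i":      [0.1, 0.0],
    "record": [0.5, 0.5],   # one vector for both the verb and noun senses
    "the":    [0.0, 0.1],
    "play":   [0.9, 0.2],
}

def contextual(tokens):
    """Crude 'contextual' encoding: average each word's vector with its neighbors'."""
    out = []
    for i, _ in enumerate(tokens):
        window = tokens[max(0, i - 1): i + 2]           # the word plus its neighbors
        vecs = [static_emb[w] for w in window]
        out.append([sum(c) / len(vecs) for c in zip(*vecs)])
    return out

verb_ctx = contextual(["i", "record", "the", "play"])[1]   # "record" as a verb
noun_ctx = contextual(["play", "the", "record"])[2]        # "record" as a noun

# Static lookup is identical in both sentences; the contextual vectors differ.
print(static_emb["record"], verb_ctx, noun_ctx)
```

The averaging here is only a stand-in: the actual point is that any encoder whose output depends on the surrounding words can give the two occurrences of "record" different representations, while the static table cannot.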
And so word2vec is just going to fail at that. I can build better representations of language through contextual representations, using things like the recurrent neural networks or Transformers we used before to build up contextual meaning.

[00:17:02] So what we had before were pre-trained word embeddings, with a big box on top — a Transformer or an LSTM — that was not pre-trained. You learn your word embeddings via context; then you have a task like sentiment analysis, machine translation, or parsing; you initialize all the parameters of that box randomly; and you train it to predict your label. The big difference in today's work is that we're going to try to pre-train all the parameters.
So I have my big Transformer, and instead of just pre-training my word embeddings with word2vec, I'm going to train all of the parameters of the network, trying to teach it much more about language that I can use in my downstream tasks. Now the labeled data I have for, say, machine translation might be able to be smaller — I might not need as much of it — because I've already trained much more of the network than I would have if I'd only gotten word2vec embeddings.

[00:18:15] Okay, so here I've pre-trained this entire structure — the word embeddings and the Transformer on top — everything trained via methods we'll talk about today. What does this give you? First, very strong representations of language: the meaning of "record" the verb and "record" the noun will be different in these contextual representations, which know where in the sequence the word is and what words co-occur with it in the specific input, unlike word2vec, which has one representation for "record" independent of where it shows up. Second, strong parameter initializations for NLP models. In all of your homeworks so far you've built a natural language processing system more or less from scratch — how do I initialize this weight matrix? — and we always said: small, normally distributed noise, little values close to zero. Here we're going to say: just as we used the word2vec embeddings because they encoded structure, I'm going to start, say, my machine translation system from a parameter initialization that's given to me via pre-training. Third, it's going to give us probability distributions over language that we can use to generate text and otherwise — we'll talk about this.

[00:19:35] Okay, so whole models are going to be pre-trained. All of pre-training is effectively centered on this idea of reconstructing the input. You have an input — a sequence of text that some human generated — and the hypothesis is that by masking out part of it and tasking a neural network with reconstructing the original input, the network has to learn a lot about language, and about the world, in order to do a good job of the reconstruction. This is now a supervised learning problem, just like machine translation. I've taken this sentence that just existed — "Stanford University is located in, say, Palo Alto, California"
— or Stanford, California, I guess. By removing part of the sentence, I've made a label for myself: the input is the broken, masked sentence, and the label is "Palo Alto" (or "Stanford").

[00:20:39] If I give this example to a network and ask it to predict the masked word, then as it takes its gradient step on this input it's going to encode information about the co-occurrence between the context — "Stanford University is located in" — and "Palo Alto." So by tasking it with this, it might learn, say, where Stanford is. What else might it learn? Things about syntax: "I put ____ fork down on the table." Only a certain set of words can go there — "I put the fork down on the table," "I put a fork down on the table" — these are syntactic constraints. So the context shows me what kinds of words can appear in what kinds of contexts.
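The label-making step described above — raw text becomes a supervised (input, label) pair by removing one token — can be sketched in a few lines. This is my own minimal sketch; the `[MASK]` token name is an assumption borrowed from BERT-style models, not something specified in the lecture:

```python
# Turn a raw sentence into a self-supervised (input, label) pair
# by masking out the word at one position.
def make_masked_example(sentence, position, mask_token="[MASK]"):
    tokens = sentence.split()
    label = tokens[position]                      # the removed word is the label
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return " ".join(masked), label

inp, label = make_masked_example(
    "Stanford University is located in Stanford California", 5)
print(inp)    # → Stanford University is located in [MASK] California
print(label)  # → Stanford
```

No human annotation was needed: the pair exists because the sentence exists, which is what makes this recipe scale to enormous corpora.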
"The woman walked across the street, checking for traffic over ____ shoulder." Any ideas on what could go here? Right — "her." This is coreference between an entity being discussed in the world — this woman — and her shoulder. In linguistic terms, the word "her" here is a coreferent of "woman": it refers to the same entity in the discourse. So the network might be able to learn things about which entities are doing what, and where.

[00:21:56] It can learn things about semantics. If I have "I went to the ocean to see the fish, turtles, seals, and ____," then the word in the blank should be a member of the class that I, the person writing this sentence, am thinking of — the stuff I see when I go to the ocean and see these other things as well.
So in order to do this prediction task, maybe I learn about the semantics of aquatic creatures.

[00:22:22] Okay, what else could I learn? "Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____." What kind of task could I be learning from this prediction problem? Sentiment — exactly. This is just naturalistic text that someone wrote, but by saying "the movie was bad," I'm learning about the latent sentiment of the person who wrote it — what they were feeling about the movie at the time. So maybe if I see a new review later on, I can just paste in the review, say "the movie was ____," and if the model generates "bad" or "good," that could be implicitly solving the task of sentiment analysis.

[00:23:13] Here's another one: "Iroh went to the kitchen to make some tea."
make some tea standing next [00:23:17] kitchen to make some tea standing next to Ira Zuko pondered his Destiny Zuko [00:23:20] to Ira Zuko pondered his Destiny Zuko left the blank [00:23:22] left the blank okay so in this scenario we've got a [00:23:25] okay so in this scenario we've got a world implicitly that's been designed by [00:23:27] world implicitly that's been designed by the person who is creating this text [00:23:30] the person who is creating this text right I've got physical locations in the [00:23:32] right I've got physical locations in the discourse like the kitchen uh and I've [00:23:35] discourse like the kitchen uh and I've got Zuko uh we've got iros in the [00:23:38] got Zuko uh we've got iros in the kitchen Zuko's next to iro [00:23:40] kitchen Zuko's next to iro so Zuko must be in the kitchen [00:23:43] so Zuko must be in the kitchen so what could Zuko leave but the kitchen [00:23:46] so what could Zuko leave but the kitchen right and so in terms of you know latent [00:23:49] right and so in terms of you know latent Notions of embodiment and physical [00:23:50] Notions of embodiment and physical location the way that people talk about [00:23:53] location the way that people talk about people you know being next to something [00:23:54] people you know being next to something and then leaving something could tell [00:23:56] and then leaving something could tell you uh stuff about sort of yeah a little [00:24:00] you uh stuff about sort of yeah a little bit about how the world works even [00:24:04] so here's the secret sequence I was [00:24:06] so here's the secret sequence I was thinking about the sequence that goes [00:24:07] thinking about the sequence that goes one one two three five eight thirteen [00:24:09] one one two three five eight thirteen twenty one uh blank [00:24:12] twenty one uh blank and [00:24:13] and um you know this is a pretty tough one [00:24:16] um you know this is a pretty tough one right [00:24:17] right this is the 
Could a model, by looking at a bunch of numbers from the Fibonacci sequence, learn in general to predict the next one? That's a question you should be thinking about throughout the lecture.

[00:24:31] Okay — any questions on these examples of what you might learn from predicting the context? Okay, cool.

[00:24:45] So, a very simple way to think about pre-training: pre-training is language modeling. We saw language modeling earlier in the course, and now, instead of using my language model just to provide probabilities over the next word, I'm going to train it on that task. I'm going to actually model the distribution p_θ(w_t | w_1, ..., w_(t-1)) — the probability of word t given all the previous words. And there's a ton of data for this — an amazing amount in a lot of languages, especially English. (There's very little data for this in most of the world's languages, actually, which is a separate problem.)
But you can pre-train just through language modeling. I'm going to do the teacher-forcing thing: I have "Iroh," I predict "goes"; I have "Iroh goes," I predict "to." I train my LSTM or my Transformer to do this task, and then I just keep all the weights — I save all the network parameters.

[00:25:44] Then, once I have these parameters, instead of generating from my language model, I use them as an initialization for my parameters. So I have this pre-training / fine-tuning paradigm — two steps. Most of you — well, maybe not this year; let's say a large portion of you this year — will, in your final projects, be doing the pre-training / fine-tuning paradigm, where someone has done the pre-training for you.
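The teacher-forcing setup above ("Iroh" → "goes", "Iroh goes" → "to") amounts to turning one raw sentence into a batch of (prefix, next-word) training pairs. A minimal sketch (my own helper, not lecture code):

```python
# Build (prefix, next-word) training pairs for language-model pretraining:
# every prefix of the sentence is an input, the following word its label.
def next_token_pairs(sentence):
    tokens = sentence.split()
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for prefix, target in next_token_pairs("Iroh goes to make tea"):
    print(" ".join(prefix), "->", target)
# → Iroh -> goes
#   Iroh goes -> to
#   Iroh goes to -> make
#   Iroh goes to make -> tea
```

This is why no annotation is needed: a sentence of n words yields n-1 supervised examples just by existing.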
So in step one you have a ton of text, and you learn very general things about the distribution of words, and the latent things that distribution tells you about the world and about language. Then in step two you've got some task — maybe sentiment analysis — and maybe not very many labels, just a little bit of labeled data. You adapt the pre-trained model to the task you care about by taking further gradient steps on that task: you give it "the movie was ____," you predict "happy" or "sad," and you continue updating the parameters, starting from the initialization from pre-training.

[00:26:46] And this just works exceptionally well — unbelievably well — compared to training from scratch. Intuitively, that's because you've taken a lot of the burden of learning about language and about the world off of the data you've labeled for sentiment analysis, and handed that very general learning problem to the much more general task of language modeling.
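The "continue gradient steps from the pre-trained parameters" idea can be shown on a deliberately tiny numeric toy — not an NLP model, and the particular numbers (`w_pretrained = 1.8`, the three data points) are made up purely for illustration:

```python
# Toy picture of pretrain-then-finetune: fine-tuning is ordinary gradient
# descent, just started from pretrained parameters instead of from scratch.
def mse_loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def finetune(w_init, data, lr=0.1, steps=20):
    """Gradient descent on the one-parameter model y ≈ w * x, from w_init."""
    w = w_init
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

task_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # small labeled dataset, y ≈ 2x

w_pretrained = 1.8   # assumption: pretraining already left us near a good solution
w_scratch = 0.0      # random-ish initialization

# After a single fine-tuning step, the pretrained start already fits better.
assert mse_loss(finetune(w_pretrained, task_data, steps=1), task_data) < \
       mse_loss(finetune(w_scratch, task_data, steps=1), task_data)
```

Both starts eventually converge here; the point of the sketch is only that a good initialization means the small labeled dataset has much less work to do, which is the intuition given above.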
[00:27:08] You said we didn't have much data in other languages — what do you mean by that? Is it just text in that language, labeled in some way? So the question is: we have a lot of data in English but not in other languages — what is the data we don't have a lot of? It's literally just text, no annotations, because you don't need annotations to do language-model pre-training. The existence of a sequence of words that someone has written provides you with all these input–output pairs: input "Iroh," output "goes"; input "Iroh goes," output "to." Those are all labels, in a sense, that you've constructed from the input just existing. But
in most languages, even on the entire internet — I mean, there are about 7,000-ish languages on Earth — most of them don't have the billions of words that you might want to train these systems on. [00:28:07] [Student] If you pre-train the entire thing, do you still only have one vector representation per word? [00:28:11] The question is: if you're pre-training the entire thing, do you still learn one vector representation per word? You learn one vector representation that is the non-contextual input vector. So you have your vocabulary — you've got your embedding matrix, which is vocabulary size by model dimensionality — and so, yeah, "Iroh" has one vector, "goes" has one vector. But then the Transformer that you're learning on top of it takes in the sequence so far and gives a vector to each of them that's dependent on the context. In that case, though, at the input you only have one embedding per word.
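As an aside on the embedding matrix just described — vocabulary size by model dimensionality, one non-contextual vector per word type — here is a minimal sketch. The vocabulary, dimensionality, and values are made up purely for illustration:

```python
# A (vocab_size x d) embedding matrix: one non-contextual vector per
# word type, regardless of context. Toy numbers for illustration.
vocab = {"iroh": 0, "goes": 1, "to": 2, "the": 3, "kitchen": 4}
d = 4  # model dimensionality
embedding_matrix = [[0.01 * (i + 1) * (j + 1) for j in range(d)]
                    for i in range(len(vocab))]

def embed(word):
    """Look up the single input vector for a word type."""
    return embedding_matrix[vocab[word]]

# "iroh" gets the same input vector wherever it appears; only the
# Transformer layers on top produce context-dependent vectors.
print(len(embed("goes")))  # 4
```

The Transformer layers on top of this lookup are what turn these fixed per-type vectors into context-dependent ones.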
[00:28:48] [Student] Yeah — so what sort of metric would you use to evaluate it? It's supposed to be general, right? But there are application-specific metrics — which one do you use? [00:28:57] Yeah, so the question is: what metric do you use to evaluate pre-trained models, since they're supposed to be so general, while there are lots of very specific evaluations you could use? We'll get into a lot of that in the rest of the lecture. While you're training, you can use simple metrics that correlate with what you want but aren't actually what you want — just the probability quality, right? So you can evaluate the perplexity of your language model, just as you would have when you cared about language modeling, and it turns out to be the case that better perplexity correlates with all the stuff that's much harder to evaluate — lots and lots of different
tasks. But also, the natural language processing community has built very large benchmark suites of varying tasks to try to get at some notion of generality — although that's very, very difficult, even ill-defined — and so when you develop new pre-training methods, what you often do is pick a whole bunch of evaluations and show that you do better on all of them, and that's your argument for generality. [00:29:56] Okay. So why should this pre-training/fine-tuning two-part paradigm help? This is still an open area of research, but the intuitions are all you're going to take from this course. So, right: pre-training provides some starting parameters θ̂ — this is all the parameters in your network — from trying to take this minimum over all possible settings of your parameters of the pre-training loss, θ̂ = argmin_θ L_pretrain(θ).
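To make the perplexity metric mentioned a moment ago concrete: it is the exponentiated average negative log-probability the model assigns to the held-out tokens. A minimal sketch, with made-up token probabilities rather than a real model's outputs:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities a hypothetical language model assigned to each
# token of a held-out sentence (illustrative values).
probs = [0.2, 0.5, 0.1, 0.25]
print(round(perplexity(probs), 3))  # 4.472
```

Lower perplexity means the model found the held-out text less surprising; a uniform model over a vocabulary of size V has perplexity exactly V.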
[00:30:26] And then the fine-tuning process takes your data for fine-tuning — you've got some labels — and tries to approximate the minimum, through gradient descent, of the loss of the fine-tuning task, L_finetune(θ), but you start at θ̂: you start gradient descent at the θ̂ that your pre-training process gave you. And then, you know, if you could actually solve this min and wanted to, it sort of feels like the starting point shouldn't matter — but it really, really, really does. It really does. We'll talk a bit more about this later, but the process of gradient descent maybe sticks relatively close to θ̂ during fine-tuning: you start at θ̂ and then walk downhill with gradient descent until you hit a valley, and that valley ends up being really good.
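The two-stage recipe just described — get θ̂ by approximately minimizing a pre-training loss, then run gradient descent on the fine-tuning loss starting from θ̂ rather than from scratch — can be sketched on a toy problem. The 1-D quadratic losses here are purely illustrative stand-ins for real network losses:

```python
def gradient_descent(grad, theta, lr=0.1, steps=50):
    """Plain gradient descent on a 1-D loss."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Stand-in losses: pre-training "wants" theta near 3.0,
# the fine-tuning task "wants" theta near 3.5.
pretrain_grad = lambda t: 2 * (t - 3.0)   # d/dt of (t - 3.0)^2
finetune_grad = lambda t: 2 * (t - 3.5)   # d/dt of (t - 3.5)^2
finetune_loss = lambda t: (t - 3.5) ** 2

# Stage 1: pre-train to get theta_hat.
theta_hat = gradient_descent(pretrain_grad, theta=0.0)

# Stage 2: fine-tune for only a few steps, starting from theta_hat
# versus starting from a far-away initialization.
from_pretrained = gradient_descent(finetune_grad, theta_hat, steps=5)
from_scratch = gradient_descent(finetune_grad, 0.0, steps=5)

# Starting near theta_hat reaches a much lower fine-tuning loss
# in the same small number of steps.
print(finetune_loss(from_pretrained) < finetune_loss(from_scratch))  # True
```

With convex toy losses both runs would eventually converge; the point of the sketch is only that, for a limited budget of steps, the pre-trained starting point matters a lot.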
It's close to the pre-training parameters, which were really good for a lot of things. This is a cool place where practice and theory are sort of meeting: optimization people want to understand why this is so useful; NLP people just want to build better systems. So, yeah — maybe the stuff around θ̂ tends to generalize well. If you want to work on this kind of thing, you should talk about it. Yeah? [00:31:48] [Student] The classic result is that gradient descent sticks relatively close — but what if we were to use a different optimizer? How would that change the results? [00:31:54] The question is: if stochastic gradient descent sticks relatively close, what if we use a different optimizer? I mean, if we use any common variant of gradient descent — any first-order method like Adam, which we use in this course, or Adagrad —
they all have very, very similar properties. Other types of optimization we just tend not to use, so who knows. Ah — yeah? [00:32:22] [Student] Why does fine-tuning after pre-training work better than just fine-tuning but making the model bigger — adding more layers, more data? [00:32:30] Yeah, the question is: why does the pre-trained/fine-tuned paradigm work better than just making the model more powerful — adding more layers, adding more data — for just the fine-tuning? The simple answer is that you have orders of magnitude more data that's unlabeled — that's just text that you found — than you do carefully labeled data for the tasks that you care about, right? Because that's expensive to get: it has to be examples of your movie reviews, or whatever, that you've had someone label carefully. So on the internet you have something like at least five trillion, maybe ten trillion
words of this, and you have maybe a million words of your labeled data or whatever over here, so the scale is just way off. But there's also an intuition that learning to do a very, very simple thing like sentiment analysis is not going to get you a very generally able agent across a wide range of settings, compared to language modeling. So — it's hard to know how to put it — even if you have a lot of labeled data of movie reviews of the kind that people are writing today, maybe tomorrow they start writing slightly different kinds of movie reviews and your system doesn't perform as well. Whereas if you pre-trained on a really diverse set of text from a wide range of sources and people, it might be more adaptable to seeing stuff that doesn't quite look like the training data you showed it, even if you showed it a ton of training data. So one
of the big takeaways of pre-training is that you get this huge variety of text on the internet. You have to be very careful — I mean, yeah, you should be very careful — about what kind of text you're showing it and what kind you're not, because the internet is full of, you know, awful text as well. But some of that generality just comes from how hard this problem is and how much data you can show it. [00:34:36] [Student] With so much data, how do you then train it so that it considers the stuff that you're fine-tuning it with as more important — more salient — rather than just one in a billion articles? [00:34:50] Yeah, it's a good question. The question is: given that the amount of data on the pre-training side is orders of magnitude more than the amount on the fine-tuning side, how do you get across to the model that, okay, actually the fine-tuning
task is what I care about — like, focus on that? It's about the fact that I did the pre-training first and then I do the fine-tuning second, right? So I've gotten my parameter initialization from the pre-training, I've set it somewhere, and then I fine-tune: I move to where the parameters are doing well for this task afterward. And so, well, it might just forget a lot about how to do the pre-training task, because now I'm just asking it to do the fine-tuning task at this point. [00:35:30] Uh, I should move on, I think, but we're going to keep talking about this in much more detail, with more concrete elements. [00:35:41] Okay, so let's talk about model pre-training — oh wait, that did not advance the slides. [00:35:53] Nice. Okay, let's talk about model pre-training three ways. In our Transformers lecture Tuesday we talked about encoders, encoder-decoders, and decoders, and we'll do decoders
last, because many of the largest models being used today are all decoders, and so we'll have a bit more to say about them. [00:36:16] Right, so let's recall these three. Encoders get bidirectional context: you have a single sequence and you're able to see the whole thing, kind of like an encoder in machine translation. Encoder-decoders have one portion of the network that gets bidirectional context — that's like the source sentence of my machine translation system — and it's paired with a decoder that gets unidirectional context, so that I have this informational masking where I can't see the future, so that I can do things like language modeling: I can generate the next token of my translation, whatever. So you could think of it as: I've got my source sentence here and my partial translation here, and I'm sort of
decoding out the translation. And then decoder-only models are things like language models — we've seen a lot of this so far. There's pre-training for all three of these large classes of models, and how you pre-train them, and then how you use them, depends on the properties and proclivities of the specific architecture. So let's look at encoders first. [00:37:18] We've looked at language modeling quite a bit, but we can't do language modeling with an encoder, because encoders get bidirectional context, right? So if I'm down here at "I" and I want to predict the next word, it's a trivial task at this level here, because in the middle I was able to look at the next word — so I should just know it. There's nothing hard about learning to predict the next word here, because I could just look at it, see what it is, and then
copy it over. So when I'm training an encoder for pre-training, I have to be a little bit more clever. [00:37:57] In practice, what I do is something like this: I take the input and I modify it somewhat — I mask out words, sort of like I did in the examples I gave at the beginning of class. So "I ___ to the ___", right? And then I have the network predict. With this, I've built contextual representations, so now the vector representation of the blank sees the entire context around it here, and then I predict the word "went", and then here the word "store". [00:38:29] Any questions? [00:38:34] Okay, and you can see how this is doing something quite a bit like language modeling, but with bidirectional context: I've removed the network's information about the words that go in the blanks, and I'm training it to reconstruct that. So I only have loss terms, right —
I only ask it to actually do the prediction, compute the loss, and backpropagate the gradients for the words that I've masked out. And you can think of this as: instead of learning the probability of x, where x is a sentence or a document, this is learning the probability of x — the real document — given x̃, which is the corrupted document with some of the information missing. [00:39:14] Okay, and so maybe we get the sequence of vectors here, one per word, which is the output of my encoder, in blue. And then I'd say that for the words I want to predict, y_i, I draw them — the ∼ means the probability is proportional to a linear transformation, A h_i + b, of my representation, the last thing here. So this A h_i + b is the red portion here; then I do the prediction, and I train the entire network to do this.
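A minimal sketch of the prediction step just described — probabilities proportional to a linear transformation A h_i + b of the encoder output, with loss terms only at the masked positions. All matrices, encoder outputs, and sizes here are made-up toy values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(H, targets, masked_positions, A, b):
    """Cross-entropy summed only over masked positions:
    p(y_i) = softmax(A @ h_i + b)[y_i]."""
    loss = 0.0
    for i in masked_positions:
        h = H[i]
        logits = [sum(a * x for a, x in zip(row, h)) + b_v
                  for row, b_v in zip(A, b)]
        probs = softmax(logits)
        loss += -math.log(probs[targets[i]])
    return loss

# Toy setup: 3 positions, hidden size 2, vocab size 3.
H = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.4]]   # encoder outputs
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # vocab x hidden
b = [0.0, 0.0, 0.0]
targets = [0, 2, 1]   # true token ids at each position
masked = [1]          # loss only where we masked
print(masked_lm_loss(H, targets, masked, A, b) > 0)  # True
```

Positions outside `masked` contribute nothing to the loss, matching the point that gradients are only backpropagated for the masked-out words.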
Yes? [00:39:46] [Student] Do we just choose the words randomly, or is there something smarter you can do? [00:39:53] The question is: do we just choose words randomly to mask out, or is there a scheme? Mostly randomly — we'll talk about a slightly smarter scheme in a couple of slides, but yeah, just mostly randomly. [00:40:05] Yeah? [00:40:06] [Student] What was that last part on the bottom — the x̃, the masked version? [00:40:13] Yeah, so I'm saying that I'm defining x̃ to be the input part where I've got the masked version of the sentence, with these words missing, and then I'm defining a probability distribution: the probability of a sequence conditioned on the input being the corrupted, masked sequence. [00:40:39] Okay. [00:40:41] Um, so this brings us to a very, very popular NLP model that you need to know about. It's called BERT,
and it was the first one to popularize this masked language modeling objective. They released the weights of this pre-trained Transformer, which they pre-trained via something that looks a lot like masked language modeling, and you can download them and use them via code released by the company Hugging Face, which we have continued to bring up. Many of you will use a model like BERT in your final project, because it's such a useful builder of representations of language in context. So let's talk a little bit about the details of masked language modeling in BERT. [00:41:22] First, we take 15% of the subword tokens — remember, all of our inputs now are subword tokens; I've made them all look like words, but just like we saw at the very beginning of class, each of these tokens could be some portion, some subword — and I'm going to do a couple of
things with them. Sometimes I'm going to just mask out the word and then predict the true word. Sometimes I'm going to replace the word with a random sample of another word from my vocabulary and predict the real word that was supposed to go there. And sometimes I'm going to not change the word at all and still predict it. [00:42:05] The intuition is the following: if I just had to build good representations, in the middle of this network, for words that are masked out, then when I actually use the model at test time on some real review, to do sentiment analysis, there are never going to be any mask tokens — so maybe the model won't do a very good job, because it's like, oh, I have no job to do here, I only need to deal with the mask tokens.
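The three cases just listed can be sketched as follows. In the BERT paper, 15% of tokens are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged; the sentence and vocabulary below are toy stand-ins:

```python
import random

def corrupt_for_mlm(tokens, vocab, select_rate=0.15, rng=None):
    """Return (corrupted_tokens, prediction_positions).
    Of the selected positions: 80% -> [MASK], 10% -> a random
    token, 10% -> unchanged. The model must predict the true
    token at every selected position."""
    rng = rng or random.Random(0)
    corrupted, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < select_rate:
            positions.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged but still predict it
    return corrupted, positions

vocab = ["i", "went", "to", "the", "store", "kitchen"]
sentence = ["i", "went", "to", "the", "store"] * 20
corrupted, positions = corrupt_for_mlm(sentence, vocab)
print(len(positions) / len(sentence))  # roughly 0.15
```

Note that the loss is computed at all selected positions, including the 10% left unchanged — which is exactly what forces the model to build good representations of every word, not just the masked ones.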
By giving it sequences of words where sometimes the real word needs to be predicted and sometimes it has to detect whether the word is wrong, the idea is that now, when I give it a sentence that doesn't have any masks, it actually does a good job of representing all the words in context, because it could be asked to predict anything at any time. [00:42:58] Okay. So the folks at Google who defined this had a separate, additional task that is sort of interesting to think about. This was the BERT model from their paper: they had position embeddings, just like we saw in our Transformers lecture, and token embeddings, just like in the Transformers lecture, but then they also had this thing called a segment embedding, with two possible segments, segment A and segment B. And they had this additional task where they would get a big chunk of text for
segment A and a big chunk of text for segment B, and then they would ask the model: is segment B a real continuation of segment A — the text that actually came next — or did I just pick this big segment randomly from somewhere else? The idea is that this should teach the network some notion of long-distance coherence: the connection between a bunch of text over here and a bunch of text over there. [00:44:00] It turns out it's not really necessary, but it's an interesting idea, and similar things have continued to have some influence since then. Again, though, you should take away the intuition that we're trying to come up with hard problems for the network to solve, such that by solving them it has to learn a lot about language — and we're defining those problems by making simple transformations to, or removing information from, text that just happens to occur.
[00:44:29] Questions? [00:44:32] [Student:] For the plus signs, do we concatenate the vectors, or do we do element-wise addition? — We do element-wise addition. You could have concatenated them; however, one of the big conventions of all these networks is that you always have exactly the same number of dimensions everywhere, at every layer of the network. It just makes everything very simple, so saying everything is the same dimension and doing addition ends up being simpler. [00:45:07] [Student:] Why was the next-sentence prediction not necessary? — Well, one thing it does that's a negative is that the effective context length for a lot of your examples is halved.
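The element-wise addition of token, position, and segment embeddings discussed above can be sketched like this (the dimensions, vocabulary size, and random initialization are illustrative, not BERT's real values):

```python
import numpy as np

def bert_style_input(token_ids, segment_ids, d=8, vocab_size=100, max_len=32, seed=0):
    """Input-layer sketch: token, position, and segment embeddings all
    share the same dimension d and are summed element-wise (not
    concatenated), so every layer of the network sees d-dim vectors."""
    rng = np.random.default_rng(seed)
    tok_emb = rng.normal(size=(vocab_size, d))  # one row per vocabulary word
    pos_emb = rng.normal(size=(max_len, d))     # one row per position
    seg_emb = rng.normal(size=(2, d))           # segment A vs. segment B
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
```

Because everything shares dimension d, the sum is well defined and the network never has to handle mixed widths.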
[00:45:28] So one of the things that's useful about pre-training, seemingly, is that you get to build representations of very long sequences of text. This example is very short, but in practice segment A was going to be something like 250 words and segment B another 250, and in the paper that let us know this wasn't necessary, they always had a single long segment of 500 words. It seemed to be useful to always have this very long context, because longer contexts give you more information about the role each word is playing in that specific context. If I see one word on its own — say, "record" — it's hard to know what it's supposed to mean; but if I see a thousand words around it, it's much clearer what its role in that context is. So yes, it cuts the effective context size — that's one answer.
[00:46:19] Another thing is that this is actually a much more difficult task. There's a much more recent paper — I don't have it in the slides, but I'll give the link later — showing that these models are really, really bad at the next-sentence prediction task. So it could be that it was just too hard at the time, and it wasn't useful because the model was failing to do it at all. [00:46:44] [Student:] Why do we need next-sentence prediction at all? What about just masking and predicting? — Right, so the question is: why not just do the masking we saw before? That is indeed the thing you seem not to need. But as a matter of the history of the research, it was thought that this was useful, and the idea is that it required you to develop this pairwise notion: do these two segments of text interact, how do they interact, are they related?
It's a longer-distance notion, and many NLP tasks are defined on pairs of things, so they thought it might be useful. They published it with this, and then someone else came along, published a new model that didn't do it, and that model did better. So there are intuitions as to why it could work; it just didn't. [00:47:39] [Student:] Was it doing both? — Yes, BERT was doing both: this next-sentence prediction training as well as the masking training, all at the same time. And so you had to have a separate predictor head on top of BERT — a separate classification piece. One detail there is that there's a special word, CLS, at the beginning of every sequence in BERT, and you can define a predictor on top of that sort of fake word's embedding that says whether the next sentence is real or not.
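A sketch of that CLS-based predictor head (the shapes are illustrative, and the real BERT head also passes the [CLS] vector through a small pooling layer first):

```python
import numpy as np

def nsp_logits(hidden_states, W, b):
    """Next-sentence-prediction head sketch: take the contextual vector
    of the special [CLS] token (position 0 of every sequence) and map
    it with a linear layer to two logits: real continuation vs. random."""
    h_cls = hidden_states[0]        # (d,) vector for the [CLS] token
    return W @ h_cls + b            # (2,) logits

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(5, d))    # encoder output for a 5-token sequence
logits = nsp_logits(hidden, rng.normal(size=(2, d)), np.zeros(2))
```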
[00:48:17] Yeah — okay, I'm going to move on. [00:48:20] So this gets at the question we had earlier about how you evaluate these things. There are a lot of different NLP tasks out there, and when people were writing these papers, they would look at a ton of different evaluations that had been compiled as a set of things that were still hard for the systems of the day. Are you detecting paraphrases between questions — are two Quora questions actually the same question? That turns out to be hard. Can you do sentiment analysis on a hard dataset? Can you tell whether sentences are linguistically acceptable — grammatical or not? Are two sequences semantically similar — do they mean vaguely the same thing? And we'll talk a bit about natural language inference
later, but that's the task of deciding entailment: if I say "I saw the dog," that does not necessarily mean "I saw the little dog"; but saying "I saw the little dog" does mean "I saw the dog." That's the natural language inference task. [00:49:21] And the difference between the pre-pre-training days — this row here, before you had substantial amounts of pre-training — and BERT was striking; the field was taken aback in a way that's hard to describe. You had very carefully crafted architectures for each individual task, where everyone was designing their own neural network, doing things they thought were clever in how they defined all the connections and the weights for their task, so everyone was doing a different thing for each one of these tasks, roughly. All of that
was blown out of the water by: just build a big Transformer, teach it to predict the missing words a whole bunch, and then fine-tune it on each of these tasks. [00:50:06] This was just a sea change in the field — people were, I mean, amazed. It's a little less flashy than ChatGPT, I'll admit, but it's really part of the story that gets us there. [00:50:20] Okay — questions? [00:50:24] [Student:] During the pre-training stage, the encoder outputs some sort of hidden values; how do we connect those to the words we're trying to predict? — So the question is: the encoder output is a bunch of hidden values; how do we actually connect those values to the things we want to predict? I'm going to go on to the next slide to bring up this example. The encoder gives us, for each
[00:50:58] right so the encoder gives us for each input word token a vector of that token [00:51:02] input word token a vector of that token that represents the token in context and [00:51:04] that represents the token in context and the question is you know how do we get [00:51:06] the question is you know how do we get these representations and and turn them [00:51:08] these representations and and turn them into uh sort of answers for the tasks [00:51:11] into uh sort of answers for the tasks that we care about and [00:51:14] that we care about and um [00:51:14] um the answer comes back to [00:51:18] the answer comes back to do [00:51:18] do [Music] [00:51:21] [Music] something like this uh [00:51:30] something like this [00:51:31] something like this Maybe [00:51:37] wow sure [00:51:39] wow sure um so when we were doing a pre-training [00:51:40] um so when we were doing a pre-training right we had the Transformer that was [00:51:42] right we had the Transformer that was giving us our representations and we had [00:51:44] giving us our representations and we had this little last layer here this little [00:51:47] this little last layer here this little um sort of affine uh transformation that [00:51:50] um sort of affine uh transformation that moved us from the encoder's hidden State [00:51:51] moved us from the encoder's hidden State size to the vocabulary to do our [00:51:53] size to the vocabulary to do our prediction and we just removed this last [00:51:56] prediction and we just removed this last prediction layer here and let's say we [00:51:59] prediction layer here and let's say we want to do something that is uh [00:52:02] want to do something that is uh classifying the sentiment of the [00:52:04] classifying the sentiment of the sentence we just pick arbitrarily maybe [00:52:06] sentence we just pick arbitrarily maybe the last word in the sentence and we [00:52:08] the last word in the sentence and we stick a linear classifier on top and map [00:52:11] 
it to positive or negative, and then fine-tune the whole thing. [00:52:15] Okay. So the BERT model came in two sizes: one was 110 million parameters, one was 340 million — keep that in the back of your head, percolating, as we talk later about models with many, many more parameters. It was trained on 800 million words plus — that figure may be off, maybe it's 2.5 billion — but on the order of a billion words of text, give or take; quite a bit, still. And it was trained on what was considered at the time to be a whole lot of compute — it was Google doing this, and when they released it we thought, who has that kind of compute but Google? — although nowadays it's not considered to be very much. But fine-tuning is practical and common on a single GPU.
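The fine-tuning setup just described — drop the vocabulary-prediction layer and put a fresh linear classifier on one token's contextual vector — can be sketched as follows (the choice of the last token and all shapes are illustrative):

```python
import numpy as np

def sentiment_logits(hidden_states, W_cls, b_cls):
    """Fine-tuning head sketch: the pre-trained vocabulary-prediction
    layer is discarded, and a new linear classifier maps the contextual
    vector of one chosen token (here, the last) to two classes. During
    full fine-tuning, gradients update W_cls, b_cls AND the encoder."""
    h = hidden_states[-1]            # (d,) vector for the chosen token
    return W_cls @ h + b_cls         # (2,) logits: negative / positive

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(6, d))     # encoder output for a 6-token review
pred = int(np.argmax(sentiment_logits(hidden, rng.normal(size=(2, d)), np.zeros(2))))
```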
So you could take the BERT model that they'd spent a lot of time training and fine-tune it yourself, on your own task, on even a very small GPU. [00:53:20] So one question is: this seems really great — why don't we just use it for everything? And the answer is: well, what is the pre-training objective, and what is the structure of the pre-trained model good for? BERT is really good at filling in the blanks, but it's much less naturally used for actually generating text. I wouldn't want to use it to generate a summary of something, because it's not built for that: it has no natural notion of predicting the next word given all the words that came before it. So maybe I want BERT when I want a good representation of, say, a document — to classify it, give it one of a set of topic labels, or say whether it's toxic or non-toxic, or
whatever — but I wouldn't want to use it to generate a whole sequence. [00:54:12] Okay — some extensions of BERT. We had a question earlier about whether you just mask things out randomly. One thing that seems to work better is to mask out whole contiguous spans. If you mask just one subword — this one here is part of "irresistibly" — the problem is much easier than it would otherwise be, because you can tell very easily, from the subwords that came before it, what it should be; whereas if I mask a much longer sequence, it's a trade-off, but it makes for a harder problem. It ends up being better to do this span-based masking than random masking, and that might be because subwords make for very simple prediction problems when you mask out just one subword of a word rather than all of its subwords. [00:55:01] Okay, so
this ends up doing much better. [00:55:06] There's also a paper, the RoBERTa paper, which showed that the next-sentence prediction wasn't necessary. They also showed that they really should have trained BERT on a lot more text. So RoBERTa is a drop-in replacement for BERT: if you're thinking of using BERT, just use RoBERTa — it's better. And it gave us the intuition that we really don't know a whole lot about the best practices for training these things: you train for as long as you're willing to, and things do good stuff, and so on. It's very difficult to iterate on these models, because they're big and expensive to train. [00:55:41] Another thing you should know, for your final projects and the world ahead, is the notion of fine-tuning all the parameters of the network versus just a few of them. What we've talked
about so far is: you pre-train all the parameters, and then you fine-tune all of them as well, so all the parameter values change. An alternative, called parameter-efficient or lightweight fine-tuning, is to choose little bits of the parameters — some smart way of keeping most of the parameters fixed and fine-tuning only the others. The intuition is that the pre-trained parameters were really good, and you want to make the minimal change from the pre-trained model to the model that does what you want, so that you keep some of the generality, some of the goodness, of the pre-training. [00:56:26] One way this is done is called prefix tuning — prompt tuning is very similar — where you actually freeze all the parameters of the network. So I've pre-trained my network here, and I never change any of the parameter values; instead I make a
bunch of fake, pseudo-word vectors that I prepend to the very beginning of the sequence, and I train just them. It's sort of unintuitive: these would have been inputs to the network, but I'm specifying them as parameters, and I'm getting the whole thing to do my sentiment-analysis task just by changing the values of these fake words. This is nice because I keep all the good pre-trained parameters and just specify this diff, which ends up generalizing better — this is a very open field of research. It's also cheaper, because I don't have to compute or store the gradients and all the optimizer state with respect to the frozen parameters; I'm only training a very small number of parameters.
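A minimal sketch of prefix tuning, with a single frozen layer standing in for the whole pre-trained network (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prefix, seq_len = 16, 4, 10

# Pre-trained weights: frozen, never updated during fine-tuning.
W_frozen = rng.normal(size=(d, d))

# The ONLY trainable parameters: a few pseudo-word vectors prepended
# to every input sequence.
prefix = rng.normal(size=(n_prefix, d))

def encode(input_embs):
    """Prepend the trainable prefix, then run the frozen network."""
    x = np.concatenate([prefix, input_embs], axis=0)
    return np.tanh(x @ W_frozen)     # stand-in for the frozen encoder

out = encode(rng.normal(size=(seq_len, d)))
n_trainable, n_frozen = prefix.size, W_frozen.size   # 64 vs. 256 here
```

Only `prefix` would receive gradients and optimizer state, which is why this is so much cheaper than full fine-tuning.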
[00:57:33] [Student:] Those are parameters acting as if they were inputs — does it matter whether they go at the beginning or the end? — In a decoder you have to put them at the beginning, because otherwise you don't see them before you've processed the whole sequence. [00:57:50] [Student:] Could we attach a few new layers on top and train only those? — Absolutely; that works a bit better too. Another thing that works well — sorry, we're running out of time — is to take each weight matrix in my Transformer, freeze it, and learn a very low-rank little diff: I set the weight matrix's value to be the original value plus my very low-rank diff from the original one. This ends up being a similarly useful technique, and the overall idea, again, is that I'm learning far fewer parameters than I did via pre-training, and freezing most of the pre-trained parameters.
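A sketch of that low-rank diff (this is the idea now usually called LoRA; the rank and shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                          # full dimension, low rank (r << d)

W_frozen = rng.normal(size=(d, d))    # pre-trained weight matrix: frozen

# Trainable low-rank factors. B starts at zero so the initial diff is
# zero and the model begins exactly at its pre-trained behavior.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x):
    """Effective weight = original frozen value plus the low-rank diff."""
    return x @ (W_frozen + B @ A).T

x = rng.normal(size=(3, d))
y = adapted_forward(x)                # trainable params: 2*r*d=64 vs. d*d=256
```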
pre-training and freezing most of the pre-training parameters [00:58:38] pre-training parameters okay encoder decoders so um for encoder [00:58:42] okay encoder decoders so um for encoder decoders we could do something like [00:58:44] decoders we could do something like language modeling right I've got my [00:58:45] language modeling right I've got my input sequence here encoder output [00:58:48] input sequence here encoder output sequence here and I could say this part [00:58:51] sequence here and I could say this part is my prefix for sort of having [00:58:53] is my prefix for sort of having bi-directional context and I could then [00:58:55] bi-directional context and I could then predict all the words that are sort of [00:58:58] predict all the words that are sort of in the latter half of the sequence just [00:59:01] in the latter half of the sequence just like a language model and that would [00:59:02] like a language model and that would work fine [00:59:04] work fine um and so this this is something that [00:59:06] um and so this this is something that you could do right you sort of take it [00:59:07] you could do right you sort of take it along text split it into two give half [00:59:10] along text split it into two give half of it to the encoder and then generate [00:59:12] of it to the encoder and then generate the second half with the decoder [00:59:15] uh but in practice what works much [00:59:18] uh but in practice what works much better is this notion of span corruption [00:59:20] better is this notion of span corruption span corruption is going to show up in [00:59:21] span corruption is going to show up in your assignment five and the idea here [00:59:24] your assignment five and the idea here is a lot like Bert but uh in a sort of [00:59:28] is a lot like Bert but uh in a sort of generative sense where I'm going to mask [00:59:31] generative sense where I'm going to mask out a bunch of words in the input thank [00:59:34] out a bunch of words in the 
input: "Thank you <mask token 1> me to your party <mask token 2> week." [00:59:40] And then at the output I generate the mask token and then what was supposed to be there, but the mask token replaced it, right? So from "Thank you" I predict "for inviting", and for the blank in "me to your party ___ week" I predict "last". And what this does is that it allows you to have bidirectional context, right? I get to see the whole sequence, except I can generate the parts that were missing. [01:00:07] So this feels a little bit like you mask out parts of the input, but you actually generate the output as a sequence, like you would in language modeling. So this might be good for something like machine translation, where I have an input that I want bidirectional context on, but then I want to generate an output, and I want to pre-train the whole thing. So this was shown to work better than language modeling at the scales that these, uh, folks at Google
were able to test, back in 2018. This is still quite popular. [01:00:35] Um, yeah, there are a lot of numbers; it works better than the other stuff; I'm not going to worry about it. [01:00:42] You know, there's a fascinating property of these models also. So T5 was the model that was originally introduced with salient span masking, and you can think of it as: at pre-training time you saw a bunch of things like "Franklin D. Roosevelt was born in [blank]" and you generated out the blank. And there's this task called open-domain question answering, which has a bunch of trivia questions, like "when was Franklin D. Roosevelt born?", and then you're supposed to generate out the answer as a string, just from your parameters, right? So you did a bunch of pre-training, you saw a bunch of text, and then you're supposed to generate these answers. And
what's fascinating is that this salient span masking method allowed you to pre-train, then fine-tune on some examples of trivia questions, and then when you tested on new trivia questions, the model would sort of implicitly extract from its pre-training data, somehow, the answer to that new question, which it never saw explicitly at fine-tuning time. So it learned this sort of implicit retrieval. Sometimes, you know, less than 50% of the time or whatever, but much more than random chance. [01:01:59] And that's just sort of fascinating, right? You've learned to access this latent knowledge that you stored up by pre-training. So, yeah, you just pass it the text "when was Roosevelt born?" and it would pass out an answer. And one thing to know is that the answers always look very fluent, they always look very reasonable, but they're
frequently wrong, and that's still true of things like ChatGPT. [01:02:25] Okay, so that's encoder-decoder models. Next up we've got decoders, and we'll spend a long time on decoders. So this is just our normal language model: I get a sequence of hidden states for my decoder, the words can only look at themselves, not the future, and then I predict, you know, the next word in the sentence. And then here again, to do sentiment analysis, I can maybe take the state for the last word and then predict happy or sad based on that last embedding, back-propagate the gradients through the whole network and train the whole thing, or do some kind of lightweight or parameter-efficient fine-tuning like we mentioned earlier. So this is pre-training a decoder, and I can just pre-train it on language modeling. [01:03:12]
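As a toy sketch of the sentiment setup just described: take the decoder's hidden state at the last position and score happy vs. sad with a linear head. All shapes and names here are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a decoder's output: hidden states for a 5-token input.
seq_len, hidden_dim, num_classes = 5, 8, 2  # classes: 0 = sad, 1 = happy
hidden_states = rng.standard_normal((seq_len, hidden_dim))

# Classification head: a single linear layer applied to the LAST position,
# since in a decoder only that state has seen the entire input.
W = rng.standard_normal((hidden_dim, num_classes)) * 0.1
b = np.zeros(num_classes)

last_state = hidden_states[-1]      # (hidden_dim,)
logits = last_state @ W + b         # (num_classes,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax over {sad, happy}

print(probs)
```

In practice you would back-propagate a cross-entropy loss from these logits either through the whole network (full fine-tuning) or only into the head plus a few extra parameters (the lightweight, parameter-efficient option mentioned above).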
Um, so again, you might want to do this if you are wanting to generate text, generate things. You sort of can use this like you use an encoder-decoder, but in practice, as we'll see, a lot of the biggest, most powerful pre-trained models tend to be decoder-only. It's not really clear exactly why, except they seem a little bit simpler than encoder-decoders, and you get to share all the parameters in one big network for the decoder, whereas in an encoder-decoder you have to split them, some into the encoder, some into the decoder. So for the rest of this lecture we'll talk only about decoders; even in modern things, the biggest networks do tend to be decoders. [01:04:00] So we're coming all the way back again to 2018, and the GPT model from OpenAI was a big success. It had 117 million parameters,
uh, it had, you know, 768-dimensional hidden states, and it had this vocabulary of 40,000-ish words that was defined via a method like what we showed at the beginning of class, trained on BooksCorpus. And, um, you know, the name "GPT" never actually showed up in the original paper; it's unclear what exactly it's supposed to refer to. [01:04:39] But this model was a precursor to all the things that you're hearing about nowadays. [01:04:55] So if we wanted to do something like natural language inference, which says, you know, take these pairs of sentences, "the man is in the doorway", "the person is near the door", and say that one entails the other, that the premise entails the hypothesis, that I can believe the hypothesis if I believe the premise, I just sort of concatenate them together, right? So give it maybe a start
token, pass in one sentence, pass in some delimiter token, pass in the other, and then predict, sort of, yes/no: entailment or not entailment. Fine-tuning GPT on this worked really well. [01:05:33] And then, you know, BERT came after GPT. BERT did a bit better; it had bidirectional context. But, you know, GPT did a sort of excellent job. And then came GPT-2, where they focused more on the generative abilities of the network. So, right, we now looked at a much larger network; we've gone from 117 million to 1.5 billion parameters, and given some sort of prompt, it could generate, at the time, a quite surprisingly coherent continuation to the prompt. So it's telling this sort of story about scientists and unicorns here. [01:06:11] And this size of model is still small enough that you can use it on a small GPU and fine-tune it and whatever, and its capability of generating
long, coherent texts was just exceptional at the time. It was also trained on more data, something like 9 billion words of text. [01:06:35] And then after GPT-2 we come to GPT-3, sort of walking through these models, and we come to a different way of interacting with the models. We've interacted with pre-trained models in two ways so far: we've sampled from the distribution that they define (we've generated text, via, like, a machine translation system or whatever), or we've fine-tuned them on a task that we care about and then taken their predictions. [01:07:03] But GPT-3 seems to have an interesting new ability: it's much larger, and it can do some tasks without any sort of fine-tuning whatsoever. GPT-3 is much larger than GPT-2, right? So we went from GPT at 100-ish million parameters, to GPT-2 at 1.5 billion, to GPT-3 at 175 billion
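For a feel of these scale jumps, here is the rough cost heuristic the lecture brings up shortly (training cost scales roughly with parameters × training tokens), applied to these sizes. The token counts are the lecture's approximate figures, and this is only an order-of-magnitude sketch in arbitrary units.

```python
# Rough heuristic: training cost ~ (parameters) x (training tokens).
# Token counts are the approximate figures mentioned in the lecture;
# only the ratios between the numbers mean anything.
models = {
    "GPT-2": (1.5e9, 9e9),    # ~1.5B params, ~9B words
    "GPT-3": (175e9, 300e9),  # ~175B params, ~300B words
}

costs = {name: params * tokens for name, (params, tokens) in models.items()}
for name, cost in costs.items():
    print(f"{name}: cost ~ {cost:.3g} param-tokens")

# The Chinchilla-style observation: at a FIXED compute budget, halving the
# parameter count lets you afford twice as many training tokens.
budget = costs["GPT-3"]
tokens_at_half_size = budget / (175e9 / 2)
print(f"tokens affordable at half of GPT-3's size: {tokens_at_half_size:.3g}")
```

By this crude measure GPT-3 cost thousands of times more than GPT-2 to train, which is why the parameters-versus-data allocation question matters so much.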
much larger, trained on 300 billion words of text. And this notion that it could figure out patterns in the example that it's currently seeing, and continue the pattern, is called in-context learning. [01:07:42] So you've got, you know, the word "thanks", and I pass in this little arrow and say, okay, "thanks" goes to "merci", and then "hello" goes to "bonjour", and then, you know, they give it all of these examples and ask it what "otter" should go to, and it's learned to continue the pattern and say that this is the translation of "otter". So now remember, this is a single input that I've given to my model, and I haven't said "oh, do translation", or fine-tuned it on translation, or whatever. I've just passed in the input, given it some examples, and then it is able, to some extent, to do this seemingly complex task. [01:08:22]
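The translation example above is literally just one long input string. Here is a minimal sketch of building such a few-shot prompt; the arrow layout is made up for illustration (not the exact formatting from the GPT-3 paper), and the word pairs are the ones from the lecture.

```python
# Build a few-shot in-context learning prompt: demonstrations of an
# English -> French pattern, then a query for the model to continue.
demonstrations = [
    ("thanks", "merci"),
    ("hello", "bonjour"),
]

prompt = "".join(f"{en} -> {fr}\n" for en, fr in demonstrations)
prompt += "otter -> "  # the model is asked to continue the pattern

print(prompt)
```

No gradient update happens anywhere here: whatever "learning" occurs is entirely in how the model conditions on this single input when predicting the next tokens.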
That's in-context learning. [01:08:25] And here are more examples: maybe you give it examples of addition, and then it can do some simple addition afterward; or, in this case, rewriting typos, where it can figure out how to rewrite typos; or in-context learning for machine translation. And this was the start of this idea that there were these emergent properties that showed up in much larger models, and it wasn't clear when looking at the smaller models that you'd get this qualitatively new behavior out of them. [01:08:57] Right, it's not obvious from just the language modeling signal, right? GPT-3 is just trained on that decoder-only objective, just predict the next word, and yet, as a result of that training, it learns to perform seemingly quite complex things as a function of its context. [01:09:15] Um, yeah, okay, one or two questions about
that. [01:09:26] This should be quite surprising, I think, right? So far we've talked about good representations, contextual representations, meanings of words in context. This is some very, very high-level pattern matching, right? It's coming up with patterns in just the input data, that one sequence of text that you've passed it so far, and it's able to identify how to complete the pattern. And as you think about what kinds of things this can solve, what its capabilities are, what its limitations are, this ends up being an open area of research: sort of, what are the kinds of problems that it maybe saw in the training data? Like, maybe GPT-3 saw a ton of pairs of words, right? It saw a bunch of, you know, bilingual dictionaries in its training data, so it learned to do something like this. Or is it doing something much more general, where it's really learning the
task in context? You know, the actual story, we're not totally sure; something in the middle. It seems like it has to be tied to your training data in ways that we don't quite understand, but there's also a non-trivial ability to learn new, at least, types of patterns just from the context. So this is a very interesting thing to work on. [01:10:34] Now, we've talked a lot about the size of these models so far, and as models have gotten larger they've always gotten better, and we train them on more data. Right, so GPT-3 was trained on 300 billion words of text, and it was 175 billion parameters. And, you know, at that scale it costs a lot of money to build these things, and it's very unclear whether you're getting the best use out of your money. Like, is bigger really what you should have been doing, in terms of the number of parameters? Um, so, you know,
the cost of training one of these is roughly: you take the number of parameters and you multiply it by the number of tokens that you're going to train it on, the number of words. And some folks at DeepMind (I forget the citation on this) realized through some experimentation that actually GPT-3 was just comically oversized, right? So Chinchilla, the model they trained, is less than half the size and works better, but they just trained it on way more data. [01:11:34] And this is sort of an interesting trade-off about, you know, how do you best spend your compute? I mean, you can't do this more than a handful of times, even if you're, you know, Google. So, you know, open questions there as well. [01:11:47] Another way of interacting with these networks that has come out recently is called chain of thought. [01:11:56] So, the prefix, right? We saw in the in-context learning
slide that the prefix can help specify what task you're trying to solve right now, and it can do even more. So here's standard prompting: we have a prefix of examples of questions and answers, so you have a question and then an example answer; that's your prompt, that's specifying the task. And then you have a new question, and you're having the model generate an answer, and it generates it wrong. [01:12:22] And chain-of-thought prompting says, well, how about in the demonstration we give the question and then we give this sort of decomposition of steps towards how to get an answer, right? So I'm actually writing this out as part of the input; I'm giving annotations, as a human, to say, oh, you know, to solve this sort of word problem, here's how you could think it through, ish. And then I give it a new question, and the model says, oh, I know what I'm
supposed to do: I'm supposed to first generate a sequence of intermediate steps, and then next say "the answer is" and then say what the answer is. And it turns out, and this should again be very surprising, that the model tends to generate plausible sequences of steps, and then much more frequently generates the correct answer after doing so, relative to trying to generate the answer by itself. [01:13:17] So you can think of this as a scratchpad; you can think of this as increasing the amount of computation that you're putting into trying to solve the problem, sort of writing out your thoughts, right? As I generate each word of this continuation, I'm able to condition on all the past words so far, and so maybe it just allows the network to decompose the problem into smaller, simpler problems, each of which it is more able to solve. [01:13:47] No one's really
sure why this works exactly, either, at this point. With networks that are this large, the emergent properties are both very powerful and exceptionally hard to understand, and very hard, you should think, to trust, because it's unclear what their capabilities are and what their limitations are, where they will fail. [01:14:09] So what do we think pre-training is teaching? Gosh, a wide range of things, even beyond what I've written in this slide, which I mostly wrote two years ago. Right, so it can teach you trivia, and syntax, and coreference, and maybe some lexical semantics, and sentiment, and some reasoning, like way more reasoning than we would have thought even three years ago. And yet these models also learn and exacerbate racism and sexism, all manner of biases. There's more on this later, but the generality of this is really, I think, what's taken
many people aback. And so increasingly these objects are not just studied for the sake of using them, but studied for the sake of understanding anything about how they work and how they fail. [01:14:56] Yeah, any questions? [01:15:05] Has anyone tried, like, benchmarking GPT for programming tasks, like how accurate it is, etc.? Yeah, the question is, has anyone tried benchmarking GPT for programming tasks, has anyone seen how well it does? Um, yes, so there are definitely examples of people using GPT-3 and GPT-4 for simple programming things, and, you know, the modern state-of-the-art competitive programming bots are all based on ideas from language modeling, and I think they're all also based on pre-trained language models themselves. Like, if you just take all of these ideas and apply them to, like, GitHub, then you get some very interesting emergent behaviors relating to code, uh,
and so yeah, I think all of the best [01:15:53] systems use this more or less, so lots of [01:15:56] benchmarking there, for sure. [01:15:59] is this the basis for what, like, GitHub [01:16:01] Copilot is going to do? the question is, [01:16:03] is this, what we just [01:16:04] mentioned, the basis for the GitHub [01:16:06] Copilot system? yes, absolutely. [01:16:10] we don't know exactly what it is in [01:16:12] terms of details, but it's all these [01:16:14] ideas. [01:16:15] what if you have a situation where you [01:16:17] have, you know, still a large amount of [01:16:19] data for, you know, general data, and then [01:16:21] you have also a large amount of data for [01:16:23] your fine-tuning task? at what point is it [01:16:25] better to train a new model for that [01:16:28] fine-tuning versus, you know, get data from [01:16:30] both? yeah, the question is, what if you [01:16:32] have a large amount of data for [01:16:33] pre-training and a large amount of data [01:16:34] for fine-tuning, when is it better to do [01:16:37] sort of a separate training on just the [01:16:39] fine-tuning data? [01:16:41] um, almost never. if you [01:16:44] have a bunch of data for the task that [01:16:47] you care about, what's frequently done [01:16:49] instead is three-part training, where you [01:16:52] pre-train on a very broad corpus, then [01:16:55] you sort of continue to pre-train, using [01:16:57] something like language modeling, on an [01:16:59] unlabeled version [01:17:01] of the labeled data that you have; you just, [01:17:03] like, strip the labels off and just treat [01:17:04] it all as text and do language modeling [01:17:06] on that, adapt the parameters a little [01:17:08] bit, and then do the final stage of [01:17:11] fine-tuning with the labels that you [01:17:12] want, and that works even better; there's this [01:17:14] interesting paper called Don't Stop [01:17:16] Pre-training. [01:17:18] nice, uh, final question. [01:17:21] that's a lot of questions... anyone new? [01:17:24] someone new with a question? [01:17:30] yeah, um, I was wondering, do you know if [01:17:33] there's, like, a lot of instances where a [01:17:35] pre-trained model can do some tasks it's not [01:17:38] seen before, I don't know? [01:17:40] yeah, so are there any instances of where [01:17:42] a pre-trained model can do a task
that [01:17:44] it hasn't seen before, uh, you know, [01:17:46] without fine-tuning? the question is, what [01:17:47] does 'hasn't seen before' mean, right? like, [01:17:51] these models, especially GPT-3 and similar [01:17:53] very large models, you know, during [01:17:55] pre-training, did it ever see something [01:17:57] exactly like this sort of word-problem [01:18:00] arithmetic? maybe, maybe not, it's actually [01:18:03] sort of unclear. it's clearly able to [01:18:06] recombine sort of bits and pieces of [01:18:08] tasks that it saw implicitly during [01:18:10] pre-training. we saw the same thing with [01:18:12] trivia, right? like, language modeling [01:18:13] looks a lot like trivia sometimes, where [01:18:15] you just read the first paragraph of a [01:18:18] Wikipedia page and it's kind of like [01:18:20] answering a bunch of little trivia [01:18:21] questions about where someone was born [01:18:22] and when. [01:18:24] um, but, like, it's never seen something [01:18:25] quite like this, and it's actually still [01:18:27] kind of astounding how much it's able to [01:18:29] do things that don't seem like they [01:18:32] should have shown up all that directly [01:18:34] in the pre-training data; quantifying [01:18:37] that extent is an open research problem. okay, that's it, let's call it. ================================================================================ LECTURE 010 ================================================================================ Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation Source: https://www.youtube.com/watch?v=N9L32bFieEY --- Transcript [00:00:05] hello everyone, [00:00:07] um, my name is Lisa, I'm a third-year PhD [00:00:09] student in the NLP group, I'm advised by [00:00:11] Percy and Tatsu. today I will give a [00:00:14] lecture on natural language generation, [00:00:15] and this is also the research area that [00:00:18] I work on, so I'm super excited about it. [00:00:20] I'm happy to answer any questions, both [00:00:22] during the lecture and after class, about [00:00:24] natural language generation. so NLG is a [00:00:27] super exciting area and is also moving [00:00:30] really, really fast, so today we will [00:00:33] discuss all the excitement of NLG. [00:00:36] but before we get into the really [00:00:37] exciting part, I have to make some [00:00:39] announcements. so first, it is very, very
important for you to remember to sign up [00:00:44] for AWS by midnight today. so this [00:00:48] concerns, this is related to, your homework [00:00:50] 5, whether you have GPU access, and then [00:00:53] also related to our final project. so [00:00:55] please, please remember to sign up [00:00:57] for AWS by tonight. and second, the [00:01:01] project proposal is due on Tuesday, next [00:01:04] Tuesday, and I think assignment 4 should [00:01:07] be just about due; hopefully you had fun with the [00:01:10] machine translation and stuff. and also, [00:01:13] assignment 5 is out today, I think just [00:01:16] now, and it is due on Friday, uh, like [00:01:20] basically Friday midnight. and, uh, last, we [00:01:24] will hold a, I will hold a, [00:01:26] Hugging Face Transformers library [00:01:27] tutorial this Friday, so if your final [00:01:31] project is related to implementing [00:01:33] Transformers or playing with large [00:01:34] language models, you should definitely go [00:01:36] to this tutorial, because it's going to [00:01:37] be very, very helpful. [00:01:40] um, also, yeah, just one more time, please [00:01:42] remember to sign up for AWS, because this [00:01:44] is the final hard deadline. [00:01:47] okay, cool, now moving on to the main [00:01:50] topic for today, [00:01:51] um, the very exciting natural language [00:01:53] generation stuff. so today we will [00:01:55] discuss what NLG is, review some [00:01:57] models, discuss how to decode from [00:02:00] language models and how to train [00:02:01] language models, [00:02:03] um, and we will also talk about [00:02:05] evaluations, and finally we'll discuss [00:02:07] ethical and risk considerations with [00:02:09] current NLG systems. so these natural [00:02:12] language generation techniques are going [00:02:14] to be really exciting, because this is [00:02:16] kind of getting us closer to explaining the [00:02:18] magic of ChatGPT, which is a super [00:02:21] popular model recently, and, practically [00:02:23] speaking, they could also help you with [00:02:25] your final project if you decide to work [00:02:27] on something related to text generation. [00:02:29] so, um, let's get started. to begin with, [00:02:32] let's ask the question of what is [00:02:34] natural language generation?
so natural language generation is [00:02:38] actually a really broad category. people [00:02:40] have divided NLP into natural language [00:02:43] understanding and natural language [00:02:45] generation. so the understanding part [00:02:47] mostly means that the task input is in [00:02:50] natural language, such as semantic [00:02:52] parsing, natural language inference, and [00:02:55] so on, whereas natural language [00:02:57] generation means that the task output is [00:02:59] in natural language. so NLG focuses on [00:03:03] systems that produce fluent, coherent, and [00:03:06] useful language outputs for humans to use. [00:03:09] historically there are many NLG [00:03:12] systems that use rule-based approaches, such [00:03:15] as templates or infilling, but nowadays [00:03:18] deep learning is powering almost every [00:03:20] text generation system, so this lecture [00:03:23] today will be mostly focused on deep [00:03:25] learning approaches. [00:03:27] so, um, first, what are some examples of [00:03:30] natural language generation? it's [00:03:32] actually everywhere, including our [00:03:33] homework. machine translation is a form [00:03:36] of NLG, where the input is some text [00:03:38] in the source language and the output is [00:03:41] generated text in a target language. [00:03:44] digital assistants such as Siri or [00:03:46] Alexa are also NLG systems: they [00:03:50] take in dialogue history and generate [00:03:52] continuations of the conversation. [00:03:55] um, there are also summarization systems [00:03:57] that take in a long document, such as a [00:04:00] research article, and then the idea is [00:04:02] trying to summarize it into a few [00:04:04] sentences that are easy to read. [00:04:07] so beyond these classic tasks there are [00:04:09] some more interesting uses, like creative [00:04:12] story writing, where you can prompt a [00:04:14] language model with a story plot and [00:04:16] then it will give you some creative [00:04:18] stories that are aligned with the plot. [00:04:20] there is data-to-text, where you [00:04:22] give the language model some database or [00:04:24] some tables, and then the idea is that it [00:04:27] will output some textual description of [00:04:29] the table content.
[00:04:30] and finally there are also, like, visual [00:04:32] description based NLG systems, like image [00:04:35] captioning or, like, image-based [00:04:38] storytelling. [00:04:40] so a really cool example, [00:04:43] um, is the popular ChatGPT model. so [00:04:46] ChatGPT is also an NLG system; it is [00:04:49] very general purpose, so therefore you [00:04:51] can use it to do many, many different [00:04:54] tasks with different prompts. for example, [00:04:56] we can use ChatGPT to simulate a [00:04:59] chatbot; it can answer [00:05:01] questions about, like, creative gifts for [00:05:04] a 10-year-old. [00:05:05] it can be used to do poetry generation; [00:05:08] like, for example, we can ask it to [00:05:11] generate a poem about sorting algorithms, [00:05:12] and, well, I wouldn't say [00:05:15] it's very poetic, but at least it has the [00:05:17] same format as a poem, and the content is [00:05:19] actually correct. [00:05:22] so, um, ChatGPT can also be used [00:05:25] in some really useful settings, like [00:05:28] web search. so here Bing is augmented [00:05:31] with ChatGPT, [00:05:33] and there are some tweets saying that the magic [00:05:34] of ChatGPT is that it actually makes [00:05:36] people be happy to use Bing. [00:05:38] um, [00:05:42] so there are so many tasks that actually [00:05:44] belong to the NLG category, so how do we [00:05:47] categorize these tasks? one common way is [00:05:49] to think about the open-endedness of the [00:05:51] task. so here we draw a line for the [00:05:54] spectrum of open-endedness. on the one [00:05:57] end we have tasks like machine [00:05:58] translation and summarization, so we [00:06:01] consider them not very open-ended, [00:06:03] because for each source sentence the [00:06:06] output is almost determined by the input, [00:06:08] because basically we are trying to do [00:06:11] machine translation: the semantics should [00:06:13] be exactly similar to the input sentence. [00:06:15] so there are only a few ways that you [00:06:17] can rephrase the output: like, 'authorities [00:06:19] have announced that today is a national [00:06:21] holiday,' you can rephrase it a little bit [00:06:23] to say 'today is a national holiday, [00:06:25] announced by the authorities,' but the [00:06:27]
actual space is really small, because you [00:06:29] have to make sure the semantics doesn't [00:06:31] change. so we can say that the output [00:06:34] space here is not very diverse. [00:06:37] um, and moving to the middle of the [00:06:39] spectrum, there are dialogue tasks, such as [00:06:41] task-driven dialogue or chit-chat [00:06:43] dialogue. so we can see that for each [00:06:45] dialogue input there are multiple [00:06:47] responses, and the degree of freedom has [00:06:50] increased. here we can say, like, we can [00:06:52] respond by saying 'good, and you?' or we can [00:06:55] say, 'oh, thanks for asking, barely [00:06:57] surviving on my homeworks.' so here we are [00:07:00] observing that there are actually [00:07:01] multiple ways to continue this [00:07:03] conversation, and then this is where we [00:07:05] say the output space is getting more and [00:07:07] more diverse. [00:07:09] and on the other end of the spectrum [00:07:12] there are very open-ended generation [00:07:14] tasks, like story generation. so given an [00:07:17] input like 'write me a story about three [00:07:19] little pigs,' there [00:07:21] are so many ways to continue the prompt, right? we can write [00:07:22] about them going to school, building [00:07:24] houses like they always do. [00:07:26] um, so the valid output space here is extremely [00:07:29] large, and we call this open-ended [00:07:31] generation. [00:07:33] so it's hard to really draw a boundary [00:07:35] between open-ended and non-open-ended [00:07:38] tasks, but we still try to give a rough [00:07:40] categorization. so open-ended [00:07:42] generation refers to tasks whose output [00:07:44] distribution has a high degree of [00:07:46] freedom, whereas non-open-ended generation [00:07:49] refers to tasks where the input [00:07:52] will almost certainly determine the [00:07:55] output generation. examples of non-open-ended [00:07:58] generation are machine [00:07:59] translation, summarization, and examples [00:08:01] of open-ended generation are story [00:08:03] generation, chit-chat dialogue, task-oriented [00:08:05] dialogue, etc. [00:08:07] so how do we formalize this [00:08:10] categorization? one way of formalizing is [00:08:12] by computing the entropy of the NLG [00:08:15] system.
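The entropy idea here is easy to make concrete. Below is a minimal, self-contained illustration (not from the lecture: the 4-token vocabulary and both distributions are invented numbers): a peaked next-token distribution, as in machine translation, has low Shannon entropy, while a flat distribution over many equally valid continuations, as in story generation, has the maximum entropy.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
mt_dist = [0.90, 0.05, 0.03, 0.02]      # translation-like: output nearly determined
story_dist = [0.25, 0.25, 0.25, 0.25]   # story-like: many equally valid continuations

print(entropy(mt_dist))     # low entropy -> less open-ended
print(entropy(story_dist))  # 2.0 bits, the maximum for 4 outcomes -> more open-ended
```

In practice one would average such per-step entropies over many model predictions, rather than look at a single hand-picked distribution.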
so high entropy means that we [00:08:21] are to the right of the spectrum, so it is more open-ended, and low entropy means [00:08:23] that we are to the left of the spectrum [00:08:25] and less open-ended. [00:08:27] so these two classes of NLG tasks [00:08:30] actually require different decoding and [00:08:32] training approaches, as we'll talk about [00:08:34] later. [00:08:36] okay, cool, now let's recall some previous [00:08:39] lectures and review the NLG models and [00:08:41] trainings that we have studied before. [00:08:44] so I think we discussed the basics of [00:08:46] natural language generation. so here is [00:08:49] how an autoregressive language model [00:08:50] works: at each time step, our model would [00:08:53] take in a sequence of tokens as input, [00:08:56] and here it is y less than t, and the [00:08:59] output is basically the new token y t. so [00:09:03] to decide on y t, we first use the model [00:09:06] to assign a score for each token in the [00:09:08] vocabulary, denoted as S, and then we [00:09:11] apply softmax to get the next-token [00:09:13] distribution P, and we choose a token [00:09:16] according to this next-token distribution.
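The loop just described (score every vocabulary token, softmax into a next-token distribution P, then let a decoding algorithm g pick a token and feed it back in) can be sketched in a few lines. This is a toy stand-in, not the lecture's actual model: the five-token vocabulary and the bigram score table are invented for illustration, and g is taken here to be greedy argmax selection, one simple choice of decoding algorithm.

```python
import math

VOCAB = ["<s>", "I", "like", "NLP", "</s>"]

def score(prefix):
    """Stand-in for a trained LM: assign a score to every vocab token.
    A real model conditions on the whole prefix y_{<t}; this toy table
    only looks at the last token (all numbers are made up)."""
    bigram_scores = {
        "<s>":  [0.0, 3.0, 0.1, 0.1, 0.1],
        "I":    [0.0, 0.1, 3.0, 0.5, 0.1],
        "like": [0.0, 0.1, 0.1, 3.0, 0.5],
        "NLP":  [0.0, 0.1, 0.1, 0.1, 3.0],
        "</s>": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
    return bigram_scores[prefix[-1]]

def softmax(s):
    exp = [math.exp(x) for x in s]
    z = sum(exp)
    return [e / z for e in exp]

def generate(max_len=10):
    ys = ["<s>"]
    for _ in range(max_len):
        p = softmax(score(ys))        # next-token distribution P
        yt = VOCAB[p.index(max(p))]   # g = greedy: pick the argmax token
        ys.append(yt)                 # feed the prediction back in
        if yt == "</s>":              # stop at end of sequence
            break
    return ys

print(generate())  # ['<s>', 'I', 'like', 'NLP', '</s>']
```

Swapping g for sampling from P instead of taking the argmax changes the decoding algorithm without touching the model at all.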
[00:09:19] and in summary, once we have predicted y t [00:09:21] hat, we then pass it back into the [00:09:22] language model as the input, predict y [00:09:25] hat t plus 1, and then we do so [00:09:27] recursively until we reach the end of [00:09:30] the sequence. [00:09:31] so, any questions so far? [00:09:35] okay, good. [00:09:37] um, so for the two types of NLG tasks [00:09:40] that we talked about, like the open-ended and [00:09:42] non-open-ended tasks, they tend to prefer [00:09:44] different model architectures. so for [00:09:47] non-open-ended tasks like machine [00:09:48] translation, we typically use an encoder-decoder [00:09:51] system, where, like, the autoregressive [00:09:53] decoder that we just talked [00:09:55] about functions as the decoder, and then [00:09:57] we have another bidirectional encoder [00:09:59] for encoding the inputs. so this is kind [00:10:01] of what you implemented for assignment [00:10:03] four, because the encoder is, like, the [00:10:06] bidirectional LSTM and the decoder is [00:10:09] another LSTM that is autoregressive. [00:10:12] so for more open-ended tasks, typically [00:10:15] an autoregressive generation model is the [00:10:18] only component. [00:10:20] um, of course, like, these architectures are [00:10:22] not really hard constraints, because an [00:10:25] autoregressive decoder alone can also be [00:10:27] used to do machine translation, and an [00:10:29] encoder-decoder model can also be used [00:10:31] for story generation. [00:10:33] so this is kind of the convention for [00:10:35] now, but it's a reasonable convention, [00:10:37] because, like, using a decoder-only model [00:10:39] for MT tends to hurt performance [00:10:42] compared to an encoder-decoder model for [00:10:44] MT, and using an encoder-decoder model [00:10:46] for open-ended generation seems to, like, [00:10:49] achieve similar performance to a decoder-only [00:10:51] model, and therefore if you have the [00:10:54] compute budget to train an encoder-decoder [00:10:55] model, you might just be better [00:10:57] off by only training a larger decoder [00:10:59] model. so it's kind of more of an [00:11:01] allocation-of-resources problem than [00:11:03] whether these two architectures will [00:11:05] type-check with your task. [00:11:08] so, [00:11:09] um, okay, so how do we train such a [00:11:12] language model? in previous lectures we [00:11:14] talked about how language models [00:11:16] are trained by maximum likelihood: so [00:11:19] basically we are trying to maximize the [00:11:21] probability of the next token y t [00:11:24] given the preceding words, and this is [00:11:26] our optimization objective. so at each [00:11:30] time step, this can be regarded as a [00:11:32] classification task, because we are [00:11:34] trying to distinguish the actual word [00:11:36] y t star from all the remaining words in [00:11:39] the vocabulary. [00:11:40] and this is also called teacher forcing, [00:11:42] because at each time step we are [00:11:45] using the gold standard, uh, y star [00:11:48] less than t, as input to the model, [00:11:51] whereas presumably at generation time [00:11:54] you wouldn't have any access to y star, [00:11:56] so you would have to use the model's own [00:11:58] prediction to feed back into the model [00:12:00] to generate the next token, and that is [00:12:02] called student forcing, which we'll talk [00:12:04] about in detail later.
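The maximum-likelihood / teacher-forcing objective can be written down directly: at each step t, condition on the gold prefix y*_{<t} and pay the cross-entropy of the gold next token y*_t. Below is a minimal sketch with a made-up stand-in for the model's next-token distribution (a trained LM would of course compute a different distribution for every prefix).

```python
import math

def nll_teacher_forcing(next_token_dist, gold_tokens):
    """Teacher forcing: at every step t, condition on the GOLD prefix
    y*_{<t} and score the gold next token y*_t -- a per-step
    classification over the vocabulary. Returns the total negative
    log-likelihood, which training minimizes."""
    loss = 0.0
    for t in range(1, len(gold_tokens)):
        p = next_token_dist(gold_tokens[:t])   # distribution given the gold prefix
        loss += -math.log(p[gold_tokens[t]])   # cross-entropy of the gold token
    return loss

# Made-up stand-in for the model: a fixed next-token distribution.
def toy_dist(prefix):
    return {"I": 0.2, "like": 0.3, "NLP": 0.4, "</s>": 0.1}

gold = ["<s>", "I", "like", "NLP", "</s>"]
print(nll_teacher_forcing(toy_dist, gold))  # ~6.03 = -(ln 0.2 + ln 0.3 + ln 0.4 + ln 0.1)
```

Student forcing differs only in what is fed in: the model's own earlier predictions replace the gold prefix y*_{<t}.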
will talk in detail later [00:12:12] we never used that word before what does [00:12:15] we never used that word before what does it mean Ultra aggressive oh this means [00:12:17] it mean Ultra aggressive oh this means like uh so let's look at this animations [00:12:20] like uh so let's look at this animations again [00:12:22] again oops sorry oh it just looks like uh you [00:12:24] oops sorry oh it just looks like uh you are generating word from left to right [00:12:26] are generating word from left to right one by one so here suppose that you are [00:12:28] one by one so here suppose that you are given a y less than T and then other [00:12:31] given a y less than T and then other aggressive for your first general YT and [00:12:33] aggressive for your first general YT and then once you have YT you'll fit it back [00:12:34] then once you have YT you'll fit it back in general YT plus one and then feed it [00:12:38] in general YT plus one and then feed it back and generate another thing so this [00:12:39] back and generate another thing so this left to right nature because you are [00:12:41] left to right nature because you are using chain rule to like condition on [00:12:43] using chain rule to like condition on the the tokens that you just generated [00:12:45] the the tokens that you just generated this chain rule thing is called Auto [00:12:47] this chain rule thing is called Auto regressive and typically like I think [00:12:50] regressive and typically like I think conventionally we are doing left to [00:12:51] conventionally we are doing left to right other aggressive by generating [00:12:52] right other aggressive by generating from left to right but there are also [00:12:54] from left to right but there are also like other more interesting models that [00:12:56] like other more interesting models that can do backward or influence and other [00:12:58] can do backward or influence and other things this idea of generating one token [00:13:00] things this idea of 
[00:13:04] Cool — any other questions? Yep. So at inference time, our decoding algorithm will define a function to select a token from this distribution. We've discussed that we can use the language model to compute this P, the next-token distribution, and then g here, in our notation, is the decoding algorithm, which selects the token we are actually going to use as y_t. The obvious decoding algorithm is to greedily choose the highest-probability token as ŷ_t at each time step. Well, this basic algorithm sort of works — it worked for your homework — but to do better, there are two main avenues we can take: we can decide to improve decoding, and we can also decide to improve training. Of course there are other things we can do — we can improve the training data, and we can improve the model architecture — but for this lecture we will focus on decoding and training.
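The pipeline just described — compute scores, softmax them into a distribution P, then apply a decoding function g — can be sketched as follows, with greedy argmax used for g. The helper names are mine, not the course's code:

```python
import math

def softmax(scores):
    """Turn a score vector S into a probability distribution P."""
    m = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def g_greedy(p):
    """Decoding function g: select a token index from P.
    Greedy decoding simply takes the highest-probability token."""
    return max(range(len(p)), key=lambda i: p[i])
```

Every decoding algorithm in this lecture is a different choice of g over the same softmax-normalized distribution.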
[00:14:01] So now let's talk about how decoding algorithms work for natural language generation models. Before that, I'm happy to take any questions about the previous slides. [Student question about teacher forcing.] I think I'll go into this in detail later, but sure. Basically, for teacher forcing, the idea is that you train the language model where you already observe the gold text: you use the gold text up until timestep t, put it into the model, and the model tries to predict y_{t+1}. Whereas student forcing means you don't have access to this gold reference data, but you are still trying to generate a sequence, so you have to use the text that you generated yourself with the model and feed it back into the model as input to predict step t+1. That's the primary difference.
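The teacher-forcing vs. student-forcing distinction can be sketched like this; `model_step` is a hypothetical stand-in for a real language-model call:

```python
def teacher_forcing_contexts(gold):
    """Training: at each step t the model conditions on the GOLD prefix
    y_1..y_t and is asked to predict y_{t+1}."""
    return [gold[:t] for t in range(1, len(gold))]

def student_forcing_generate(model_step, start, n_steps):
    """Generation: at each step the model conditions on its OWN previous
    outputs, since no gold reference is available."""
    seq = [start]
    for _ in range(n_steps):
        seq.append(model_step(seq))   # the model's own token is fed back in
    return seq
```
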
[00:15:01] Cool. So what is decoding all about? At each time step, our model computes a vector of scores for each token: it takes in the preceding context y<t and produces a score S. Then we compute the probability distribution P over these scores by applying softmax to normalize them. And our decoding algorithm is defined as this function g, which takes in the probability distribution and tries to map it to some word — basically, it selects a token from this probability distribution. In the machine translation lecture we talked about greedy decoding, which selects the highest-probability token from this distribution P. And we also talked about beam search, which has the same objective as greedy decoding: both are trying to find the most likely string as defined by the model.
[00:15:57] But instead of doing so greedily, with beam search we actually explore a wider range of candidates, by always keeping k candidates in the beam. So overall, this maximum-probability decoding is good for low-entropy tasks like machine translation and summarization, but it encounters more problems for open-ended generation: the most likely string is actually very repetitive when we try to do open-ended text generation. As we can see in this example, the context is perfectly normal — it's about a unicorn trying to speak English — and the first part of the continuation looks great: it's valid English, it talks about science. But suddenly it starts to repeat, and it repeats, I think, an institution's name. So why does this happen?
[00:16:56] If we look at, for example, this plot, which shows the probability the language model assigns to the sequence "I don't know," we can see the pattern. It has a regular probability at first, but if we keep repeating this phrase — "I don't know, I don't know, I don't know," ten times — then we can see a decreasing trend in the negative log-likelihood. The y-axis is the negative log probability, and this decreasing trend means that the model actually assigns higher and higher probability as the repetition goes on. This is quite strange, because it suggests there is a self-amplification effect: the more repeats we have, the more confident the model becomes about the repeat. And this keeps going — for "I am tired" repeated 100 times, there is a continuously decreasing trend until the model is almost 100% sure that it's going to keep repeating the same thing.
[00:17:48] And sadly, this problem is not really solved by architecture. Here the red curve is an LSTM model and the blue curve is a Transformer model, and we can see that both models suffer from the same problem. Scale also doesn't solve it — we tend to believe that scale is the magical thing in NLP, but even models with 175 billion parameters will still suffer from repetition if we try to find the most likely string. So how do we reduce repetition? One canonical approach is n-gram blocking. The principle is very simple: you just don't want to see the same n-gram twice. If we set n to be 3, then for any text that contains the phrase "I am happy," the next time you see the prefix "I am," n-gram blocking will automatically set the probability of "happy" to zero, so that you will never see this trigram again.
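A minimal sketch of the trigram-blocking rule just described (n = 3); the function name and the renormalization step are my own choices, not a specific library's API:

```python
def block_ngrams(probs, vocab, generated, n=3):
    """Zero out any token that would complete an already-seen n-gram.
    With n=3: whenever the last n-1 generated tokens have occurred before,
    ban every token that previously followed that (n-1)-gram."""
    if len(generated) < n - 1:
        return probs
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - (n - 1)):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])       # token that followed before
    out = [0.0 if tok in banned else p for p, tok in zip(probs, vocab)]
    z = sum(out)
    return [p / z for p in out] if z > 0 else out  # renormalize what remains
```

For example, after generating "I am happy . I am", the token "happy" gets probability zero the next time the prefix "I am" appears.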
[00:18:46] But clearly this n-gram blocking heuristic has some problems, because sometimes it is quite common to want a person's name to appear twice or three times or even more in the text, and n-gram blocking eliminates that possibility. So what are better — possibly more complicated — options? For example, we can use a different training objective: instead of training by MLE, we can train with an unlikelihood objective. In this approach, the model is penalized for generating already-seen tokens, so it's kind of like moving the n-gram blocking idea into training time: rather than enforcing the constraint at decoding time, we decrease the probability of repetition during training. Another training objective is coverage loss, which uses the attention mechanism to prevent repetition.
[00:19:36] Basically, if you regularize your attention so that it is always attending to different words for each token, then it is highly likely that you are not going to repeat, because repetition tends to happen when you have similar attention patterns. A different angle: instead of searching for the most likely string, we can use a different decoding objective. Maybe we can search for strings that maximize the difference between the log probabilities of two models — say, maximize log p of a large model minus log p of a small model. Because both models are repetitive — they would both assign high probability to repetition — the repetition cancels out, and after applying this new objective the repetitive continuations are actually penalized. So here comes the broader question.
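Both ideas can be sketched in a few lines: a single-step unlikelihood term that adds a -log(1 − p) penalty for already-generated tokens, and a contrastive-style score that subtracts a small model's log-probabilities from a large model's. These are simplified one-step illustrations, not the full training or decoding procedures:

```python
import math

def unlikelihood_loss(p_next, gold_idx, seen_indices):
    """Usual MLE term for the gold token, plus -log(1 - p) penalties that
    push down the probability of previously generated (repeated) tokens."""
    mle = -math.log(p_next[gold_idx])
    penalty = -sum(math.log(1.0 - p_next[i]) for i in seen_indices)
    return mle + penalty

def contrastive_score(p_large, p_small):
    """Score tokens by log p_large - log p_small; behavior shared by both
    models (such as repetition) cancels out, so it stops being favored."""
    return [math.log(pl) - math.log(ps) for pl, ps in zip(p_large, p_small)]
```
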
[00:20:27] Is finding the most likely string even a reasonable thing to do for open-ended text generation? The answer is probably no, because it doesn't really match the human pattern. We can see in this plot that the orange curve is the human pattern and the blue curve is machine-generated text using beam search. With human text there is actually a lot of uncertainty, as we can see from the fluctuation of the probabilities: for some words we are very certain, for some words we are a little bit unsure. Whereas the model distribution is always very sure — it is always assigning probability one to the sequence. So since there is an obvious mismatch between the two distributions, it suggests that maybe searching for the most likely string is not the right decoding objective at all. Any questions so far before we move on? Yeah?
[00:21:19] [Student asks whether this could serve as a detector of machine-generated text.] Not really, because this can only detect the really simple things that humans are also able to detect, like repetition. To avoid the problems that we've talked about, I'll talk about some other decoding families that generate more robust text — text whose probability distribution actually looks like the orange curve — so I wouldn't say this is the go-to answer for watermarking or detection. Oh yeah — okay, cool, so she asked whether this mechanism of plotting the probabilities of human text and machine-generated text is one way of detecting whether some text was generated by a model or a human. My answer is I don't think so, but it could be an interesting research direction.
[00:22:16] Because I feel like there are more robust decoding approaches that generate text that actually fluctuates a lot. So yeah, let's talk about decoding algorithms that are able to generate text that fluctuates. Given that searching for the most likely string is a bad idea, what else should we do, and how do we simulate that human pattern? The answer is to introduce randomness and stochasticity into decoding. So suppose that we are sampling a token from this distribution P — we are trying to sample ŷ_t from the distribution. It is random, so you can essentially sample any token: previously you were restricted to selecting the single most likely token, but now you can select a lower-probability word like "bathroom" instead. However, sampling introduces a new set of problems: since we never zero out any token probabilities, vanilla sampling makes every token in the vocabulary a viable option.
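Sampling ŷ_t from P can be sketched as a plain inverse-CDF draw over the full distribution; the helper name is mine:

```python
import random

def sample_token(probs, rng=random):
    """Vanilla sampling: draw a token index from the full distribution P.
    Every token with nonzero probability is a possible outcome."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1   # guard against floating-point round-off
```
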
[00:23:16] And in some unlucky cases we might end up with a bad word. Even assuming that we already have a very well-trained model — even if most of the probability mass of the distribution is over a limited set of good options — the tail of the distribution will still be very long, because we have so many words in our vocabulary. Therefore, if we add up all those wrong tail tokens, in aggregate they still have considerable mass. Statistically speaking, this is called a heavy-tailed distribution, and language is exactly a heavy-tailed distribution. For example, many tokens are probably really wrong in a given context, and given that we have a good language model, we assign each of them very little probability. But this doesn't really solve the problem: there are so many of them that, aggregated as a group, they still have a high chance of being selected.
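The aggregate-tail point can be checked numerically with a made-up Zipf-like distribution; the vocabulary size and shape here are illustrative, not measurements of any real model:

```python
def tail_mass(probs, head_size):
    """Total probability carried by everything outside the top tokens."""
    return sum(sorted(probs, reverse=True)[head_size:])

# Zipf-like toy distribution over a 50,000-token vocabulary.
probs = [1.0 / (i + 1) for i in range(50_000)]
z = sum(probs)
probs = [p / z for p in probs]
# Every tail token is individually tiny (probability < 0.001), yet the tail
# past the top 100 tokens still carries over half of the total mass.
```
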
[00:24:07] And the solution we have for this long-tail problem is that we should just cut off the tail — we should just zero out the probabilities we don't want. One such idea is top-k sampling, where we only sample from the top k tokens in the probability distribution. Any questions for now? [Student] The model we were looking at a second ago had some really low-probability samples as well on the graph, right? How would this cope with that? [Instructor] You mean this one, or the orange-and-blue graph of human versus machine? Oh yeah — so top-k will make it impossible to generate the super-low-probability tokens. So technically it's not exactly simulating that pattern, because now the super-low-probability tokens are gone.
[00:25:09] Whereas a human can generate super-low-probability tokens in a fluent way — but yeah, that could be another hint people can use for detecting machine-generated text. Yeah? [Student asks whether it depends on the type of text you want to generate — for example poems, novels, or more creative writing — and how you would adjust the hyperparameter.] Yeah, for sure — k is a hyperparameter, and depending on the type of task you will choose k differently: maybe for closed-ended tasks k should be small, and for open-ended tasks k should be large. Yeah, question in the back. [Student] I guess intuitively this builds on one of the earlier questions: why don't we consider the case where we sample and weight the probability of each word by its score, rather than just looking at the top k — a weighted-sampling type of situation, so we still have a small but non-zero probability of selecting those words?
[00:26:00] [Instructor] I think top-k is also weighted like that: top-k just zeros out the tails of the distribution, but for the tokens it didn't zero out, it's not a uniform choice among the k — it is still choosing proportionally to the scores you computed. [Student] Is it just computational, then? Like, 17,000 words could be cut down to 10 or something. [Instructor] Yeah, sure, that could be one gain of top-k decoding — your softmax will take in fewer candidates — but it's not the main reason, and I'll keep talking about the main reasons. So we've discussed this part, and here is, formally, what is happening for top-k sampling: we are now only sampling from the top k tokens of the probability distribution.
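A sketch of that truncation step — keep the k highest-probability tokens, zero out the rest, and renormalize before sampling; the helper name is mine:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, zero the rest, and
    renormalize. Sampling then proceeds over the truncated distribution,
    still proportionally to the surviving probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    out = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(out)
    return [p / z for p in out]
```
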
[00:27:04] As we've said, k is a hyperparameter, so we can set k to be large or small. If we increase k, we are making our output more diverse, but at the risk of including some bad tokens; if we decrease k, we are making more conservative and safe choices, but possibly the generation will be quite generic and boring. So, is top-k decoding good enough? The answer is: not really, because we can still find some problems with it. For example, in the context "She said, 'I never ___'," there are many words that are still valid options — such as "want" or "ate" — but those words got zeroed out because they are not within the top k candidates; top-k can cut off too quickly, and this leads to bad recall for your generation system. Similarly, another failure of top-k is that it can also cut off too slowly. In this example, "code" is not really a valid answer according to common sense, because you probably don't want to eat a piece of code.
[00:28:01] But its probability remains non-zero, meaning that the model might still sample "code" as an output — despite the low probability, it might still happen — and this means bad precision for the generation model. So, given these problems with top-k decoding, how can we address them? How can we address the issue that there is no single k that fits all circumstances? This is basically because the probability distributions we sample from are dynamic. When the probability distribution is relatively flat, a small k removes many viable options, and we want k to be larger in that case. Similarly, when the distribution P is very peaked, a high k would allow too many options to be viable, and instead we might want a smaller k so that we are being safer.
that we are being safer.

[00:29:01] So the solution here is that maybe k is just a bad hyperparameter, and instead of a fixed k we should think in terms of probability: we should sample from the tokens in the top p percentile of the cumulative probability mass (of the CDF, for example).

[00:29:20] The advantage of top-p sampling, where we sample from the top p percentile of the cumulative probability mass, is that it is equivalent to having an adaptive k for each different distribution. Let me explain what I mean by an adaptive k. [00:29:39] The first distribution is a regular power law of language, which is fairly typical; doing top-k sampling means we select the top k, and doing top-p sampling means we are zooming into
something that's similar to top-k. But if I have a relatively flat distribution like the blue one, we can see that doing top-p means we include more candidates; and if we have a more skewed distribution like the green one, doing top-p means we actually include fewer candidates. So by selecting the top p percentile of the probability distribution, we effectively have a more flexible k, and therefore a better sense of what the good options are under the model. Any questions about top-p and top-k decoding?

[00:30:32] So everything's clear? Yeah, sounds good.

[00:30:36] So, to go back to that question: top-k is not necessarily saving compute; this whole idea is not really intended to save compute, because in the case of top-p, in order to select the top p percentile we still need to compute the softmax over the entire vocabulary set.
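The two truncation rules just described can be illustrated with a minimal NumPy sketch (my own, not code from the lecture; it assumes `probs` is already a full softmax distribution over the vocabulary):

```python
import numpy as np

def top_k_filter(probs, k):
    # keep only the k most probable tokens, zero out the rest, renormalize
    filtered = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]
    filtered[top] = probs[top]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # keep the smallest set of top tokens whose cumulative mass reaches p
    # (this is the "adaptive k"), zero out the tail, renormalize
    order = np.argsort(probs)[::-1]              # token indices, most probable first
    cum = np.cumsum(probs[order])
    # keep token i if the mass accumulated *before* it is still below p
    keep = np.concatenate(([True], cum[:-1] < p))
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()
```

On a flat distribution the `cum[:-1] < p` test keeps many tokens; on a peaked one it keeps few, which is exactly the adaptive-k behavior described above.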
To do top-p properly, we have to compute p properly, so it's not really saving compute, but it is improving performance.

[00:31:05] Moving on: there is much more to decoding algorithms besides the top-k and top-p we've discussed. There are some more recent approaches like typical sampling, where the idea is to rescale the scores based on the entropy of the distribution and try to generate text whose probability is close to the negative entropy of the data distribution. This roughly means that for a closed-ended (non-open-ended) task you want smaller entropy, so you want the negative log probability to be smaller, so you want probabilities to be larger; it checks out quite well. Additionally, there is also epsilon sampling, coming from
John. This is an idea where we set a threshold to lower-bound the probabilities: basically, if a word's probability is less than 0.03, for example, then that word will never appear in the output; it will never be part of your output because it still has very low probability.

[00:32:09] Yeah? [00:32:13] Oh cool, great question. The entropy of a distribution is defined as follows: suppose we have a discrete distribution; we enumerate over x and take the negative log probability, so H(p) = −Σ_x p(x) log p(x). Written from an expectation perspective, it's the expectation of the negative log probability, H(p) = E[−log p(x)]. Okay, I have to do a little bit here. So this is the entropy of a distribution. [00:32:44] Basically, if your distribution is very concentrated on a few words, then the entropy will be relatively small.
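A small sketch of the two ideas above, the epsilon threshold and the entropy of a discrete distribution (illustrative only; the 0.03 threshold is the example value from the lecture, and the back-off behavior is my assumption based on the Q&A that follows):

```python
import numpy as np

def entropy(probs):
    # H(p) = E[-log p(x)] = -sum_x p(x) * log p(x)
    probs = probs[probs > 0]                 # skip zero entries to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

def epsilon_filter(probs, eps=0.03):
    # drop every token whose probability is below the threshold eps, renormalize
    filtered = np.where(probs >= eps, probs, 0.0)
    if filtered.sum() == 0.0:
        # assumed back-off: if everything was dropped, keep the single best token
        filtered[np.argmax(probs)] = probs.max()
    return filtered / filtered.sum()
```

A peaked distribution gives a small entropy and a flat one a large entropy, matching the remark above.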
If your distribution is very flat, then your entropy will be very large.

[00:32:58] Yeah? [Student asks whether epsilon sampling can leave no valid options.] Oh yeah, I think there are back-off cases: in the case that there are no valid options, you probably still want to select one or two things, just as an edge case.

[00:33:20] Okay, cool. Moving on: another hyperparameter we can tune to affect decoding is the temperature. Recall that previously, at each time step, we asked the model to compute a score and then renormalized that score using softmax to get a probability distribution. One thing we can adjust here is to insert a temperature parameter τ to rescale the scores: we just divide all the scores by τ, and after dividing we apply softmax to get a new distribution. [00:33:55] And this
temperature adjustment is not really going to affect the monotonicity of the distribution: if word a had higher probability than word b before, then after the adjustment word a will still have higher probability than word b; but their relative difference will change.

[00:34:15] For example, if we raise the temperature τ above one, the distribution P_τ becomes more uniform, i.e., flatter, which implies more diverse output, because the distribution is more spread out across the different words in the vocabulary. On the other hand, if we lower the temperature τ below one, P_τ becomes very spiky, which means that sampling from P_τ gives less diverse output, because the probability is concentrated on only the top words. [00:34:51] In the very extreme case,
if we set τ very, very close to zero, the distribution essentially becomes a one-hot vector, with all the probability mass centered on one word; this reduces back to argmax sampling, i.e., greedy decoding.

[00:35:07] So temperature is a hyperparameter for decoding, just as k and p are for top-k and top-p; it can be tuned for both beam search and sampling algorithms, so it's orthogonal to the approaches we discussed before. Any questions so far? [00:35:28] Okay cool, temperature is so easy.

[00:35:33] Well, sampling still involves randomness: even though we try very hard at truncating the tail, sampling still has randomness. So what if we're just unlucky and decode a bad sequence from the model? One common solution is re-ranking: basically, we decode a bunch of sequences; for example, we can decode 10 candidates.
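The temperature rescaling described above can be sketched as follows (a minimal illustration, not the lecture's code):

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    # divide all the scores by tau, then apply softmax:
    #   tau > 1  -> flatter distribution, more diverse samples
    #   tau < 1  -> spikier distribution, less diverse samples
    #   tau -> 0 -> approaches a one-hot argmax (greedy decoding)
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Note that dividing by τ never reorders the tokens, so the monotonicity property mentioned above is preserved for any τ > 0.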
Whether it's 10 or 30 candidates is up to you; the choice is about balancing compute efficiency against performance. If you decode too many sequences, your performance will increase, but it's also very costly to generate that much for one example.

[00:36:13] Then, once you have a set of sampled sequences, we define a score that approximates the quality of a sequence and re-rank all the candidates by this score. The simple thing to do is to use perplexity as the scoring function, but we need to be careful: as we discussed for the extremes of perplexity, if we argmax the log probability, aiming for an extremely good perplexity, the texts actually become very repetitive. So we shouldn't
really aim for extremely low perplexity, and perplexity is, to some extent, not a perfect scoring function, because it's not robust to maximization.

[00:36:58] Alternatively, re-rankers can use a wide variety of other scoring functions: we can score text based on style, discourse coherence, entailment, factuality properties, consistency, and so on. Additionally, we can compose multiple re-rankers together. Yeah, questions?

[00:37:22] [Student]: With 10 candidates, or any number of candidates, what's the strategy usually used to generate the other candidates? Oh yeah, the idea is to sample from the model: each time you sample, you get a different output, and that's what I mean by different candidates. So if
you sample 10 times, you will very likely get 10 different outputs. Then, given these 10 different outputs from sampling, you just re-rank them and select the candidate that has the highest score.

[00:37:59] [Student asks why the outputs differ.] Oh, because we are sampling here. Yeah, for example, if you are doing top-3 sampling, then if A and B are equally probable, you might sample A or sample B with the same probability.

[00:38:14] Okay cool. Another cool thing we can do with re-ranking is composing multiple re-rankers: suppose you have a scoring function for style and a scoring function for factual consistency; you can just add those two scoring functions together to get a new scoring function, and then re-rank everything based on the new scoring
function, to get texts that are both good in style and good in factual consistency.

[00:38:42] [Student]: Do we just pick the decoding that has the highest score, or do we do some more sampling based on the score? The idea is that you just take the decoding with the highest score: you already have, say, 10 candidates, and out of those 10 you only need one, so you choose the one with the highest score.

[00:39:01] Cool, any other questions? [00:39:04] [Student]: Sorry, what is perplexity? Oh yeah, you can roughly regard perplexity as a function of log probability: it's proportional to e to the negative log probability. If a token has high perplexity, it has low probability, because you are more perplexed.
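A minimal sketch of the sample-and-rerank recipe discussed above, including composing multiple re-rankers by summing their scores (`sample_fn` and the scorers are hypothetical stand-ins for a real model and real scoring functions):

```python
def sample_and_rerank(sample_fn, scorers, n=10):
    """Decode n candidates by sampling, score each candidate with the sum of
    the scoring functions, and return the highest-scoring one.

    sample_fn() draws one sequence from the model; each scorer maps a
    sequence to a number (e.g. a style score, a factuality score)."""
    candidates = [sample_fn() for _ in range(n)]
    # composing re-rankers: the total score is just the sum of the scorers
    return max(candidates, key=lambda seq: sum(score(seq) for score in scorers))
```

The `n=10` default mirrors the "decode 10 candidates" example; in practice n trades compute cost against output quality, as noted above.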
[00:39:29] Okay, so taking a step back to summarize this decoding section: we have discussed many decoding approaches, from selecting the most probable string, to sampling, to the various truncation approaches that improve sampling, like top-p, top-k, epsilon, and typical decoding, and finally to re-ranking the results. Decoding is still a really essential problem in NLG, and there is still lots of work to be done here; especially now that ChatGPT is so powerful, we should all go study decoding. It would be interesting if you want to do final projects on this. Also, different decoding algorithms let us inject different inductive biases into the text we are trying to generate, and some of the most impactful advances in NLG in the last couple of years actually come from simple but effective decoding algorithms; for example, the nucleus sampling paper is very highly cited.

[00:40:31] So, moving on to talk about training NLG
models. [00:40:36] We have seen this example before in the decoding slides, and I'm showing it again because, even though we can mitigate this repetition problem by sampling instead of searching, it is still concerning from a language modeling perspective that your model would put so much probability on such repetitive and degenerate text. So we asked: is repetition due to how language models are trained?

[00:41:04] You have also seen this plot before, which shows the decaying pattern, or self-amplification effect. We can conclude from this observation that models trained via the MLE objective have a really bad mode of the distribution (by "mode of the distribution" I mean the argmax of the distribution): basically, they assign high probability to terrible strings, and this is definitely problematic
from a modeling perspective.

[00:41:30] So why is this the case? Shouldn't MLE be the gold standard in machine learning in general, not just machine translation? The answer here is: not really, especially for text, because MLE has a problem for sequential data, and we call this problem exposure bias.

[00:41:50] Training with teacher forcing leads to exposure bias at generation time: during training, our model's inputs are gold context tokens from real, human-generated text, denoted y*_{<t} here; but at generation time, the model's inputs become the previously decoded tokens from the model, ŷ_{<t}. Suppose our model makes minor errors: then ŷ_{<t} will be much worse in quality than y*_{<t}, and this
discrepancy is terrible, because it causes a mismatch between training and test time, which hurts model performance; we call this problem exposure bias.

[00:42:35] People have proposed many solutions to address this exposure bias problem. One is scheduled sampling: with probability p we decode a token and feed it back in as context to train the model, and with probability 1 − p we use the gold token as context. Throughout training we increase p, gradually warming it up to prepare the model for test-time generation. This leads to improvements in practice, because by using this probability p we are gradually narrowing the discrepancy between training and test time; but the objective is quite strange, and training
can be very unstable.

[00:43:23] Another idea is dataset aggregation; the method is called DAgger. Essentially, at various intervals during training, we generate sequences of text from the current model and add those sequences to the training data; we keep doing this training-data augmentation to make the training distribution and the generation distribution closer together.

[00:43:49] So both approaches, scheduled sampling and dataset aggregation, are ways to narrow the discrepancy between training and test. Yes, question? [00:44:00] [Student asks what "gold" means.] It just means human text: when training a language model you will see lots of corpora that are human-written; gold is just human. [00:44:13] Okay, cool.
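The scheduled sampling procedure described above can be sketched roughly as follows (illustrative only; `decode_fn` is a hypothetical stand-in for one decoding step of the model, and in real training the mixing happens inside the training loop while p is warmed up from 0 toward 1):

```python
import random

def scheduled_sampling_context(gold_tokens, decode_fn, p):
    """Build the training context token by token: with probability p feed in
    the model's own decoded token, with probability 1 - p feed the gold token."""
    context = []
    for gold in gold_tokens:
        if context and random.random() < p:
            context.append(decode_fn(context))   # model's own prediction
        else:
            context.append(gold)                 # teacher forcing (gold token)
    return context
```

With p = 0 this is pure teacher forcing; with p = 1 the model conditions almost entirely on its own predictions, as it would at test time.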
Another approach is retrieval-augmented generation: we first learn to retrieve a sequence from some existing corpus of prototypes, and then we train a model to edit the retrieved text by insertion, deletion, or swapping; we can add or remove tokens from the prototype and modify it into another sentence. This doesn't really suffer from exposure bias, because at training time we start from a high-quality prototype, and at test time you don't really have the discrepancy anymore, because you are not generating from left to right.

[00:44:53] Another approach is reinforcement learning, where the idea is to cast the generation problem as a Markov decision process. There is the state s, which is the model's representation of the preceding context; the action a, which is the next token we are trying to pick; the policy, which is the language model (also called the decoder); and the reward r, which is
And the idea here, well, we won't go into details about reinforcement learning and how it works, but we recommend the class CS 234.
[00:45:34] So, in the reinforcement learning context: because reinforcement learning involves a reward function, that's very important, so how do we do reward estimation for text generation? Well, a really natural idea is to just use the evaluation metrics. Because you are trying to do well in terms of evaluation, why not just optimize for the evaluation metrics directly at training time? For example, in the case of machine translation we can use BLEU score as the reward function; in the case of summarization we can use ROUGE score as the reward function.
[00:46:06] But we really need to be careful about optimizing for the task as opposed to gaming the reward, because evaluation metrics are merely proxies for the generation quality.
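As a toy illustration of this idea (entirely my own sketch, not code from the lecture: the five-word vocabulary, the `unigram_f1` reward standing in for BLEU/ROUGE, and the hand-rolled gradient step are all illustrative assumptions), here is a REINFORCE-style policy-gradient loop that pushes a tiny softmax "policy" toward sequences the metric rewards:

```python
# Toy REINFORCE sketch: the reward for a sampled sequence is an evaluation
# metric (here a crude unigram F1, a stand-in for BLEU/ROUGE).
import math
import random

random.seed(0)

VOCAB = ["yes", "no", "heck", "yep", "<eos>"]
# "Policy": one logit per vocabulary item (a stand-in for a decoder's output layer).
logits = {w: 0.0 for w in VOCAB}

def sample_token(logits):
    """Sample a token from the softmax distribution; return (token, log_prob)."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for w, v in logits.items():
        p = math.exp(v) / z
        acc += p
        if r <= acc:
            return w, math.log(p)
    return w, math.log(p)  # floating-point edge case: return the last token

def generate(logits, max_len=4):
    tokens, logps = [], []
    for _ in range(max_len):
        w, lp = sample_token(logits)
        if w == "<eos>":
            break
        tokens.append(w)
        logps.append(lp)
    return tokens, sum(logps)

def unigram_f1(hyp, ref):
    """Crude lexical-overlap reward (illustrative stand-in for BLEU/ROUGE)."""
    if not hyp:
        return 0.0
    overlap = len(set(hyp) & set(ref))
    p, r = overlap / len(set(hyp)), overlap / len(set(ref))
    return 2 * p * r / (p + r) if p + r else 0.0

# REINFORCE: move logits in the direction of grad log p(sequence) * reward.
# For a softmax policy, d log p(tokens) / d logit[w] = count(w) - len(tokens) * p(w).
reference = ["heck", "yes"]
lr = 0.5
for _ in range(200):
    hyp, _ = generate(logits)
    reward = unigram_f1(hyp, reference)
    z = sum(math.exp(v) for v in logits.values())
    probs = {w: math.exp(v) / z for w, v in logits.items()}  # snapshot before update
    for w in VOCAB:
        grad = hyp.count(w) - len(hyp) * probs[w]
        logits[w] += lr * reward * grad
```

After a few hundred sampled episodes, the tokens that earn reward ("yes", "heck") accumulate probability mass, which is exactly the gaming risk the lecture warns about: the policy optimizes the proxy metric, not quality.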
[00:46:15] So suppose that you run RL and improve the BLEU score by a lot; but when you run human evaluations, humans might still think that this generated text is no better than the previous one, or even worse, even though it gives you a much better BLEU score. So we want to be careful about this case, and not game the reward.
[00:46:37] So what behaviors can we tie to a reward function? This is about reward design and reward estimation. There are so many things that we can do. We can do cross-modality consistency for image captioning; we can do sentence similarity; sentence simplicity, to make sure that we are generating simple English that is understandable; we can do formality and politeness, to make sure that, I don't know, your chatbot doesn't suddenly yell at you. And the most important one, which is really popular recently, is human preference.
So we should just build a reward model that captures human preference, and this is actually the technique behind the ChatGPT model. So the idea here is that we would ask humans to rank a bunch of generated texts based on their preference, and then we will use this preference data to learn a reward function, which will basically assign a high score to something that humans might prefer and assign a low score to something that humans wouldn't prefer.
[00:47:37] Yeah, question? [Inaudible question about whether this is more expensive.]
[00:47:43] Oh yeah, sure, I mean it is going to be very expensive, but I feel like, compared to all the cost of training models, training like 170-billion-parameter models, I feel like OpenAI and Google, well, they can afford hiring lots of humans to do human annotations and ask their preference. Yeah.
[00:48:04] [Inaudible question about how much preference data is needed.] Yeah, this is a great question.
So I think it's kind of a mystery exactly how much data you need to achieve the level of performance of ChatGPT. But roughly speaking, I feel like, whenever you try to fine-tune a model on some downstream task, and similarly here you are trying to fine-tune your model on human preference, it does need quite a lot of data, maybe on a scale of 50K to 100K. Anthropic actually released some dataset about human preference, and that's roughly the scale that they released, I think, if I remember correctly.
[00:48:36] Yeah, question? [Student:] We talked earlier about how many of the state-of-the-art language models use Transformers as their architecture; how do you apply reinforcement learning to this model?
[00:48:50] Uh, to, what do you mean, to the Transformer model? Yeah, yeah. I feel like, um, reinforcement learning is kind of a modeling tool.
I mean, it's kind of an objective that you are trying to optimize: instead of an MLE objective, now you are optimizing for an RL objective. So it's kind of orthogonal to the architecture choice. The Transformer is an architecture; you just use the Transformer to give you the next-token probability distribution, or to try to estimate the probability of a sequence, and then once you have the probability of a sequence, you pass it into the RL objective that you have. And then, suppose that you are trying to do policy gradient or something, you need to estimate the probability of that sequence, and then you just need to be able to backprop through the Transformer, which is doable.
[00:49:40] Yeah, so I think the questions about architecture and objectives are orthogonal, so even if you have an LSTM you can do it.
If you have a Transformer, you can also do it. Yep.
[00:49:51] Cool, hope I answered that question. Yeah? [Student:] Can you, for this, well, for example, can we build another Transformer to, like, calculate...? Yeah, I think that's exactly what they did. So for example, you would have GPT-3; you use GPT-3 as the generator that generates text, and you kind of have another pre-trained model, which could probably also be GPT-3, but I'm guessing here, that you fine-tune on your human preference. And then once you have a human preference model, you use the human preference model, put it into RL as the reward model, and then use the original GPT-3 as the policy model, and then you apply the RL objectives and update them, so that you will get a new model that's better at everything.
[00:50:40] Okay, cool. Um, yeah, actually, if you are very curious about ChatGPT, I would encourage you to come to the next lecture.
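The preference-based reward model described a moment ago can be sketched with a pairwise ranking loss (a Bradley-Terry-style objective; the toy bag-of-words "reward model" and the example preference pairs below are my own illustrative assumptions, not OpenAI's actual setup):

```python
# Learn a scalar reward from human preference pairs: the model should score
# the preferred response above the dispreferred one.
import math

# Preference data: (preferred response, dispreferred response).
pairs = [
    (["heck", "yes"], ["heck", "no"]),
    (["yes"], ["no"]),
    (["you", "know", "it"], ["no", "way"]),
]

vocab = sorted({w for good, bad in pairs for w in good + bad})
weights = {w: 0.0 for w in vocab}

def reward(text):
    """Toy linear reward model: sum of learned per-token weights
    (a stand-in for a fine-tuned Transformer scoring a whole response)."""
    return sum(weights[w] for w in text)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Pairwise ranking loss: -log sigmoid(reward(preferred) - reward(rejected)).
lr = 0.1
for _ in range(500):
    for good, bad in pairs:
        margin = reward(good) - reward(bad)
        g = 1.0 - sigmoid(margin)  # gradient magnitude shrinks as ranking is satisfied
        for w in good:
            weights[w] += lr * g
        for w in bad:
            weights[w] -= lr * g
```

The learned `reward` then ranks preferred text higher, and it is this learned scorer, not a fixed metric like BLEU, that RLHF plugs in as the reward for the policy model.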
In the next lecture, Jesse will talk about RLHF, which is shorthand for RL using human feedback.
[00:51:00] So, to summarize: teacher forcing is still the main algorithm for training text generation models, and exposure bias causes problems in text generation models. For example, it causes models to lose coherence, and causes models to be repetitive. Models must learn to recover from their own bad samples, by using techniques like scheduled sampling or DAgger. Another approach to reduce exposure bias is to start with good text, like retrieval-plus-generation. And we also discussed how to do training with RL, and this can actually make models learn behaviors that are preferred by humans, or preferred by some metrics.
[00:51:42] So, to be very up to date: in the best language model nowadays, ChatGPT, the training is actually pipelined.
For example, we would first pre-train a large language model on an internet corpus by self-supervision, and this gets you, uh, sorry, GPT-3, which is the original version. And then you would do some sort of instruction tuning: fine-tune the pre-trained language model so that it learns roughly how to follow human instructions. And finally we do RLHF, to make sure that these models are well aligned with human preference. If we started RLHF from scratch, it would probably be very hard for the model to converge, because RL is hard to train for text data, etc. So RL doesn't really work from scratch, but with all these smart tricks about pre-training and instruction tuning, suddenly they're off to a good start.
[00:52:39] Cool, any questions so far? Okay. Oh yeah?
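The three-stage pipeline just described can be outlined as follows (purely illustrative scaffolding: the function names and dictionary fields are placeholders I made up, and no real training happens here; the point is only the ordering of the stages):

```python
# Illustrative outline of the pipeline: pretrain -> instruction-tune -> RLHF.
def pretrain(corpus):
    """Stage 1: self-supervised next-token pretraining on internet text."""
    return {"stages": ["pretrain"], "data": corpus}

def instruction_tune(model, instruction_data):
    """Stage 2: fine-tune so the model roughly follows human instructions."""
    model["stages"].append("instruction_tune")
    return model

def rlhf(model, preference_data):
    """Stage 3: RL against a reward model learned from human preferences."""
    model["stages"].append("rlhf")
    return model

model = pretrain("internet_corpus")                   # ~ the original GPT-3
model = instruction_tune(model, "instruction_pairs")  # follows instructions
model = rlhf(model, "human_preference_rankings")      # aligned with preferences
```

As the lecture notes, running the last stage without the first two would leave RL to converge from scratch on text data, which in practice does not work.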
[00:52:52] [Inaudible question about the difference between DAgger and scheduled sampling.]
[00:52:54] Uh, you mean the difference between DAgger and scheduled sampling is how long the sequences are? Yeah, I think roughly that is it, because for DAgger you are trying to put in full generated sequences. But I feel like there can be variations of DAgger; DAgger is just a high-level framework and idea, and there can be variations of DAgger that are very similar to scheduled sampling, I think. I feel like scheduled sampling is kind of a smoother version of DAgger, because for DAgger you basically have to say: for this epoch I am generating something, and then after this epoch finishes I put this into the data, and then train for another epoch; whereas DAgger seems to be more flexible in terms of when you add data in.
[00:53:39] Yes? [Student question, partly inaudible, about how feeding the model's own generated sequences back into training actually helps the model.]
[00:53:48] Um, I think that's a good question. I feel like if you regress the model, for example, if you regress the model on its own output, well, I think there should be smarter ways than to exactly regress on your own output. For example, you might still consult some good reference data. Say you ask the model to generate something: say the model generates five tokens, and then, instead of using the model's own generation as the sixth token, you would probably try to find some examples in the training data that would be good continuations, and then you try to plug that in, by connecting the model's generation and some gold text. And therefore you are able to kind of correct the model, even though it probably went off path a little bit by generating its own stuff.
So it's kind of like letting the model learn how to correct itself. But yes, I think you are right: if you just put model generations in the data, it shouldn't really work.
[00:54:53] Yeah, any other questions? Cool. Um, moving on.
[00:55:05] Yes, um, so now we'll talk about how we are going to evaluate NLG systems. There are three types of methods for evaluation: there are content overlap metrics, there are model-based metrics, and there are human evaluations.
[00:55:20] So first, content overlap metrics compute a score based on lexical similarities between the generated text and the gold reference text. The advantage of this approach is that it's very fast and efficient and widely used; for example, BLEU score is very popular in MT, and ROUGE score is very popular in summarization.
So these methods are very popular because they are cheap and easy to run, but they are not really the ideal metrics. For example, simply relying on lexical overlap might miss some rephrasings that have the same semantic meaning, or it might reward text that has a large portion of lexical overlap but actually has the opposite meaning. So you have lots of both false positive and false negative problems.
[00:56:07] So despite all these disadvantages, these metrics are still the go-to evaluation standard in machine translation. Part of the reason is that MT is actually super close-ended; it's very non-open-ended, and therefore it is probably still fine to use BLEU score to measure machine translation.
[00:56:27] And they get progressively worse for tasks that are more open-ended. For example, they get worse for summarization, because as the output text gets longer, the output text becomes much harder to measure.
They are much worse for dialogue, which is more open-ended, and then they are much, much worse for story generation, which is also open-ended. And the drawback here, with the n-gram metrics, is this: suppose that you are generating a story that's relatively long; then, if you are still looking at word overlap, you might actually get very high n-gram scores, not because your text is accurate or of high quality, but just because you are talking so much that you might have covered lots of the points already.
[00:57:14] [Inaudible student comment.] Yes, exactly, that's kind of the next thing that I will talk about, as a better metric for evaluation. But for now, let's do a case study of a failure mode for, uh, BLEU score, for example. So suppose that Chris asks a question: 'Are you enjoying the CS224N lectures?'
The correct answer, of course, is 'Heck yes!'. So if one of the answers is 'Yes', it will get a score of 0.61, because it has some lexical overlap with the correct answer. If you answer 'You know it!', then it gets a relatively lower score, because it doesn't really have any lexical overlap, except for the exclamation mark. And if you answer 'Yep', this is semantically correct, but it actually gets a zero score, because there is no lexical overlap between the gold answer and the generation. If you answer 'Heck no', this should be wrong, but because it has lots of lexical overlap with the correct answer, it actually gets a high score. So these two cases are the major failure modes of lexical n-gram overlap metrics: you get false negatives and false positives.
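These failure modes are easy to reproduce with a toy unigram-F1 score (an illustrative stand-in, not the actual BLEU formula, which uses clipped n-gram precision and a brevity penalty):

```python
# Toy unigram-overlap score illustrating the false-negative and
# false-positive failure modes of lexical overlap metrics.
def unigram_f1(hyp, ref):
    hyp_set, ref_set = set(hyp), set(ref)
    overlap = len(hyp_set & ref_set)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp_set), overlap / len(ref_set)
    return 2 * p * r / (p + r)

gold = ["heck", "yes"]

print(unigram_f1(["yes"], gold))        # partial overlap -> partial credit
print(unigram_f1(["yep"], gold))        # false negative: correct meaning, score 0.0
print(unigram_f1(["heck", "no"], gold)) # false positive: wrong meaning, high score
```

"Yep" is a perfectly good answer but shares no tokens with the gold reference, while the wrong answer "Heck no" is rewarded for sharing "heck".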
[00:58:34] So, moving beyond these failure modes of lexical-based metrics, the next step is to check for semantic similarity, and model-based metrics are better at capturing semantic similarity. So this is kind of similar to what was raised a couple of minutes ago: we can actually use learned representations of words and sentences to compute semantic similarity between generated and reference text.
[00:58:58] So now we are no longer bottlenecked by n-grams; instead we are using embeddings. These embeddings are going to be pre-trained, but the metrics can still live on, because we can just swap in different pre-trained embeddings and keep the fixed metrics.
[00:59:11] So here are some good examples of metrics that could be used. One is to do vector similarity. This is very similar to homework one, where you are trying to compute similarity between words, except now we're trying to compute similarity between sentences.
There are some ideas for how to go from word similarity to sentence similarity. For example, you can just average the embeddings, which is a relatively naive idea, but it works sometimes.
[00:59:39] Another high-level idea is that we can measure word mover's distance. The idea here is that we can use optimal transport to align the source and target word embeddings. Suppose that your source sentence is 'Obama speaks to the media in Illinois' and the target is 'The president greets the press in Chicago'. From a human evaluation perspective these two are actually very similar, but they are not exactly aligned word by word, so we need to figure out how to optimally align word to word, like aligning 'Obama' to 'president' and 'Chicago' to 'Illinois', and then we can compute the pairwise word embedding differences and get a good score for the sentence similarity.
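The "just average the embeddings" idea can be sketched in a few lines (the tiny 3-dimensional vectors below are hand-made stand-ins for real pre-trained embeddings, and the unrelated "economy" sentence is my own contrast example):

```python
# Sentence similarity by averaging word embeddings and taking cosine similarity.
import math

emb = {
    "obama":     [0.9, 0.1, 0.0],
    "president": [0.8, 0.2, 0.0],
    "media":     [0.1, 0.9, 0.1],
    "press":     [0.2, 0.8, 0.1],
    "illinois":  [0.0, 0.2, 0.9],
    "chicago":   [0.1, 0.1, 0.9],
    "economy":   [-0.9, 0.3, -0.2],  # unrelated topic for contrast
}

def sentence_vec(words):
    """Average the word vectors, ignoring words we have no embedding for."""
    vecs = [emb[w] for w in words if w in emb]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

src = sentence_vec("obama speaks to the media in illinois".split())
tgt = sentence_vec("the president greets the press in chicago".split())
other = sentence_vec(["economy"])

# The paraphrase pair should score much higher than an unrelated sentence.
print(cosine(src, tgt), cosine(src, other))
```

Note that averaging throws away word order and alignment, which is exactly why the alignment-based ideas (word mover's distance, BERTScore) were developed.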
[01:00:27] And finally there is BERTScore, which is also a very popular metric for semantic similarity. So it first computes pairwise cosine distances using BERT embeddings, then it finds an optimal alignment between the source and target sentences, and then it finally computes a score. I feel like these details are not really that important, but the high-level idea is super important: we can now use word embeddings to compute sentence similarity, by doing some sort of smart alignment, and then transform from word similarity to sentence similarity.
[01:01:02] To move beyond word embeddings, we can also use sentence embeddings to compute sentence similarity. Typically this doesn't have the very comprehensive word-by-word alignment problem,
[01:01:18] And similarly there is BLEURT, which is slightly different: it is a regression model based on BERT. The model is trained, as a regression problem, to return a score that indicates how good the text is in terms of grammaticality and similarity in meaning with the reference text. So this treats evaluation as a regression problem.

[01:01:40] Any questions so far? Okay, cool, we can move on.

[01:01:50] So all the previously mentioned approaches evaluate semantic similarity, so they can be applied to non-open-ended generation tasks. But what about open-ended settings? Here, enforcing semantic similarity seems wrong, because a story can be perfectly fluent and perfectly high quality without having to resemble any of the reference stories. So one idea here is:
[01:02:14] Maybe we want to evaluate open-ended text generation using the MAUVE score. MAUVE computes an information divergence, in a contextual embedding space, between the generated text and the gold reference text.

[01:02:26] Here is roughly what's going on. Suppose that you have a batch of text from the gold reference that is human-written, and you have a batch of text that is generated by your model. Step number one is to embed this text, to put it into some continuous representation space, which is the figure on the left. But it's really hard to compute any distance metric in this continuous embedding space, because different sentences might actually lie very far away from each other. So the idea here is to run k-means clustering to discretize the continuous space into some discrete space.
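A minimal sketch of this discretize-then-compare recipe: assume the k-means step has already assigned each text's embedding to one of k clusters (the cluster ids below are made up), then build smoothed histograms and take KL divergences in both directions. Real MAUVE summarizes a whole frontier of divergences rather than just these two endpoints:

```python
import math
from collections import Counter

def histogram(cluster_ids, k, eps=1e-6):
    """Smoothed relative frequencies over k clusters (eps avoids log(0))."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids) + eps * k
    return [(counts.get(i, 0) + eps) / total for i in range(k)]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical k-means cluster ids for each text in the two batches.
human_ids = [0, 0, 1, 2, 2, 2]   # gold, human-written texts
model_ids = [0, 1, 1, 1, 2, 2]   # machine-generated texts
P = histogram(human_ids, k=3)
Q = histogram(model_ids, k=3)

# One KL direction penalizes model mass off the human distribution
# (precision-like); the other penalizes missed human modes (recall-like).
forward_kl = kl(P, Q)
backward_kl = kl(Q, P)
```

Both divergences are zero only when the two discretized distributions match, which is why the score has the probabilistic precision/recall reading described next.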
[01:03:03] Now, after the discretization, we can actually build a histogram for the gold, human-written text and a histogram for the machine-generated text, and then we can compute precision and recall using these two discretized distributions: precision via the forward KL and recall via the backward KL. Yes, question?

[01:03:24] Why do we want to discretize it? So, why do we want to discretize: maybe it's equivalent to answering why it is hard to work with the continuous space. The idea is, if you embed one sentence into the continuous space, say it lies here, and you embed another sentence and it lies there, and you only have a finite number of sentences, then they would basically be Dirac delta distributions on your manifold, right?
[01:03:53] So it's hard to work with: you probably want a smoother distribution, but it's hard to define what a good smooth distribution is in the case of text embeddings, because they're not super interpretable. So eventually, if you embed everything in a continuous space, you will have lots of Dirac deltas that are just very tall and not really connected to their neighbors, so it's hard to quantify a divergence or a distance metric in that space.

[01:04:21] Well, you would have to make some assumptions. For example, you could make a Gaussian assumption, where you smooth all the embeddings by convolving with a Gaussian, and then you can start getting some meaningful distance metrics. But from the embeddings alone you're not going to get meaningful distance metrics, and it
doesn't really make sense to smooth things using a Gaussian anyway, because who says word representations are Gaussian-distributed? Yeah?

[01:04:51] (Inaudible student question about the cluster plots.) I think this requires some Gaussian smoothing, yeah, I think the plot is made with some smoothing. I mean, I didn't make those clouds, so I can't be perfectly sure, but the fact that it looks like this means it was smoothed a little bit. These are sentence embeddings, or concatenated word embeddings, because you are comparing sentences to sentences, not words to words.

[01:05:14] Yeah, so the advantage of the MAUVE score is that it is applicable to open-ended settings, because you are now measuring precision and recall with regard to the target distribution. Cool, so it has a better probabilistic interpretation than all the previous similarity metrics.

[01:05:36] Cool, any other questions? Yes? How is that different from just trying to maximize the similarity between them?
[01:05:52] Oh yeah, that's a good question. Well, this is because it's a case where it's really hard to get exactly the same thing. For example, I would say (maybe, because I've never tried this myself) that if you run a similarity metric on a machine translation task you might get a very high score, but if you run it on open-ended text generation you will get a super low score. It's just not really measurable, because everything is so different from everything else. So I feel like MAUVE is kind of a middle ground, where you are trying to evaluate things that are actually very far away from each other, but you still want a meaningful measurement.

[01:06:27] Yeah, of course, if your source and target are exactly the same, or differ only up to some rephrasings, you will get the best MAUVE score, but maybe that's not really what you're looking for.
[01:06:42] Because given the current situation, you only have generations that are very far away from the gold text, and the question is how we evaluate that type of thing.

[01:06:47] Yes, question in the back? I'm still trying to understand the MAUVE score; is it possible to write out the math, even in just a simple pseudo form? Yeah, I think it's possible; maybe we can continue this discussion after class, because I kind of want to finish my slides, but I'm happy to chat after class. There is a paper about it if you search for MAUVE score; I think it also won a best paper award at ICML or NeurIPS.

[01:07:16] Okay, so moving on. I've pointed out that there are so many evaluation methods, so let's take a step back and think about what makes a good metric: how do we evaluate evaluations? Nowadays the gold standard is still to check how well the metric is aligned with human judgment.
[01:07:35] So if a metric matches human preference, in other words if the metric correlates very strongly with human judgment, then we say that the metric is a good metric. In this plot, people have plotted the BLEU score and the human score on the y and x axes respectively, and because we don't see a strong correlation, this kind of suggests that BLEU is not a very good metric.

[01:08:01] So actually, the gold standard for evaluating language models is always to do human evaluation. Automatic metrics fall short of matching human decisions, and human evaluation is the most important criterion for evaluating text generated by a model. It is also the gold standard when developing automatic metrics, because we want everything to match human evaluation.
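This meta-evaluation step, checking a metric against human judgment, is just a correlation computation. A sketch with hypothetical scores for five system outputs:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical scores for five system outputs: human ratings vs. an
# automatic metric. A good metric should correlate strongly with humans.
human_scores  = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.2, 0.1, 0.5, 0.4, 0.9]
r = pearson(metric_scores, human_scores)
```

In practice people often report Spearman or Kendall rank correlation as well, since what usually matters is whether the metric ranks systems the same way humans do.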
[01:08:32] So what do we mean by human evaluation, and how is it conducted? Typically we provide human annotators with some axes that we care about: fluency and coherence for open-ended text generation, say; factuality for summarization; the style of the writing; and common sense, for example if you're trying to write a children's story.

[01:08:55] Another thing to note: please don't compare human evaluations across different papers or different studies, because human evaluations tend not to be well calibrated and are not really reproducible.

[01:09:07] Even though we believe that human evaluations are the gold standard, there are still many drawbacks. For example, human evaluations are really slow and expensive. And even beyond the slowness and the expense, they are still not perfect.
[01:09:27] First, the results may be inconsistent and not very reproducible: if you ask the same human whether they prefer A or B, they might say A the first time and B the second time. Human evaluations are also typically not really logical, and sometimes the human annotators might misinterpret your question. Suppose that you want them to measure the coherence of the text: different people have different criteria for coherence. Some people might think coherence is equivalent to fluency, and then they look for grammaticality errors; some people might think coherence means how well your continuation is aligned with the prompt or the topic. So there are all sorts of misunderstandings that might make human evaluation very hard.

[01:10:08] And finally, human evaluation only measures precision, not recall.
[01:10:17] This means that you can give a sentence to a human and ask the human how they like the sentence, but you can't ask the human whether the model is able to generate all possible sentences that are good. So it's only a precision-based metric, not a recall-based metric.

[01:10:29] So here are two approaches that try to combine human evaluation with modeling. The first idea is basically to learn a metric from human judgments: use human judgment data as training data and train a model to simulate human judgment. The second approach is to ask the human and the model to collaborate, so that the human is in charge of evaluating precision, whereas the model is in charge of evaluating recall.

[01:11:02] We have also tried approaches for evaluating models interactively.
[01:11:11] In this case we not only care about the output quality; we also care about how the person feels when they interact with the model, when they try to be a co-author with the model, how the person feels about the writing process, etc. This is called evaluating the models more interactively.

[01:11:29] So the takeaways here: content overlap is a bad metric. Model-based, semantic metrics are better, because they focus more on semantics, but they're still not good enough. Human judgment is the gold standard, but it's hard to do a human study well. And in many cases (this is a hint for the final project) the best judge of the output quality is actually you. So if you want to do a final project in natural language generation, you should look at the model output yourself, and don't just rely on the numbers reported by BLEU or ROUGE or something.

[01:12:05] Cool.
[01:12:08] So finally, we will discuss ethical considerations of natural language generation.

[01:12:12] As language models get better and better, ethical considerations become much more pressing, so we want to ensure that the models are well aligned with human values. For example, we want to make sure the models are not harmful and not toxic, and we want to make sure that the models are unbiased and fair to all demographic groups. So for example, we don't want the model to generate any harmful content. I tried to prompt ChatGPT, asking "can you write me some toxic content?", and ChatGPT politely refused me, which I'm quite happy about. But there are other people who try to jailbreak ChatGPT. I think internally they probably implement some detection tools, so that if you try to prompt it adversarially, it will avoid doing adversarial things.
[01:13:03] But there are many very complicated ways to prompt ChatGPT so that you can get over the firewall and still use its ability to generate some, I don't know, bad English.

[01:13:20] Another problem with these large language models is that they are not necessarily truthful. For example, there was the very famous news that Google's model generated a factual error, which is quite disappointing, but the way the model talks about it is very convincing, so you wouldn't really know that it's a factual error unless you go and check that this is not actually the first picture, or something. So we want to avoid this type of problem.

[01:13:53] The models have actually already been trying very hard to refrain from generating harmful content.
[01:14:03] But for models that are more open-source and smaller, the same problem still appears, and typically, when we do our final projects or work with models, we are probably going to deal with much smaller models, so we need to think about ways to deal with these problems better.

[01:14:17] Text generation models are often built from pre-trained language models, and pre-trained language models are trained on internet data, which contains lots of harmful and biased content. So when the models are prompted for this information, they will just repeat the negative stereotypes that they learned from the internet training data. One way to avoid this is to do extensive data cleaning, so that the pre-training data does not contain any biased or stereotypical content. However, this is going to be very labor-intensive and almost impossible to do.
[01:14:51] Filtering that large an amount of internet data is just so costly that it's not really possible.

[01:14:58] Again, for existing language models like GPT-2 Medium, there are some adversarial inputs that almost always trigger toxic content, and these models might be exploited in the real world by ill-intentioned people. For example, there's a paper about universal adversarial triggers, where the authors find a universal set of words that triggers toxic content from the model.

[01:15:28] And sometimes, even if you don't try to trigger the model, the model might still start to generate toxic content by itself: the pre-trained language models are prompted with very innocuous prompts, but they still degenerate into toxic content.
[01:15:50] So the takeaway here is that models really shouldn't be deployed without proper safeguards to control for toxic content, or any harmful content in general, and models should not be deployed without careful consideration of how users will interact with them.

[01:16:02] So in this ethics section, one major takeaway is that we are trying to advocate that you think more about the model that you are building. Before deploying or publishing any NLG model, please check that the model's output is not harmful, and please check that the model is robust to trigger words and other adversarial prompts. And of course there is more; basically, one can never do enough to improve the ethics of text generation systems.

[01:16:34] Okay, cool, I still have three minutes left, so I can still do concluding thoughts.
exciting applications of natural language generation systems [01:16:44] language generation systems um so but well one might think that [01:16:47] um so but well one might think that while given that chaiji 50 is already so [01:16:49] while given that chaiji 50 is already so good are there any other things that we [01:16:51] good are there any other things that we can do research-wise if you try [01:16:53] can do research-wise if you try interacting with these models [01:16:55] interacting with these models um if you try to interact with these [01:16:56] um if you try to interact with these models actually you can see that there [01:16:58] models actually you can see that there are still lots of limitations in their [01:17:00] are still lots of limitations in their skills and performance for example check [01:17:02] skills and performance for example check GPT is able to like do a lot of things [01:17:05] GPT is able to like do a lot of things with manipulating text but it couldn't [01:17:08] with manipulating text but it couldn't really create like interesting contents [01:17:10] really create like interesting contents or I couldn't really think deeply about [01:17:12] or I couldn't really think deeply about stuff so it's still also so there are [01:17:14] stuff so it's still also so there are lots of headrooms and there are still [01:17:16] lots of headrooms and there are still many improvements ahead [01:17:18] many improvements ahead and evaluation remains a really huge [01:17:21] and evaluation remains a really huge challenge in natural language Generation [01:17:23] challenge in natural language Generation Um basically we need better ways to [01:17:25] Um basically we need better ways to automatically evaluate performance of [01:17:27] automatically evaluate performance of nlg models because human evaluations are [01:17:29] nlg models because human evaluations are expensive and not reproducible so it's [01:17:33] expensive and not reproducible so it's better 
to figure out ways to compile all those human judgments into a very reliable and trustworthy model. [01:17:41] And also, with the advance of all these large-scale language models, doing neural natural language generation has been reset, and it's never been easier to jump into this space, because now all the tools are already there for you to build upon. And finally, it is one of the most exciting and fun areas of NLP to work on. So yeah, I'm happy to chat more about NLG if you have any questions after class, and in class, I guess, in the one minute left. [01:18:10] Okay, cool, that's everything. So do you have any questions? If you don't, we can end the class. ================================================================================ LECTURE 011 ================================================================================ Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 10 - Post-training by Archit Sharma Source: https://www.youtube.com/watch?v=35X6zlhoCy4 --- Transcript [00:00:05] Good evening, people. Um, how are you guys
doing? [00:00:13] All right, my name is Archit Sharma. I'm a PhD student at Stanford, and I'm very, very excited to talk about post-training, generally speaking, for large language models. And I hope you guys are ready to learn some stuff, because the last few years in machine learning have been very, very exciting, uh, with the advent of large language models, ChatGPT, and everything to that extent. And hopefully after today's lecture you will be more comfortable understanding how we go from pre-trained models to models like ChatGPT, and we'll take a whole journey through prompting, instruction fine-tuning, and DPO and RLHF. So let's get started. [00:00:53] All right, so something that has been very fundamental to our entire field is this idea of scaling laws: models are increasingly becoming larger and larger, and they're expending more and more compute. So this is a graph of models
starting all the way back in the 1950s to somewhere around now; this is still an outdated graph, so it shows up to 10^24 FLOPs, or floating point operations, that go into pre-training these models, but the number is well above 10^26 now. But you can see the graph and the way it's trending. [00:01:28] And more and more compute requires more and more data, because you need to train on something meaningful, and this is roughly the trend in the amount of language tokens going into language models in pre-training. And again, this plot is outdated. We're in 2024; in 2022 we were at 1.4 trillion tokens, or words, roughly speaking, in language model pre-training. Does anyone want to guess where we are in 2024? [00:02:00] That's a pretty good guess, yeah. So we're close to 15 trillion tokens; um, the recent Llama 3 models were roughly trained on 15 trillion tokens. So
yeah, just for a second, appreciate that these are a lot of words. I don't think any of us listens to trillions of tokens in our lifetime. So this is where we are right now, and I hope you guys were here for the pre-training lectures. Cool. [00:02:30] Um, so what do we do? So, broadly speaking, we are really just learning to predict text tokens, or language tokens, but what do we learn in the process of pre-training? Why are people spending so much money and so much compute? Because this compute and these tokens cost dollars, and we're on the order of spending hundreds of millions of dollars on these runs. So why are we doing this? And this is basically a recap of whatever you have probably learned till now, but we're learning things like, oh, we are learning knowledge: Stanford University is located in Santa Clara, California, or
wherever you want to say. You're learning syntax, you're learning semantics of the sentences; these are things that you would expect to learn when you're training on language data. Broadly, you're probably learning a lot about different languages as well, so depending on your text data distribution, you're learning a lot of things. But the models we interact with are very intelligent, so where is that coming from? I mean, you're simply learning very factual things, and it's a very simple loss function we're optimizing, so where is that intelligence coming from? [00:03:35] And this, perhaps, is the interesting bit. Recently, people have started accumulating evidence that when you optimize the next-token prediction loss, you're not just learning about syntax, you're not just learning knowledge, but you're starting to form models of agents' beliefs and actions as well.
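The "very simple loss function" being optimized is next-token cross-entropy. A minimal sketch, with a made-up five-word vocabulary and toy logits rather than any real model's numbers:

```python
import math

def next_token_nll(logits, target_id):
    """Negative log-likelihood of the target token under a softmax over
    the model's output logits: the per-position pre-training loss."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[target_id] - log_z)

# Toy vocabulary: ["Stanford", "is", "located", "in", "California"].
# Suppose the model, after "Stanford University is located in", emits:
logits = [0.1, 0.2, 0.3, 0.5, 3.0]
loss = next_token_nll(logits, target_id=4)  # target word: "California"
# A confident, correct model gives the target high probability, so the
# loss is small; a wrong guess would be penalized heavily.
```

Pre-training just minimizes the average of this quantity over trillions of token positions; everything discussed here (facts, syntax, models of agents) has to fall out of that one objective.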
So how do we know this? Again, a lot of this is speculative evidence, but it's a way to form an understanding that the losses we're optimizing are not just about fitting the data; you start learning something maybe more meaningful as well. [00:04:08] Um, for example, in this specific case, we change the last sentence, and the prediction of the next text changes as well. So here it starts with: Pat watches a demonstration of a bowling ball and a leaf being dropped at the same time. Pat, who is a physicist, predicts that the bowling ball and the leaf will land at the same rate. We all know gravity, the way it works. But when you change the last sentence to: Pat, who has never seen this demonstration before, then Pat predicts that the bowling ball will fall to the ground first. Maybe
somebody who's never seen this experiment before might intuitively believe that, correct? So the language model was able to predict this. And how do you predict this? You have to have some notion of understanding of how humans work to even be able to predict this, and that's maybe something that is not obvious when you're simply optimizing to predict the text. [00:05:03] Similarly, we're going to run through some examples to communicate that when you're pre-training these models, you're learning much more than just language tokens and so on. You're also learning about math: you're able to understand what the graph of a circle means, what the center is, and how to understand equations. [00:05:22] Probably my favorite example, something I use pretty much every day, is that you're learning how to write code. So I
don't know how many of you have interacted with Copilot before, but if you have, you probably know that if you write down a few comments, write down a function template, it will automatically complete the code for you. So again, it's not perfect, but it has to have some deeper understanding of what your intent is for something like that to emerge. [00:05:48] And similarly, we have examples from medicine as well. I don't know about you guys, but whenever I have some issue, I probably go to ChatGPT or Claude or something to that effect and ask them for a diagnosis. [00:06:00] Um, I don't recommend that; uh, please don't take medical advice from me. But yeah, so broadly, the way we're seeing language models at this point is that they're sort of emerging as these general-purpose multitask assistants, and it's very strange, right? We started off with text token prediction, and we're
reaching the stage where we can sort of rely on them to do many, many different things. So how are we getting there? And I'm sure you all are aware of what these models are, so yeah. [00:06:30] So today's lecture is largely going to be about how we go from something like "Stanford University is located...", this very simple pre-training task (a very simple procedure; well, it's more complicated, but in abstract terms it's not very complicated), to something as powerful as ChatGPT. Cool. [00:06:48] So, um, I recommend you guys stop me and ask me a lot of questions, because there are a lot of fun examples and a lot of fun techniques, and I want you guys to learn everything here. So the overall plan is: we're going to talk about zero-shot and few-shot in-context learning; um, next we're going to follow up with instruction fine-tuning; and then we're going to talk
about optimizing for preferences, and this is roughly where things are right now in the industry. And then we're going to talk about what's next, what the limitations are, and how we move on from here. [00:07:20] Cool. So we're going to start off with zero-shot and few-shot in-context learning. Um, broadly, we're going to take the example of GPT, or the Generative Pre-trained Transformer. This is a whole series of models that started off in roughly 2018, and up to 2020 they were building GPT, GPT-2, GPT-3. So we're going to start off with this example. And yes, it's a decoder-only model that is trained on roughly 4.6 GB of text, it has 12 Transformer layers, and it's trained with the next-token prediction loss. [00:07:53] And the first model obviously was not extremely good, but it started showing that, hey, this technique for pre-training can be very effective
for general-purpose tasks, and we're going to see some examples. Um, for example, here it's able to do the task of entailment, and, okay. [00:08:16] Um, yeah, GPT-1 itself was not very strong as a model, but they took the same recipe and tried to increase the model size, so they went from 117 million parameters to about 1.5 billion parameters, and they scaled up the data alongside as well, so we went from about 4 GB of data to approximately 40 GB of data. And pre-training is a whole different melting pot of techniques, and there's a lot that goes into it, but roughly, for example, here they filtered data by the number of upvotes on the Reddit data. [00:08:48] And yeah, so this is roughly where we are, and I think one of the things that started emerging with GPT-2 is zero-shot learning. And what do we mean by zero-shot learning? [00:09:02] Um, conventionally in the field,
when we pre-train models, there was the idea that you take a few examples, you update the model, um, and then you are able to adapt to a specific task. But as you pre-train on more and more data and more and more tasks, you sort of start seeing this phenomenon where models are able to do the task basically zero-shot; they're shown no examples of how to do the task. And you can start thinking of how you can do summarization, you can follow some instructions, you can maybe do a little bit of math as well. So this is where the idea of zero-shot learning started to emerge. [00:09:38] Yeah, so how do we do zero-shot learning, or task-specific learning, from these pre-trained models? Really, the idea is that we have to be creative here. We know that these are text prediction models: if you put in a text, they will complete whatever follows. So if we can sort of coax these models into completing the task we care
about, maybe it's question answering, we can start getting them to solve tasks here. So for example, if you want to ask questions about Tom Brady, you sort of set it up: you put information about Tom Brady, and then you put a question that you want answered, and then it will autocomplete in some sense. So this is one early perspective on these models: they are very advanced autocomplete models. [00:10:21] And similarly, if you want to figure out which answer is true and which is not, something that is very useful to measure is log probabilities. So, for example, we want to figure out what the word "it" is referring to in this sentence: "the cat couldn't fit into the hat because it was too big." Um, what we can do is take the sentence, replace "it" with either "the cat" or "the hat," and then measure which substitution the model thinks is more probable.
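This scoring trick can be sketched end to end with a toy stand-in for the language model; the bigram probabilities below are invented for illustration, where a real system would instead sum GPT-2's per-token log-probabilities over each candidate sentence:

```python
import math

# Toy bigram "language model": P(next word | previous word). These numbers
# are made up for this sketch; a real system would use an actual LM's
# per-token log-probabilities.
BIGRAM = {
    ("the", "cat"): 0.20, ("the", "hat"): 0.05,
    ("cat", "was"): 0.30, ("hat", "was"): 0.30,
    ("was", "too"): 0.40, ("too", "big"): 0.25,
}

def sentence_logprob(words):
    """Sum log P(w_i | w_{i-1}); unseen bigrams get a small floor prob."""
    return sum(math.log(BIGRAM.get(pair, 1e-4))
               for pair in zip(words, words[1:]))

# "The cat couldn't fit into the hat because it was too big."
# Resolve "it" by substituting each candidate and scoring the result.
cat_score = sentence_logprob("the cat was too big".split())
hat_score = sentence_logprob("the hat was too big".split())
referent = "cat" if cat_score > hat_score else "hat"  # higher log-prob wins
```

Because the two substituted sentences get directly comparable log-probabilities, the comparison requires no task-specific training at all.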
And you can sort of get the idea of what the reference is. So none of this is in the training data; it's simply learning to predict text, but you can start seeing how we can leverage these models to do other tasks as well, besides prediction. [00:11:06] So this is just more evidence of how GPT-2, with no task-specific fine-tuning, no task-specific training, simply learning to predict text, establishes the state of the art on many, many different tasks, simply by scaling up the model parameters and the amount of data it's trained on. [00:11:29] So this is a fun example. If you want to do summarization, say you have a news article that you want to summarize, how do you get a zero-shot model to do it? The answer is: you put the document into the context and you simply put "TL;DR" in front of it.
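As a minimal sketch of that prompt construction (the exact string layout is an assumption; the idea is just to attach the cue to the document and read off the model's continuation):

```python
def tldr_prompt(document: str) -> str:
    """Frame summarization as plain text continuation: put the article in
    context, append the 'TL;DR:' cue, and let the model autocomplete.
    The exact layout here is an assumption for illustration."""
    return document.rstrip() + "\nTL;DR:"

prompt = tldr_prompt("A long news article about the game goes here...")
# Whatever the language model generates after `prompt` is read off as the
# zero-shot summary; no summarization-specific training is involved.
```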
Now, if most of the data on the internet is such that whenever you see TL;DR, it naturally summarizes what came before, then yeah, you can get zero-shot summarization performance here as well. And again, this is not trained to do summarization in any specific way, and it's still doing really well, simply because of its pre-training data. [00:12:07] So yeah, um, I think GPT-2 with TL;DR is somewhere there on the plot, and some of the very task-specific trained models are up there. And I think you will see the trend: again, if you were Alec Radford or somebody, and you see these cool things emerging, your next step would obviously be, I'm going to scale this up a little more, I'm going to make an even bigger model, I'm going to train it on even more data, and we'll see how things go, right? So that's how we got GPT-3. Uh, we went from 1.5 billion parameters to 175 billion parameters, and we went from 40 GB of data to 600 GB of data. Of course, now we're
in [00:12:44] gbt of data of course like now we're in like terabytes of data and text is a [00:12:47] like terabytes of data and text is a very compressed representation so like [00:12:48] very compressed representation so like terabytes of data is a [00:12:50] terabytes of data is a lot um and you know we we talked about [00:12:53] lot um and you know we we talked about zero shot learning the cool thing that [00:12:56] zero shot learning the cool thing that emerged in gbd3 is like go ahead like [00:13:00] emerged in gbd3 is like go ahead like used before the passage right no you [00:13:03] used before the passage right no you typically put the passage uh if youve [00:13:04] typically put the passage uh if youve like interacted with Reddit or something [00:13:06] like interacted with Reddit or something like that typically somebody will write [00:13:08] like that typically somebody will write an entire post and then end with TLD drr [00:13:11] an entire post and then end with TLD drr here's a summary of the thing too long [00:13:13] here's a summary of the thing too long didn't read or if you have [00:13:15] didn't read or if you have used opposite comes first oh yeah there [00:13:20] used opposite comes first oh yeah there are situations where it also comes first [00:13:21] are situations where it also comes first but um one reason is that these are like [00:13:24] but um one reason is that these are like decoder only models so like they are [00:13:27] decoder only models so like they are often these are causal attention models [00:13:28] often these are causal attention models so the typically need to see the context [00:13:30] so the typically need to see the context before yeah understand I'm just curious [00:13:33] before yeah understand I'm just curious like from my experience the comes first [00:13:36] like from my experience the comes first then how is [00:13:38] then how is it [00:13:40] it the okay um there's probably a lot of [00:13:43] the okay um 
there's probably a lot of data where the TL;DR comes first, but there's probably a lot of data where it comes after as well. [00:13:47] Cool. So we saw zero-shot learning emerging in GPT-2. Few-shot learning maybe seems slightly easier, but this is where things started getting really interesting: you start to beat the state of the art simply by putting examples in context. So what does few-shot learning mean here — what are we talking about? As I mentioned, the typical idea is that if you want to solve, say, translation, you put some examples of translation into the context — or maybe it's a correction task, or whatever task you're interested in. No gradient updates, no learning in any conventional sense whatsoever: you put a few examples in, and that's it — the model knows how to solve the task. Isn't that crazy?
[00:14:37] You guys did the assignment on translation, right? Well, this is what modern NLP looks like: you put in some examples and you have the entire system. And this is where things got really interesting for all these task-specific models that were created to be really, really good at translation or really good at summarization. Let's look at this graph. You start with the zero-shot performance, somewhere down there. You put in one example of translation from English to French and you already get to a decent level; a few examples in, you're already starting to get close to the state-of-the-art models. [Student] Wait, but in that graph the state of the art is really high, isn't it? [Instructor] The fine-tuned baseline here, I think, is the one I'm referring to — the fine-tuned model which
is trained exclusively on a lot of translation data, so it might be slightly better, yes. And I think the relevant comparison here is that in-context learning starts to emerge at scale. [00:15:45] This, I think, is the key point — some of this is contested, just to be very upfront — but there's this idea of the emergence of this property as you train with more compute and more scale. There's more recent research which suggests that if we plot the axes correctly it looks less emergent, but the general idea holds: as you increase the number of parameters and the amount of compute going into these models, the ability to go from a few examples to really strong performance is very compelling. [00:16:17] Cool. And as I explained earlier, the general idea is that this is very different from the conventional idea of fine-tuning that we
typically go for. Instead of iterating over examples and doing gradient updates, we just do few-shot prompting: we put a few examples in context, and that gives us the system. [00:16:42] [Student question about the prompt format.] Yes — the exact details can depend on the prompt template you use, but typically you would just put the examples in, like "sea otter" and its translation, and then whatever your task is, you let the model complete from there, because it can infer the task from the examples you've given. Any other questions? [00:17:10] Cool. So we have gone from zero-shot prompting, and we've seen that few-shot prompting is becoming really competitive with good models. But there are still limitations to this: you cannot solve every task you see this way.
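Mechanically, few-shot prompting is just string construction — the example pairs go into the context with no gradient updates, and the model completes the last line. A minimal sketch; the helper name and task header are illustrative, while the "sea otter → loutre de mer" pair is the demonstration used in the GPT-3 paper:

```python
# Few-shot prompting: "learning" is just concatenating labelled examples
# into the context and letting the model complete the final, unlabelled line.

def build_few_shot_prompt(examples, query, task_header="Translate English to French:"):
    """Concatenate k labelled examples, then the unlabelled query."""
    lines = [task_header]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # the model infers the task and completes this
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),  # demonstration pair from the GPT-3 paper
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt(examples, "peppermint")
print(prompt)
```

Because no weights change, the same pre-trained model can be pointed at a different task just by swapping the examples in the prompt.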
In particular, things that involve richer, multi-step reasoning can actually be pretty challenging — and to be fair, humans struggle at these tasks as well. Things like addition are probably still hard as you keep increasing the number of digits. But one thing — and I alluded to this earlier — is that you have to start being creative: you can get these models to do the task if you're creative in how you prompt them, and that's what we're going to see next. [00:17:53] So this technique called chain-of-thought prompting emerged. The idea we have explored thus far is that we put in examples of the kind of task we want done, and we expect the model to infer what the task is and go from there. The new idea is that instead of just showing what the task is, you show the model examples where it reasons
through the task, so it's not just learning to do the task but also learning how the reasoning works. In this example, we started with a simple math problem where the prompt shows exactly the final answer, directly; if you do that, you'll observe that the model gets the answer wrong. Instead, what if you show the model how to reason about the task — show it a chain of thought, and include that in the prompt as well — and then ask it a new question? The idea is that now the model is not just going to output an answer; it's going to reason about the task, and it actually does a lot better. This has been shown to be very effective. [00:18:57] And chain of thought, as you can see, is also something that improves a lot with model scale.
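The contrast between the two prompt styles can be sketched as follows: a standard demonstration shows only the final answer, while a chain-of-thought demonstration includes the intermediate reasoning. The worked "tennis balls" example is the well-known one from the chain-of-thought prompting literature; the helper functions themselves are hypothetical:

```python
# Chain-of-thought prompting: the in-context example includes intermediate
# reasoning, not just the final answer, so the model imitates the pattern.

COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def standard_prompt(new_question):
    """Answer-only demonstration: shows just the final answer."""
    return ("Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. "
            "How many does he have now?\nA: The answer is 11.\n"
            f"Q: {new_question}\nA:")

def chain_of_thought_prompt(new_question):
    """Demonstration with reasoning: the model tends to reason before answering."""
    return COT_EXAMPLE + f"Q: {new_question}\nA:"

print(chain_of_thought_prompt(
    "A cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?"))
```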
[00:19:09] But what you can probably start seeing is that it's nearly better than the best supervised models here — the PaLM models were roughly 540 billion parameters, and simply with this chain-of-thought kind of skill you're already beating the state of the art. [00:19:25] Cool. So I showed you examples of chain-of-thought reasoning, where you go through a reasoning chain, but you can be even slightly smarter than that. You might not even need to show any examples; you just need to nudge the model into thinking about what to do next. [00:19:49] This idea emerged in the "let's think step by step" paper: instead of even showing an example, you just start the answer with "Let's think step by step," and that's it — the model will start reasoning toward the answer itself, instead of just auto-completing to an answer, and you get something
like this. [00:20:12] So maybe you don't even need to show any examples; you can induce the reasoning behavior zero-shot as well. And here is what the final numbers look like: compared to the zero-shot performance we got from essentially auto-completing, zero-shot chain of thought substantially improves performance — you go from 17.7 to 78.7. It's still worse than putting actual examples of reasoning in a few-shot chain-of-thought prompt, but you can see how much the performance improves simply by asking the model to think step by step. Maybe the lesson for interacting with these models is this: you might not get the exact desired behavior up front, but often the models are capable of the behavior you want.
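The zero-shot variant drops the worked example entirely and only appends a trigger phrase before the model's answer. A sketch — the juggler question is the canonical example from the "let's think step by step" paper, and the function name is illustrative:

```python
# Zero-shot chain-of-thought: no demonstrations at all -- the prompt just
# seeds the answer with a trigger phrase that elicits step-by-step reasoning.

TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    # The model continues after the trigger, producing its own reasoning
    # chain before (hopefully) stating the final answer.
    return f"Q: {question}\nA: {TRIGGER}"

print(zero_shot_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"))
```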
[00:21:05] Often you have to think about how to induce that behavior, and the right way to think about it is perhaps: what is the pre-training data — what data on the internet might the model have seen that induces behavior similar to the kind I want — and then use that to elicit those behaviors from the model. [00:21:24] And, you know, we hand-designed some of these prompts, but you can also get an LLM to design the prompts; there are recursive self-improvement ideas here, and that can bump up the performance a little bit more. [00:21:42] Cool. So what we have seen so far is that as models get stronger and stronger, you can get them to do your task zero-shot or with a few examples, and you can nudge them into inferring what task you want them to solve. But the downside is that there's only so
much you can fit into context. That might not be very true anymore — models are getting increasingly large contexts — but it's still somewhat unsatisfactory that you have to trick the model into doing your task, rather than it just doing the task you want. And going forward, you probably still want to fine-tune these models for more and more complex tasks. That's where we're going next. [00:22:28] The next section covers instruction fine-tuning. The general idea: as we discussed, pre-training is not about assisting users — it is about predicting the next token. You can trick the model into assisting users and following your instructions, but in general that's not what it was pre-trained for. Here's an example: if you ask GPT-3, a pretty strong model, to explain
the moon landing to a six-year-old in a few sentences, it will follow up with more questions about what a six-year-old might want — which is not what you wanted the model to do. [00:23:07] The general term people use these days is that these models are not aligned with user intent, and the next sections are going to talk about how to align them with user intent, so that you don't have to trick the model into doing whatever you want it to do. [00:23:20] And this is the kind of desired completion we want at the end of instruction tuning. So how do we get from those pre-trained models to models that can respond to user intent? [00:23:34] I hope the general idea of pre-training and fine-tuning was covered somewhere in the class. What you have probably seen thus far is that you pre-train on a lot of
different language data, but then you fine-tune on your specific task: you take the same decoder-only model and fine-tune it for some task with a very small amount of data. The thing that's different now is that we're no longer fine-tuning on a little data for one task — we're going to fine-tune on many, many different tasks, and try to fold them into a single usable UX for users. This is where instruction fine-tuning comes in. [00:24:19] Cool. The recipe is not very complicated: we're going to collect a lot of examples of instruction-output pairs, where the instructions range over tasks of several different forms — question answering, summarization, translation, code, reasoning, and so on — and we collect a lot of examples related to all
those tasks. The idea is that we train on the instruction-output pairs exactly as given, and then evaluate on some unseen tasks as well. That's the general paradigm of instruction fine-tuning. [00:24:57] And again, it's the same idea we explored in pre-training: data plus scale is really important. These days you start off with one task and extend it over thousands and thousands of tasks, with three-million-plus examples; that's the broad range of tasks you might see in instruction fine-tuning datasets. [00:25:16] You might even ask why we're still calling it fine-tuning anymore — it's almost starting to look like pre-training — but these are just terms, so you can decide whatever you're comfortable with. [00:25:30] So we get this huge instruction dataset and we fine-tune the model on it.
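Concretely, the training examples are just serialized instruction-output pairs drawn from many tasks, and fine-tuning minimizes the usual next-token loss on them (often only on the response tokens). A sketch with an illustrative template — the field names and the three example pairs are made up for illustration, not from any specific dataset:

```python
# Instruction fine-tuning data: (instruction, output) pairs from many
# different tasks, serialized into one training string per example.

def format_example(instruction, output):
    # Illustrative template; real datasets use many varied templates.
    return f"Instruction: {instruction}\nResponse: {output}"

dataset = [
    ("Translate to French: cheese", "fromage"),
    ("Summarize: The quick brown fox jumps over the lazy dog.",
     "A fox jumps over a dog."),
    ("What is 2 + 2?", "4"),
]

# Mixing thousands of tasks like these is what makes "fine-tuning"
# start to look like pre-training.
training_strings = [format_example(i, o) for i, o in dataset]
print(training_strings[0])
```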
[00:25:37] The next question is how we evaluate these models. I think you'll see another lecture on evaluation, so I don't want to dive too deep into this, but generally, evaluation of these language models is an extremely tricky topic — there are a lot of biases you need to deal with, and a lot of this will be covered later. Some more recent progress is that we're starting to curate really large benchmarks, like MMLU, where models are tested on a broad range of diverse knowledge. This is just one example, and these are the topics you'll see. To give some intuition of what the examples in these evaluations look like: under astronomy you might be asked what is true for a Type Ia supernova, or you might be asked some questions about biology.
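Benchmark items like the astronomy example are multiple-choice, and evaluation boils down to accuracy of the predicted letter against an answer key. A sketch of rendering and scoring in the MMLU style — the helper names, the option texts, and the stub predictions are all illustrative, not the actual benchmark data:

```python
# MMLU-style multiple-choice evaluation: render each question with
# lettered options, then compare predicted letters to the answer key.

def render_mc(question, choices):
    """Render one multiple-choice item as a prompt ending in 'Answer:'."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{opts}\nAnswer:"

def accuracy(predictions, answer_key):
    """Fraction of items where the predicted letter matches the key."""
    return sum(p == a for p, a in zip(predictions, answer_key)) / len(answer_key)

prompt = render_mc(
    "What is true for a Type Ia supernova?",  # topic mentioned in the lecture
    ["It occurs in a binary star system",     # option texts invented here
     "It is powered by core collapse",
     "It leaves behind a neutron star",
     "It only occurs in spiral galaxies"])
print(prompt)
print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # stub predictions vs. key
```

A real run would replace the stub predictions with letters extracted from an actual language model's completions, optionally with few-shot or chain-of-thought prompting prepended.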
And there's a huge host of tasks like this. [00:26:30] These are typically multiple-choice questions, and you can ask the model to answer them. If the model is instruction fine-tuned already, hopefully it can simply answer the question, but you can also chain-of-thought prompt or few-shot prompt these questions. [00:26:43] Recently there's been a huge amount of progress on this benchmark. What people have observed is that more and more pre-training, on more and more data, with larger models, simply keeps climbing the number. 90% is often seen as the benchmark number these models want to cross, because it's roughly human-level knowledge or understanding, and recently the Gemini models reportedly crossed this number. [00:27:12] Go ahead. [Student] Isn't this the same benchmark story all over again? Like, at some point you realize, okay, maybe my
methods are implicitly too fine-tuned to the benchmark — isn't something like that happening here as well? [Instructor] Yes, I think this is a tricky topic. For a lot of models there's this question of whether your test sets are leaking into your training data, and there are huge concerns about that. It's a perfectly valid question — how do we even evaluate? — and this is why evaluation is actually very tricky. But one general thing to keep in mind: at some point it doesn't matter what your train/test split is if the models are generally useful. If the models are doing useful stuff — if you train on everything you care about and the model does well on it — does it matter? [00:28:03] So yeah, we still need better ways to evaluate these models, and
how they're if they're improving the [00:28:14] and how they're if they're improving the model or not but at some point like that [00:28:16] model or not but at some point like that those boundaries start to like be less [00:28:22] important cool so massive progress on [00:28:24] important cool so massive progress on this Benchmark starting with gpd2 and [00:28:26] this Benchmark starting with gpd2 and like we're roughly at 90% which to the [00:28:29] like we're roughly at 90% which to the point where these benchmarks are [00:28:30] point where these benchmarks are starting to become unclear if like [00:28:32] starting to become unclear if like improvements on these are actually [00:28:33] improvements on these are actually meaningful or not um in fact like most [00:28:37] meaningful or not um in fact like most of the times when the models are wrong [00:28:39] of the times when the models are wrong like you might often find that the [00:28:41] like you might often find that the question itself was unclear or ambiguous [00:28:43] question itself was unclear or ambiguous so all evaluation benchmarks have a [00:28:46] so all evaluation benchmarks have a certain limited utility to [00:28:48] certain limited utility to them so yeah um going to go over like [00:28:52] them so yeah um going to go over like another evaluation example of how this [00:28:54] another evaluation example of how this recipe like changes things so T5 models [00:28:58] recipe like changes things so T5 models were instruction fine tuned on a huge [00:28:59] were instruction fine tuned on a huge number of tasks and another Trend to or [00:29:02] number of tasks and another Trend to or which I think will be the theme across [00:29:04] which I think will be the theme across this lecture is that as your models [00:29:06] this lecture is that as your models become larger as they're trained on more [00:29:07] become larger as they're trained on more data they become more and more [00:29:09] data they 
[00:29:11] become more and more responsive to your task information as well. So what you'll observe here is that as the number of parameters increases, from T5-Small and Flan-T5-Small up to 11 billion parameters with T5-XXL, the improvement from going from a pre-trained model to an instruction-tuned model actually grows: the larger instruction-tuned model is all the better at following instructions. The difference is +6.1, and it goes to +26.6 as the models become larger. So this is another very encouraging trend: you probably should train on a lot of data with a lot of compute, and pre-training just keeps on giving. So yeah, I hope you all get a chance to play with a lot of these models; I think you already are, hopefully. But yeah, before instruction fine-tuning, when you're asked a question related to
[00:30:09] disambiguation QA, you get something like this, and it doesn't actually follow the "let's think step by step" instruction very clearly; after instruction fine-tuning, it is able to answer the question here. And more recently people have been researching what the instruction-tuning data set should look like. There's a huge plethora of instruction-tuning data sets now available (this is just a representative diagram) and there's a whole open-source community developing around these as well. Some high-level lessons that we have learned from this: one lesson that I think is interesting is that we can actually use really large, strong models to generate some of the instruction-tuning data to train our smaller models. So take your favorite model right now, GPT-4 maybe, or maybe Claude, or whatever, and you can get it to answer some
[00:31:01] questions and generate instruction-output pairs for training your open-source or smaller model, and that actually is a very successful recipe. So instead of getting humans to collect all the instruction-output pairs, or getting humans to generate the answers, you can get bigger models to generate the answers as well. That's the first thing that has recently emerged. Another thing being discussed is how much data we need. I talked about millions of examples, but people have often found that if you have really high-quality examples, you can get away with a thousand examples as well; this is the paper "LIMA: Less Is More for Alignment", and how data scaling in instruction tuning affects final model performance is still an active area of research. And yeah, crowdsourcing these data sets can be effective as well, so there are very
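A minimal sketch of that distillation recipe, under loud assumptions: `strong_model` is a hypothetical stand-in for an API call to a large model (GPT-4, Claude, etc.), stubbed here with canned answers, and the instructions and outputs are invented for illustration.

```python
# Sketch of the distillation recipe: a strong model answers instructions,
# and the (prompt, target) pairs become supervised fine-tuning data for a
# smaller model. `strong_model` is a hypothetical stand-in for a real API
# call; here it just returns canned answers.

def strong_model(instruction: str) -> str:
    # Hypothetical stub; in practice this would query a large model.
    canned = {
        "Summarize: An earthquake hit San Francisco.": "Earthquake hits SF.",
        "Translate to French: hello": "bonjour",
    }
    return canned[instruction]

def build_sft_examples(instructions):
    """Package model-generated answers as (prompt, target) SFT pairs."""
    examples = []
    for inst in instructions:
        # Instead of a human writing the answer, the bigger model writes it.
        examples.append({"prompt": inst, "target": strong_model(inst)})
    return examples

sft_data = build_sft_examples([
    "Summarize: An earthquake hit San Francisco.",
    "Translate to French: hello",
])
```

The smaller model would then be fine-tuned on `sft_data` with the usual next-token prediction loss on the target.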
[00:31:46] cool benchmarks emerging, like OpenAssistant. Yeah, a lot of activity in the field, and hopefully a lot more progress as we go on. Yes? A question sort of in the spirit of this LIMA paper: don't code, or, I don't know, math word problems, have this desired structure? So shouldn't we just be training code models, doing some English stuff on top, and then saying, okay, this is the best reasoning we can get at some point? Because code has the structure where you're going sort of step by step and you're thinking in some way, breaking down a concept into something smaller, so you can consider code to have very high-value tokens, so maybe just doing that... So I think, again, pre-training is a whole dark art that I am not completely familiar with, but code actually
[00:32:46] ends up being really useful in pre-training mixtures, and people do up-weight code data quite a lot. But it depends on what the users are going to use the models for, right? Some people might use them for code, some people might use them for reasoning, but that's not the only task we care about. As you might see later on (in the next step we'll discuss this as well), people often use these models for creative tasks: they want to write a story, they want to generate a movie script, and so on, and I don't know if training on reasoning-only tasks would necessarily help with that. So, go ahead. Would you say there exists some data distribution which is high-value for creative tasks? Yes, I mean, it seems like a lot of people write stories and everything on the internet all the time, which is not code, and sometimes
[00:33:39] there's this idea of hallucinations as well in this field, but you can often think, hey, creativity might be a byproduct of hallucinations as well. So I don't know what exact data would lead to more creative models, but generally there's a lot of data, a lot of stories, written on the internet, which allows the model to be creative. Yeah, but I don't know if I have a specific answer to the question. Cool, so we discussed instruction fine-tuning. Very simple and very straightforward: there are no complicated algorithms here, just collect a lot of data, and then you can start leveraging performance at scale as well; as models become better, they also become more easily specifiable and more responsive to the task. We're going to discuss some limitations, and I think this is
[00:34:28] really important for understanding why we are going to optimize for human preferences. Cool, so we talked a bit about this: instruction fine-tuning is necessarily contingent on humans labeling the data. Now, it's expensive to collect this data, especially as the questions become more and more complex; if you want to answer questions which may be at physics-PhD level, or things to that effect, these become increasingly expensive to collect. So yeah, this is perhaps obvious: pre-training does not require any specific data, you scrape data off the web, but for instruction fine-tuning you probably need to recruit some people to write down answers to your instructions, so this can become very expensive very quickly. But there are more limitations to this as well, and we were just discussing this: there are
[00:35:22] open-ended tasks related to creativity that don't really have an exact correct answer to begin with, so how do you generate the "right" answer to that kind of question? And yeah, language modeling inherently penalizes all token-level mistakes equally; this is what supervised fine-tuning does as well, but often not all mistakes are the same. So this is an example where you're trying to do this prediction task, "Avatar is a fantasy TV show", and perhaps you can see that calling it an adventure TV show is perhaps okay, but calling it a musical may be a much worse mistake, yet both of these mistakes are penalized equally. And I think one general aspect which is becoming increasingly relevant is that the humans that you ask might not generate the right, or the highest-quality, answer. Your models are becoming increasingly competitive, and you want, in some sense
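To make the equal-penalty point concrete: cross-entropy on the gold token depends only on the probability the model assigned to it, not on how semantically bad the alternative it preferred was. The probabilities below are made up purely for illustration.

```python
import math

# Toy next-token distributions for "Avatar is a ___ TV show", where the
# gold token is "fantasy". Probabilities are invented for illustration.
prefers_adventure = {"fantasy": 0.25, "adventure": 0.60, "musical": 0.15}
prefers_musical   = {"fantasy": 0.25, "musical": 0.60, "adventure": 0.15}

# Cross-entropy charges -log p(gold token) in both cases. The model that
# prefers "adventure" (a mild mistake) and the one that prefers "musical"
# (a much worse one) receive exactly the same training loss, because the
# loss never looks at where the remaining probability mass went.
loss_mild = -math.log(prefers_adventure["fantasy"])
loss_bad  = -math.log(prefers_musical["fantasy"])
```

A reward signal over whole outputs, as discussed later in the lecture, is one way to tell these two failure modes apart.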
[00:36:19] you're going to be limited by how high-quality an answer humans can generate, but often I find that the models are generating better and better answers. So do we really want to keep relying on humans to write down the answers, or do we want to somehow go beyond that? So these are the three problems we have talked about with instruction fine-tuning. We made a lot of progress with it, but this is not how we got ChatGPT. One high-level problem here is that even when we are instruction fine-tuning, there is still a huge mismatch between the end goal, which is to optimize for human preferences and generate an output that a human might like, and what we're doing, which is still a prediction task where we're predicting the next token, just on a more curated data set. So there's still a bit of a mismatch going on here, and it's not exactly
[00:37:18] what we want to do. I'm going to take a second here to pause, because this is important for understanding the next section, and if there are any questions, feel free to ask. So is this step still taken as a first step, or do we discard it? It's a good question. I think this is still one of the more important steps that you take before the next step, but people are trying to remove this step altogether and jump directly to the next step, so there's work emerging on that. But yeah, this is still a very important step before we do the next one. Go ahead. Is problem two also present in pre-training, and if so, how do you avoid it, just by having a lot of data? Yeah, that's a great question. There's one major difference with pre-training: pre-training covers a lot more text. So, just
[00:38:18] for context, as we talked about, pre-training is roughly 15 trillion tokens, whereas supervised instruction fine-tuning might be somewhere on the order of millions to billions of tokens, so it's a few orders of magnitude lower. Typically you'd only see one answer for a specific instruction, but during pre-training you'll see multiple texts and multiple completions for the same kind of prompt. Now, that's good, because when you see multiple answers or completions during pre-training, you start to weigh different answers, you start to put probability mass on different kinds of answers or completions, but instruction fine-tuning might force you to put all the weight on only one answer. Does that make sense? But generally, yeah, this is a problem with both stages, you're right. Anything else? Cool. So, as this whole thing alludes to, we're going
[00:39:14] to start to attempt to satisfy human preferences directly. We're no longer going to get humans to generate some data and then do some kind of token-level prediction loss; we're going to try to optimize for human preferences directly, and that is the general field of RLHF, and that's the final step in typically getting a model like ChatGPT. So, we talked about how collecting demonstrations is expensive, and there's still a broad mismatch between the LM objective and human preferences, and now we're going to try and optimize for human preferences directly. So what does optimizing for human preferences even mean? To establish that concretely, let's go through a specific example: summarization. We want to train a model to be better at summarization, and we want to satisfy human preferences. So let's imagine
[00:40:05] that a human is able to prescribe a reward for a specific summary. Let's just pretend there is a reward function: you and I can assign, say, reward +1, reward -1, or something to that effect. Okay, so in this specific case we have this input x, a news article about an earthquake in San Francisco that we want to summarize, and let's pretend that we get these rewards and we want to optimize them. We get one summary, y1, "An earthquake hit..." and so on, and we assign it a reward of 8.0, and another summary which gets a reward of 1.2. Generally speaking, the objective we want to set up is something of the following form: we take our language model p_theta, which generates a completion y given an input x, and we want to maximize the expected reward R(x, y), where x is the input and y is the output summary in
[00:41:07] this specific task. And maybe, just to point out something really concrete here: this is different from everything we have done, in one very specific way. We are sampling from the model itself. In the bottom term, if you look, we're using y drawn from p_theta. Everywhere we've seen so far, the data is sampled from some other source, either during pre-training or in supervised fine-tuning, and we're maximizing the log-likelihood of those tokens; but now we're explicitly sampling from our model and optimizing a potentially non-differentiable objective. Cool, so broadly the RLHF pipeline looks something like this. The first step is still instruction tuning, something we have seen up until now: we take our pre-trained model, instruction-tune it on a large collection of tasks, and get something which starts responding to our desired intent or
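As a minimal sketch of that objective, max over theta of E_{y ~ p_theta(y|x)}[R(x, y)]: since y is sampled, one standard way around the non-differentiability (not spelled out in the lecture at this point) is the REINFORCE estimator, grad E[R] = E[R(x, y) * grad log p_theta(y|x)]. The toy policy below chooses between two candidate summaries with a single logit; the 8.0 / 1.2 rewards echo the slide's example, and everything else is invented.

```python
import math, random

random.seed(0)

# Toy policy over two candidate summaries, parameterized by a single logit:
# p(y1) = sigmoid(theta). Rewards echo the slide's example (8.0 vs 1.2);
# all numbers are illustrative, not from any real system.
rewards = [8.0, 1.2]

def p_y1(theta: float) -> float:
    return 1.0 / (1.0 + math.exp(-theta))

theta, lr, batch = 0.0, 0.01, 100
for _ in range(400):
    p1 = p_y1(theta)
    grad = 0.0
    for _ in range(batch):
        # Key difference from pre-training/SFT: y is sampled from the model
        # itself, and R(x, y) is a black box (no gradient flows through it).
        y = 0 if random.random() < p1 else 1
        dlogp = (1.0 - p1) if y == 0 else -p1  # d/dtheta of log p(y)
        grad += rewards[y] * dlogp             # REINFORCE term: R * grad log p
    theta += lr * grad / batch                 # gradient ascent on E[R]
```

After training, the policy puts most of its probability mass on the higher-reward summary. In real pipelines a learned reward model plays the role of `rewards`, and PPO-style machinery replaces this bare estimator.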
[00:42:01] not. But there are two more steps after this, which are typically followed in creating something like InstructGPT. The first is estimating some kind of reward model, something which tells us, given an instruction, how much a human would like this answer, or how much a human would hate this answer. We looked at something like this earlier, but I didn't talk about how we even get something like that; that's the second step. And then we take this reward model and optimize against it through the optimization I suggested earlier, maximizing the expected reward under your language model. We're going to go over the second and third steps in detail. So the first question we want to answer is: how do we even get a reward model for what humans are going to like? This is a very ill-defined problem, generally speaking. So there's
[00:42:52] two problems here that we're going to address. First, a human in the loop is expensive. Let's say I ask a model to generate an answer and then get a human to label it with some kind of score; if I'm doing this over millions of completions, that is not very scalable. I don't want to sit around and label millions of examples. So this one is easy: we're in a machine learning class, so what are we going to do? We're going to train something which predicts what a human would or would not like. This is essentially a machine learning problem where we take these reward scores and try to train a reward model to predict, given an input and an output, what the reward scores would look like. A simple machine-learning, regression-style problem; you might have seen this
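A toy version of that regression setup, under loud assumptions: a real reward model would be a pre-trained LM with a scalar head, but here a bag-of-words linear scorer stands in, and the two (input, output, score) triples are invented for illustration.

```python
# Toy regression-style reward model: train a scorer r(x, y) -> scalar on
# human-assigned scores with squared error. A real reward model would be a
# pre-trained LM with a scalar head; a bag-of-words linear model stands in
# here, purely for illustration.

def features(x: str, y: str):
    # The model never needs x and y separated: it scores the whole string.
    return (x + " " + y).lower().split()

# (input, output, human reward score) triples, invented for illustration.
train_data = [
    ("summarize the article", "an earthquake hit san francisco", 8.0),
    ("summarize the article", "the weather was pleasant today", 1.2),
]

weights = {}  # one learned weight per word
lr = 0.05
for _ in range(200):  # plain SGD on the squared error (pred - score)^2
    for x, y, score in train_data:
        feats = features(x, y)
        err = sum(weights.get(w, 0.0) for w in feats) - score
        for w in feats:
            weights[w] = weights.get(w, 0.0) - lr * err

def reward_model(x: str, y: str) -> float:
    return sum(weights.get(w, 0.0) for w in features(x, y))
```

In practice absolute scores are noisy across annotators, which is why real pipelines often train the reward model on comparisons between outputs rather than raw scores.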
[00:43:40] earlier. Cool, now there's a bigger problem here. And sorry, go ahead. So do we use, I don't know, just embeddings for that, or do we use a real language model? That's a good question. Generally, reward models still need to understand the text really well, so they're typically bigger models, and they're typically initialized from the language model that you pre-trained as well. So you typically start with the pre-trained language model, do some kind of prediction that we'll talk about, and it'll give you a score. If you're doing that, how do you separate x and y? How does the language model know which part...? It doesn't need to. It only sees x and y as an input, so it doesn't typically need to see them separated; it's just going to predict a score
at the end. Okay — yeah, the X and Y is more for notational convenience here, because for us X and Y are different: X is the question the user asked and Y is something the model generated, but you shove the whole thing in together, yes. [00:44:48] Cool. Now, this is the bigger problem here: human judgments are very noisy. We've talked about wanting to assign a score to a completion, and this is extremely non-trivial to do. If I give you a summary like this, what score are you going to assign on a scale of 10? If you ask me on different days I'll give a different answer, first of all, but across humans this number is not calibrated in any meaningful way — you could assign a 4.1 or a 6.6, and different humans would simply assign different scores. And there are ways to address this: you can calibrate humans, you can give them a specific
rubric, you can talk to them — but it's a very complicated process, and still there's a lot of room for judgment, which is not very nice for training a model like this: if your labels can vary a lot, it's just hard to predict. [00:45:40] So the way this is addressed is that instead of trying to predict the reward label directly, you set the problem up in a slightly different way. Something much easier for humans to do is to give them two answers — or maybe many answers — and ask them which one is better. This is where the idea of asking humans to rank answers comes in. If I give you a whole news article and ask you which summary is better, you might be able to give me a ranking: oh, this second summary is the worst, but the first one is better, and the third one is somewhere in the middle between those two. So you get a
ranking, which gives you a preference over summaries. [00:46:19] And hopefully you can see the idea that's important here: even when we have some kind of consistent utility function, it's much easier to compare something against an alternative and say which is better than it is to ascribe it an arbitrary number on a scale — and that's why the signal from something like this is a lot better. [00:46:41] Now, we said we get this kind of preference data, and we still need some kind of reward score out of it: we shove in our input, we shove in a summary as well, and we still need to get a score out — but it's not obvious how to take this data and convert it into that kind of score. [00:47:02] In come a pair of pretty good friends named Bradley and Terry. Essentially, there's
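In practice, a ranking like the one just described is usually broken down into pairwise (winner, loser) comparisons before training the reward model. A minimal sketch of that conversion, assuming the ranking is given best-first (a hypothetical helper, not code from the lecture):

```python
# Turn a human ranking of completions (best first) into pairwise
# (winner, loser) preference examples -- a hypothetical helper,
# not code from the lecture.
def ranking_to_pairs(ranked_completions):
    pairs = []
    for i, winner in enumerate(ranked_completions):
        for loser in ranked_completions[i + 1:]:
            pairs.append((winner, loser))
    return pairs

# Ranking from the news-summary example: first is best, last is worst.
ranking = ["summary 1", "summary 3", "summary 2"]
print(ranking_to_pairs(ranking))
```

Each pair then becomes one training example for the reward model, which is why a single ranking of k completions yields k·(k−1)/2 comparisons.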
a long line of work in economics and psychology that tries to model how humans make decisions in situations like this. The Bradley-Terry model essentially says that the probability that a human chooses answer y1 over y2 is based on the difference between the rewards that humans assign internally, with a sigmoid around it. If you have looked at binary classification before: the logit is simply the difference between the reward of y1 and the reward of y2 — the difference between the winning completion and the losing completion. [00:47:51] Is everybody with me to this point? [00:47:57] So the idea is that if you have a dataset of pairs with a winning completion y_w and a losing completion y_l, the winning completion should score higher than the losing completion. Go ahead. Sorry, what is J — is that a log? Or, sorry, what
— what is the type of J, this number here that we're getting as the expectation? Is it a log prob, or what is it? It's a log prob, so it will be a scalar at the end. Let's say you have a reward model which gives a score r1 to y_w and r2 to y_l: you subtract those to get another number, you put it into a sigmoid, and you get a probability, because the sigmoid converts a logit into a probability. Then you take the logarithm of that, and you take the expectation over everything, and you get this final number, which tells you how well your reward model is doing on the entire dataset. [00:48:58] So a good model of humans should behave like this: it would generally assign a higher reward to the winning completion and generally assign a lower reward to the losing
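The computation just walked through — subtract the two reward scores, pass the difference through a sigmoid, take the log, and average over the dataset — can be sketched numerically. The reward values below are made up for illustration:

```python
import math

def bradley_terry_log_likelihood(pairs):
    """Average log probability that the winner beats the loser,
    where P(winner preferred) = sigmoid(r_winner - r_loser)."""
    total = 0.0
    for r_winner, r_loser in pairs:
        logit = r_winner - r_loser                  # difference of reward scores
        total += -math.log(1.0 + math.exp(-logit))  # log sigmoid(logit)
    return total / len(pairs)

# Made-up reward-model scores for (winning, losing) completions.
good_model = [(2.0, -1.0), (1.5, 0.0)]  # winners scored higher
bad_model = [(-1.0, 2.0), (0.0, 1.5)]   # winners scored lower
print(bradley_terry_log_likelihood(good_model))  # close to 0: agrees with humans
print(bradley_terry_log_likelihood(bad_model))   # very negative: disagrees
```

Maximizing this quantity (equivalently, minimizing its negative) is what fits the reward model to the preference data.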
completion. [00:49:14] Cool — the math is just beginning, so hold on to your seats. [00:49:21] So now let's see where we are. We have a pre-trained model p_PT(y | x), and we've got this fancy reward model, which gives us a model of humans: it can tell us which answer they liked and which answer they did not like. Now, to do RLHF — we have discussed what this will look like — we'll copy our pre-trained or instruction-tuned model and optimize the parameters of that model, and I suggested that the objective we want to optimize is the expected reward when we sample completions from p_θ. And we're going to optimize against our learned reward model instead of the true reward that humans would have assigned. Do you guys see any problem with this? [00:50:11] Is there
something that's wrong here, or that might go wrong, if we do something along these lines? [00:50:22] Go for it. It might collapse? Yes, okay. Generally, at least from my intuition: if you're ever optimizing some learned metric, I'd be very careful, because typically our loss functions are very clearly defined, but here my reward model is learned — and when it's learned, it means it will have errors. It's going to be trained on some distribution, and it will generalize somewhat, but it will have errors, and when you're optimizing against a learned model, the policy will tend to hack the reward model. The reward model might erroneously assign a really high score to a really bad completion, and if your policy — your language model — learns to find those, it will completely hack it and start generating those gibberish completions. [00:51:15] So, just as a general machine-learning
tip as well: if you're optimizing a learned metric, be careful about what you're optimizing and make sure it's actually reliable. [00:51:27] And this is obviously not desirable — if you start optimizing this objective, you're going to converge to gibberish language models very, very quickly. So typically what people do is add some kind of penalty that keeps the model from drifting too far from its initialization. Why do we want that? If the model cannot drift too far from its initialization, we know the initialization is a decent language model, we know it is not yet satisfying this reward model too much, and we also know that the reward model is trained on a distribution of completions around where the initial model is. So typically, when we talk about training this
reward model, we trained it on completions sampled from this initial distribution, so we know the reward model will be somewhat reliable in that distribution. So we're simply going to add a penalty which says: you should not drift too far away from the initial distribution. [00:52:20] Just to go over this: we want to maximize an objective consisting of our learned reward model, minus this beta-log-ratio term, where the ratio is between the model we're optimizing, p_θ, and our initial model, p_PT. What this says is that if we assign a much higher probability to a certain completion than our pre-trained model does, we add an increasingly large penalty to it — you're simply paying a price for drifting too far from the initial distribution. If you have taken machine learning: the expectation of this quantity
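The penalized objective described here — learned reward minus a beta-scaled log ratio between the optimized model and the pre-trained model — can be sketched for a single completion. The probabilities and reward below are invented for illustration; real implementations typically work with per-token log probabilities:

```python
import math

def penalized_reward(reward, p_theta, p_pt, beta):
    """One completion's contribution to the RLHF objective:
    learned reward minus beta * log(p_theta / p_pt), the price paid
    for drifting away from the pre-trained distribution."""
    return reward - beta * math.log(p_theta / p_pt)

# Invented numbers: the policy upweights this completion 10x relative
# to the pre-trained model, so it pays a penalty of beta * log(10).
print(penalized_reward(reward=2.0, p_theta=0.5, p_pt=0.05, beta=0.1))
# No drift (p_theta == p_pt) means no penalty at all:
print(penalized_reward(reward=2.0, p_theta=0.05, p_pt=0.05, beta=0.1))  # 2.0
```

Averaging the log-ratio term over completions sampled from p_θ is what gives the KL divergence the lecture mentions next.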
is exactly the Kullback-Leibler (KL) divergence between p_θ and p_PT, so you're penalizing drift between the two distributions. Go ahead, question. Shouldn't you also add a penalty like this in the previous version, where you were fine-tuning — or is this only relevant for RLHF? That's a good question. I think people do add some kinds of regularization in fine-tuning, but it's not nearly as critical as when you're doing this with RL: there the incentive is to exploit the reward model as much as possible, and we'll see examples where the learned reward says the model is doing really well but under the true reward the completions are complete garbage. So it's much more important in this optimization. [00:53:47] Cool. [00:53:49] Now, this course does not assume background in reinforcement learning, so we're not going to go deep into reinforcement learning, but I just
want to give a very high-level intuition about how this works. Reinforcement learning is not just used for language models — it's been applied to several domains of interest: game-playing agents, robotics, developing chip designs, and so on. The interaction between RL and language models dates back to roughly 2016 as well, but it's been really successful recently, especially with the success of RLHF. [00:54:27] The general idea is that we're going to use the model we're optimizing to generate several completions for an instruction, we're going to compute the reward under our learned reward model, and then we're going to simply update our model to increase the probability of the high-reward completions. So when we sample from the model we'll see completions of varying quality — some good completions, good summaries for
our task, some bad summaries for our task — and we'll try to update our log probabilities such that, when you use the updated model, you're typically in the higher-reward region. [00:55:04] Does the high-level summary make sense? [00:55:10] Cool. And RLHF is incredibly successful — I think this is a very good example. This is the same summarization example, and the key point here is that, sure, performance improves by increasing the model size — we have seen this in many different examples — but what you can actually see is that even very small models can outperform human completions if you train them with RLHF. And this is exactly the result you see here: the reference summaries are human-generated, and when you ask humans which ones they prefer, they often prefer the model-generated summary over the human-generated summary,
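The loop just described — sample completions, score them with the reward model, and shift probability mass toward high-reward samples — can be illustrated with a toy softmax policy over three fixed candidate summaries, updated with a REINFORCE-style gradient. Everything here (the candidates, the rewards, the learning rate) is invented for illustration:

```python
import math
import random

# Toy sketch of the RLHF sampling loop: a softmax policy over three
# fixed candidate summaries, nudged via REINFORCE so that probability
# mass moves toward high-reward completions. All numbers are made up.
random.seed(0)
completions = ["good summary", "okay summary", "gibberish"]
rewards = [1.0, 0.3, -1.0]  # stand-in for a learned reward model
logits = [0.0, 0.0, 0.0]
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(3), weights=probs)[0]  # sample a completion
    # REINFORCE: grad of log prob w.r.t. logits is one-hot(i) - probs
    for j in range(3):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * rewards[i] * grad

probs = softmax(logits)
print(probs)  # "gibberish" should have lost most of its mass
```

Real RLHF pipelines use far more elaborate estimators (PPO, value baselines), but the direction of the update is the same: raise log probabilities in proportion to reward.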
and this is something you only observe with RLHF, even at small scales. And again, the same scaling phenomenon still holds here — bigger models do become more responsive — but RLHF by itself is very impactful here. [00:56:02] Cool. The problem with RLHF is that it's just incredibly complex. I gave you a very high-level summary, but there are whole courses on this for a reason. This image is not for you to understand — it's purely there to intimidate you. [00:56:21] You have to fit a value function to something, you have to sample from the model a lot, it can be sensitive to a lot of hyperparameters — there's a lot that goes on here. If you start implementing an RLHF pipeline it can be very hard, and this is the reason why a lot of RLHF was restricted to very high-compute, high-resource places and was not very accessible. So what we're going to talk
about and cover in this course is something called direct preference optimization, which is a much simpler alternative to RLHF and hopefully much more accessible. But please bear with me — there will be a lot of math here, but the end goal of the math is to come up with a very simple algorithm. And feel free to stop me and ask questions as you need. [00:57:10] In terms of, say, GPT-4 versus GPT-3: how much does the number of parameters in the base model help — does it reduce the number of examples from humans needed for RLHF to work well? Yeah, that's a really good question. Generally speaking, if you hold the dataset size constant and simply increase the model size, it will improve quite a lot, sure. But the nice thing is that you can reuse the data, and you can keep adding data
— yeah — as you keep scaling models up. So typically nobody tries to reduce the amount of data collection; you just keep increasing both things. [00:57:51] Cool. So we talked about RLHF, and the current pipeline is something like this: we train a reward model on the comparison data that we've seen so far, and then we start with our pre-trained or instruction-tuned model and convert it into an RLHF model using the reinforcement learning techniques. [00:58:09] Now, the really key idea in direct preference optimization is: what if we could simply write a reward model in terms of our language model itself? To understand intuitively what is going on: a language model assigns probabilities to whatever is the most plausible next completion, but those plausible completions might not be what we intended. But you could restrict the
probability to just the completions that a human might like, and then the log probabilities of your model would represent something the humans might like, not just some arbitrary completion from the internet. So there can be a direct correspondence between the log probability that a language model assigns and how much a human might like the answer. And this is not some arbitrary intuition that I'm trying to come up with — we will derive this mathematically. [00:59:00] So the general idea of direct preference optimization is going to be: we're going to write down the reward model in terms of our language model, and now that we can write our reward model in terms of our language model, we can simply fit our reward model directly to the preference data we have — and we don't need to do the RL step at
all. So we start off with some preference data, and we simply fit our reward model to it, which directly optimizes the language model parameters. [00:59:28] And maybe, at a high level, why is this even possible? We did this really cumbersome process of fitting a reward model and then optimizing against it, but in the whole process the only external information being added to the system was the human labels on the preference data. When we optimize a learned reward model, there is no new information being added into the system — and this is why something like this is even possible. For quite a few years this was not obvious, but as you will see, some of these results start to make sense. [01:00:00] So we're going to derive direct preference optimization. I'll be here after the class as well if you have questions, but hopefully this
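As a preview of where the derivation lands: for a single preference pair, the DPO loss is the negative log-sigmoid of the gap between implicit rewards, where each implicit reward is beta times the policy-versus-reference log-probability ratio. A sketch with made-up log probabilities, not code from the lecture:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair: -log sigmoid of the margin
    between implicit rewards beta * (policy logp - reference logp)
    of the winning and losing completions."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# Made-up log probabilities: the policy has shifted mass toward the
# winning completion relative to the reference, so the loss is small.
print(dpo_loss(logp_w=-2.0, logp_l=-5.0,
               ref_logp_w=-4.0, ref_logp_l=-4.0, beta=0.5))
```

Note the shape: it is exactly the Bradley-Terry log likelihood from earlier, with the reward-model score replaced by the language model's own log-probability ratio — which is the substitution the upcoming derivation justifies.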
is clear. [01:00:11] So yes, we discussed that we wanted to solve this expected-reward problem, where we want to maximize the expected reward but we subtract this term, the beta log ratio, which essentially penalizes the distance between where our current model is and where we started off, so we don't drift too far away from where we started. Now it turns out that for this specific problem, instead of doing an iterative routine, there's actually a closed-form solution. [01:00:45] The closed-form solution looks something like this. Again, if you have seen the Boltzmann distribution or something to that effect before, this is basically the same idea. The idea is this: we're going to take a pretrained distribution p_PT(y given x), and we're going to reweight the distribution by the reward. So if a completion has a very high
reward, it's going to have a [01:01:09] higher probability mass, and if it has a lower reward it's going to have a lower probability mass, and this is determined by the reward. Beta is a hyperparameter which essentially governs the trade-off between the reward model and the constraint: as beta becomes lower and lower, you start paying more and more attention to the reward model. [01:01:32] So the probabilities look something like this, and there's this really annoying term, Z(x). The reason it exists is that the numerator by itself is not normalized; it's not a probability distribution. To construct an actual probability distribution you have to normalize it, and Z(x) is simply that normalization. So if we write Z(x) out, it's the sum over all y: it sums over all completions y for a
given instruction, and that's exactly why this is very pesky: it's intractable. [01:02:02] If I take an instruction and try to sum over every possible completion, not just the syntactically correct ones but every single possible one, we have 50,000 tokens, maybe even more, and completions can be arbitrarily long, so this space is completely intractable. This quantity is not even easy to approximate. [01:02:21] So the main point here is that if you're given a reward model, there does at least exist a closed-form solution which tells us what the optimal policy, the optimal language model, will look like. But if you do a little bit of algebra, move some terms around, take a logarithm here or there (I promise this is not very complicated), you can actually express the reward model in terms of the language model itself, and I think this term is reasonably intuitive as well. What it says is that a
[01:02:51] completion y-hat has a high reward if the model, my optimal policy, assigns a higher probability to it relative to my initialized model, and this is scaled by beta. So the beta log ratio is what we're looking at here. And the partition function, let's just ignore it for now; it's intractable, but the beta log ratio is the key part here. [01:03:15] Is everyone following along? Awesome. Okay, so right now I'm talking about optimal policies, but really, every policy is probably optimal for some kind of reward, right? This is mathematically true as well. So the important bit here is that you can take your current policy and your initialized model and get some kind of reward model out of it, and this is the exact identity which leads to that: the reward model can be expressed in terms of your language model, barring the log-partition term.
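Written out, the closed-form solution and the rearrangement just described look like this (a reconstruction in the lecture's notation, with p_PT the pretrained distribution, r the reward, and beta the KL weight):

```latex
% Closed-form optimal policy for the KL-regularized reward objective
p^*(y \mid x) \;=\; \frac{1}{Z(x)}\, p_{\text{PT}}(y \mid x)\,
  \exp\!\big(r(x, y)/\beta\big),
\qquad
Z(x) \;=\; \sum_{y} p_{\text{PT}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)

% Take logs and rearrange to express the reward in terms of the policy:
r(x, y) \;=\; \beta \log \frac{p^*(y \mid x)}{p_{\text{PT}}(y \mid x)}
  \;+\; \beta \log Z(x)
```

The second line is the "beta log ratio plus log-partition" identity referred to below.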
We'll see what happens to that term. [01:03:55] Go ahead. "Sorry, I don't know how you got this. Why is it that we can swap? Because there is a thing that we're trying to optimize, and how did p-star turn into p?" Yeah, for now we're not optimizing any reward model. All I'm saying is that if I take my current language model, it probably represents some kind of reward model implicitly, because of this relationship, because this holds for every p-star and every reward model. What I'm saying is that if I plug in my current language model, it also represents some kind of reward model; I'm not saying it's optimal. [01:04:29] "Okay, but at the beginning p_RL is p_PT, and so we just get that the reward is basically zero, so what do we do initially?" It's zero, but we can optimize the parameters. Yeah, that's a good
observation that it's basically zero in the beginning. "But how do we start optimizing it?" I'll get to that. [01:04:50] Okay, any other questions? "So the idea is that, given the language model, you have a reward model such that it makes the language model optimal?" Yes, that's the next step. But the key idea is that my language model's probabilities already implicitly define a reward model; I think that's really the main point here, and this mathematical relationship is exact. [01:05:21] Cool. Now, I'm obviously ignoring the elephant in the room here, which is the partition function; it's not going to magically vanish. If this were just the beta log ratio, that would be really nice: I can compute all these quantities, I know how to compute the log probability under my language model, and I know how to compute the log probability under my
pretrained model, and I can compute the reward score and I can optimize this, but I don't know what to do about my log-partition function. [01:05:50] This is where something fun happens. Recall what the reward-modeling objective was when we started off: we started with our friend Bradley-Terry again, and what we really wanted to optimize was the reward difference between the winning completion and the losing completion. And really, that's all we care about; we don't care about the exact reward itself, what we care about is maximizing the difference between the winning and losing completions. That's actually really key here, because if you plug in the definition of RM_theta there, what you'll observe is that the partition function actually just cancels out. [01:06:32] Now why does it cancel out? The input is exactly the same; the x is
actually exactly the same in the difference, so the partition function Z(x) will just cancel out; it's the same in both terms. So what you get is that the reward difference between the winning and losing completion is the difference between the beta log ratios for the winning and losing completion. [01:06:56] You can plug in the terms and work it out; it's fairly simple. So the partition function, which was something we could not address, could not compute, actually just vanished. "Sorry, Z doesn't appear in the Bradley-Terry model..." But it appears here in this equation. "So how does it plug into the model?" We're going to take this equation, the last line that you see, and we're going to plug it in in place of RM_theta. "Okay, and the first loss equation?" Oh, I see, yeah: the first loss equation is the Bradley-Terry loss model. [01:07:37] Cool.
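To make the cancellation concrete, here is a minimal sketch of the resulting per-pair loss in plain Python; the function and argument names are hypothetical, and real implementations work on batched token log-probabilities from the two models, so treat this as illustrative only:

```python
import math

def dpo_loss(logp_win, logp_lose, logp_ref_win, logp_ref_lose, beta=0.1):
    """DPO loss for one preference pair (illustrative, not an official API).

    Each argument is the total log-probability of a completion under either
    the current model (logp_*) or the frozen pretrained/reference model
    (logp_ref_*). The intractable log Z(x) never shows up: it is the same
    for the winning and losing completion, so it cancels in the difference.
    """
    # Implicit rewards: beta * log(p_theta(y|x) / p_ref(y|x))
    reward_win = beta * (logp_win - logp_ref_win)
    reward_lose = beta * (logp_lose - logp_ref_lose)
    # Bradley-Terry negative log-likelihood of the observed preference:
    # -log sigmoid(margin) == log(1 + exp(-margin))
    margin = reward_win - reward_lose
    return math.log1p(math.exp(-margin))
```

Note that adding any constant to both reference log-probabilities, which is exactly the role a log Z(x) term would play, leaves the loss unchanged, since only the reward difference enters the Bradley-Terry likelihood.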
So this really is it. The key observation is that we could express our reward model in terms of the language model, and our problems with the partition function go away because we were optimizing the Bradley-Terry model. [01:07:51] What you get is something like this: we express the loss function directly in terms of our language model parameters theta, and we can directly optimize on our data without doing any RL steps at all. This is simply a binary classification problem: we're really just trying to classify whether an answer is good or bad, and that's really what we're doing. [01:08:16] Before I go on, do people want to absorb this? Is everyone okay with it? "I don't get where the win and lose labels come from. Are they human?" Good question. It's the same dataset we
started with in RLHF as well, but the way the process works is that you take a set of instructions, get the model to generate some answers, and then get humans to label which answer they prefer. So they're model-generated; they can be human-generated as well, but they're typically model-generated, and then you get some preference labels. All you need is a label saying which is the better answer. [01:08:58] "What do you lose here? You must be losing some information, because of the lack of any information about the partition function; you're canceling it out." Yeah, you are bound to lose information about other possible completions, which you would have taken into account in standard RLHF. That's a really good question. I don't think I'll be able to completely answer it in time, but the partition
function is almost a kind of free variable. [01:09:35] I think the point here is that there are many reward models that satisfy this optimization, so there's a free variable that you can actually completely remove, and that's what this optimization benefits from. Think of it this way: if I assign something a reward of plus one and something else a reward of minus one, that's basically the same as saying the rewards are plus 99 and plus 97; since only the difference matters, it will give you the same loss, right? So the scale doesn't matter; it's shift-invariant in a way. [01:10:08] "Isn't that somehow not what you want, though? If you're actually training a reward model, a reward of 99 means you should pay much less attention to that, as compared to one, or zero or something." What
we're assuming is that our choice model here is this: if a human prefers one thing over the other, the probability is governed only by the difference between the rewards. That's an assumption that every RLHF method makes, and DPO also makes. Now, is that assumption true? Not completely, but it holds to a fairly large degree. But that's a good question. [01:10:52] Cool, I'll move on in the rest of the time. The goal of this plot is to show that we actually get fairly performant models when we optimize with DPO. In this plot, the main thing you should look at is PPO, which is the typical RLHF pipeline. We're evaluating the models on summarization, comparing to human summaries, and what we find is that DPO and PPO do similarly, so you're really not losing much by just doing the DPO procedure instead of RLHF, and that's
really compelling, because DPO is simply a classification loss instead of a whole reinforcement learning procedure. [01:11:29] So I want to quickly summarize what we have seen thus far: we want to optimize for human preferences, and the way we do this, instead of relying on uncalibrated scores, is to get comparison data and feedback on that. We then use this ranking data either to do something like RLHF, where we first fit a reward model and optimize it using reinforcement learning, or to do something like direct preference optimization, where we simply take the dataset and run a classification loss on it. And there are trade-offs between these algorithms: when people have a lot of computational budget they typically go for RLHF or some routine like that, but if you're really looking to get the bang for your buck, you
might want to go for DPO, and that's probably going to work out of the box. [01:12:17] It's still an active area of research; people are still trying to understand how best to work with these algorithms, so I'm not making any strong claims here, but both of these algorithms are very effective, and DPO is just much simpler to work with. [01:12:32] Cool. So yeah, let's see: we went through all this instruction tuning and RLHF, and what do we get? InstructGPT is the first model which followed this pipeline; it defined this pipeline. So we got models which did 30,000 or so tasks. Remember when we were doing only one task? Now we have scaled up from 1,000 tasks to 30,000 different tasks, with many, many different examples. So that's where we are with InstructGPT, and it follows the pipeline we just described. In this case they're following
a specific RLHF pipeline, where they explicitly fit a reward model and then do some kind of reinforcement learning routine on top of it. [01:13:14] And the tasks collected from labelers look something like this; I'll leave it to your imagination, or you can look at the details. Where we started off was with completions like the ones we see from GPT-3, which, for "explain the moon landing to a six-year-old," is not really following the instructions, whereas InstructGPT will give you something meaningful: it's inferring what the user wanted from the specific instruction and converting that into a realistic answer that a user might like. [01:13:43] And these are just more examples of what an InstructGPT-like model would do, whereas your base model might not follow the instructions according to your desired intentions. And we went from InstructGPT
to ChatGPT, and it was essentially this same pipeline. [01:14:01] The key difference here is that it is still doing instruction tuning, but it is more optimized for dialogue, more optimized for interacting with users. So the core algorithmic techniques we discussed today are what give us ChatGPT, but you have to be really careful about the kind of data you're training on, and that's really the whole game. This is the foundation for ChatGPT, and it follows the same pipeline as well. [01:14:29] You might interact with ChatGPT (I'm sure you all have interacted with it in some form or other), and this is an example of what a ChatGPT interaction might look like: you want to make it Gen-Z. The idea here is that it's very good at responding to instructions and intent. This is not something that we could even few-
in very easily uh these are kind of [01:14:54] shot in very easily uh these are kind of instructions are hard to come examples [01:14:56] instructions are hard to come examples for but like this is probably not [01:14:58] for but like this is probably not something to trained on either but it's [01:14:59] something to trained on either but it's able to like infer the intent and [01:15:01] able to like infer the intent and generalize very very nicely and that's [01:15:03] generalize very very nicely and that's something I find personally very [01:15:07] something I find personally very remarkable cool and there's been a lot [01:15:10] remarkable cool and there's been a lot of progress on the open source front as [01:15:12] of progress on the open source front as well so like DPO is much simpler and [01:15:14] well so like DPO is much simpler and much more efficient and essentially all [01:15:16] much more efficient and essentially all the open source models these days are [01:15:18] the open source models these days are using DPO so this is a leaderboard that [01:15:21] using DPO so this is a leaderboard that is maintained by hugging hugging face a [01:15:23] is maintained by hugging hugging face a so like I mean N9 out of 10 more models [01:15:25] so like I mean N9 out of 10 more models here are trained with DPO so that's been [01:15:28] here are trained with DPO so that's been something that's been enabled the open [01:15:29] something that's been enabled the open source Community to instruction tune [01:15:31] source Community to instruction tune their model betters as well and same is [01:15:34] their model betters as well and same is being used in many production models now [01:15:36] being used in many production models now as well mistol is using DPO llama 3 used [01:15:38] as well mistol is using DPO llama 3 used DPO so these are very very strong models [01:15:41] DPO so these are very very strong models which are nearly gp4 level and they're [01:15:43] 
And they're also starting to use these algorithms as well. [01:15:48] And something that's very cool to see is this: we went through all this optimization and math and stuff, but what is really fundamentally changing in the behavior? I think this is a really good example. If you ask for an SFT output from an instruction-tuned model, you'll get something like this, but when you RLHF the model you actually get a lot more detail in your answer, and it'll probably organize the answer a little better. That's something that maybe humans prefer, which is why it's a property that is emerging in these models, but it's a very clear difference between simply instruction-tuned models and models which are [01:16:30] RLHF'd. So, yeah, we discussed this whole RLHF routine where we are directly modeling the preferences and we are generalizing beyond labeled data. We also discussed that RL can be very tricky to correctly implement, though DPO avoids some of these issues. [01:16:56] And we briefly touched upon the idea of reward models and reward hacking: when you're optimizing for a learned reward model, you will often see this example, where there's a way for the agent to just keep repetitively crashing the boat into objects to get more and more points, which wasn't the goal of the game. This is a very common example shown for reward hacking: if you do not specify rewards well, models can learn weird behaviors which are not your desired intent, and that's something a lot of people worry about as well. Part of the reason is that reinforcement learning is a very strong optimization algorithm.
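That reward-hacking failure mode boils down to something very simple. The sketch below is entirely invented (the policies, the reward numbers, the boat-game abstraction); it just shows that a pure maximizer of a mis-specified proxy reward picks the degenerate behavior over the intended one.

```python
def run_episode(policy, steps=100):
    """Toy boat-race episode: return (proxy_reward, finished_race).

    "race" heads straight for the finish line: small one-time bonus,
    race completed.  "loop" circles a respawning target forever: a
    little reward every step, race never completed.  All numbers are
    invented for illustration.
    """
    if policy == "race":
        return 10.0, True
    if policy == "loop":
        return 0.5 * steps, False
    raise ValueError(f"unknown policy: {policy}")

# A pure reward-maximizer compares only the proxy reward...
scores = {p: run_episode(p)[0] for p in ("race", "loop")}
hacked = max(scores, key=scores.get)
# ...and prefers endless target-looping (50.0 points) over actually
# finishing the race (10.0 points): the reward has been hacked.
```

A strong optimizer will find whatever maximizes the number you wrote down, not the intent behind it, which is exactly the worry raised in the lecture.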
[01:17:31] It's at the heart of AlphaGo and AlphaZero, which results in superhuman models, so you have to be careful about how you specify things. And the other thing is that even optimizing for human preferences is often not the right thing, because humans do not always like things which are in their best interest. So something that emerges is that they like authoritative and helpful answers, but they don't necessarily like truthful answers. One property that shows up is that they'll prefer authoritativeness over correctness, which is maybe not so nice. [01:18:04] Please go ahead.

On those lines, I'm curious whether ChatGPT being so widely used by the public will maybe change how the rewards are made, because I at least feel like now, when I go to ChatGPT and type something, it gives me five detailed paragraphs of information. Sometimes I'm just annoyed by that; that's not what I wanted. But maybe in the original reward function people actually preferred that, and now people prefer it less.

[01:18:29] Yeah, that's a great point, because as these models integrate more and more into our systems, they're going to collect more and more data, and they will pick up on things, maybe undesirable things as well. As far as I understand, ChatGPT is really cutting down on the verbosity, which is a huge issue that all of these models are trying to cut down on, and they are dealing with that. Part of the reason why that emerges is that when you collect preference data at scale, people are not necessarily reading the answers; the Turkers might just simply choose the longer answer, and that's a property that actually goes into these models. But hopefully these things will improve over time as [01:19:04] they get more feedback. And yeah, hallucination is not a problem that is going to go away with RL, and we talked a bit about reward hacking as well, and biases and so on. But what I want to conclude with is that we started with pretrained models, these things which could predict text, and we got ChatGPT, and hopefully it's a little more clear how we go from something like that to ChatGPT. And that's where I'll end. Thanks.

================================================================================
LECTURE 012
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 11 - Benchmarking by Yann Dubois
Source: https://www.youtube.com/watch?v=TO0CqzqiArM
---
Transcript

[00:00:05] Great, so I think let's get started, because we have a lot to cover today. My name is Yann; for those who don't know me, I'm a third-year PhD student advised by Tatsu and Percy, and today I'll be talking about benchmarking and evaluations.
[00:00:20] So benchmarking and evaluations are honestly something that I think not enough people look at in academia, but if you really want to put something in production, and you really care about, let's say, real-world machine learning, evaluation is really key. So let's talk about that.

An overview of what we'll talk about: first, different reasons for measuring performance; then text classification and how you measure performance there; then text generation and how you measure performance there; and finally, how you evaluate current large language models, and some issues and challenges with the ways that we actually perform evaluations.

[00:01:02] Okay, so my mental model of how you actually develop a machine learning model is that first you will be training your model. Here, measuring performance is really key, because you need a loss and you need to know how to optimize it. Then, once you are optimizing your loss, the second step is basically development. Usually this is hyperparameter tuning, or, for example, early stopping during training: if you see that your model is not performing that well, or that there's some overfitting happening, you might decide to stop, or you might decide to change the learning rate during the training of your model. So development is the second step, and here you need to measure performance because you need to know how to do hyperparameter tuning and how to change hyperparameters. Then the third step is essentially model selection: if I have a task that I really care about, which model performs best for my task? That might be a model that I have trained; it might be a model that another group has trained. And finally, at least in the real world, you would decide to deploy your model, and here measuring performance is really key because you need to know whether your model is good enough to put in production. In the parallel universe that we live in, there's also publishing, where you basically need to evaluate a model on standard benchmarks, and the reason we do that is essentially to communicate the quality of our model to different groups. So at every step of this pipeline you really need to measure performance, and that's what we'll talk about today. But what is key to understand is that at different steps you need to measure performance in different ways; there's really not a single ideal way of measuring performance.

[00:02:54] So, for example, on the left: when you train your model, you really need a way of measuring performance that is super fast, super cheap, and differentiable, because with neural networks you basically backpropagate through the loss, so it needs to be differentiable. And you really cannot have a way for your model to take shortcuts, optimizing the loss even though it's not really what you wanted to optimize. As you move more to the right, you measure performance less often, so it's fine if it's more expensive, but you really need your evaluation metrics to be higher quality, because the stakes if you put a model in production are higher.
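The "fast, cheap, differentiable" metric used during training is typically just the cross-entropy loss. A minimal sketch, with a toy vocabulary and made-up probabilities rather than any particular model:

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class: the standard fast,
    cheap, differentiable loss used while training a classifier or LM."""
    return -math.log(probs[target_index])

# Toy next-token distribution over a 4-word vocabulary (made-up numbers).
probs = [0.1, 0.7, 0.1, 0.1]
confident_correct = cross_entropy(probs, 1)  # 0.7 on the right token
confident_wrong = cross_entropy(probs, 0)    # only 0.1 on the right token
# The loss is much smaller when the model puts its mass on the right token.
```

Because it is a smooth function of the model's output probabilities, it can be backpropagated through on every batch, which is exactly the requirement described above.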
[00:03:45] During the development stage, you need a way of measuring performance that is fast, cheap, and also avoids shortcuts, because when you do hyperparameter tuning you're essentially also optimizing over a certain objective. Model selection can be a little bit less fast and less cheap, but you will still have to do it many, many times. And most importantly, when you deploy a model, you really want the way you evaluate performance to be trustworthy, because once you put something in production there's no way to go back and undo what happened during the time it was in production. You also want things to be very task-specific: if I care about a certain task when I put my model in production, you really need to evaluate on that specific task; I don't care about other tasks. And finally, you need your metrics to be absolute. The reason I'm highlighting that is that in the three other steps you really just care about comparing between things, which is very different from having a threshold that says: if I have less than 95% accuracy, I'm not putting my model in production.

[00:04:44] Okay, and now let's talk about publishing. This is a little bit different from, honestly, evaluation in the real world, but when you do academic benchmarking and evaluate your models on academic benchmarks, you want the benchmark to be reproducible and standardized. The reason is basically that for the next five or six or ten years, everyone will be evaluated on that one benchmark, and you want papers in three years to be comparable to yours, so it's really important that your evaluations are reproducible. Honestly, you don't really care about that in the real world. You also want things to be easy to work with, because researchers usually don't want to do additional work that they don't need to, and they usually don't have that many resources, so it needs to be fast and cheap. And finally, one thing I really want to highlight is that for the academic benchmarks we usually have, it's fine if the metrics we use are not perfect, because what really matters is the direction the metric shows you to go in over ten years, basically how the field is moving. If the metric says things are better over ten years, then in reality the field has made some progress. So at a meta level, it's fine to use crude metrics in academia.

[00:06:09] You also kind of need to balance between difficulty and simplicity. What I mean by that is: if your benchmark is way too complicated, then basically all methods will have essentially random performance, so no one will use your benchmark; and if your benchmark is too simple, then the baseline will be so good that no one will use your benchmark, because no one can beat the baseline. This is really something specific to academia; in the real world, you're not going to be able to change the task you're performing based on how good your model is. That's why I want to highlight this: usually people talk about evaluations, but there are really different ways of evaluating and different reasons why we evaluate. Does that all make sense? Also, feel free to ask questions.

Great. Okay, so benchmarks in academia: this is really the way we drive the field.
[00:07:05] So this is the MMLU benchmark. I think Archit briefly mentioned it, but I'll talk about it again later. This is the most standard benchmark right now, and you can see that in the last four-ish years it has gone from 25% accuracy, which is essentially random because it's multiple choice with four choices, to around 90-ish percent accuracy. So yeah, benchmarking is really what drives progress in the field, and again, you see what I meant here: it's not really the differences between small points that matter, at least in academia. You have to take a step back and think about how your models will perform over ten years, and make sure that the model on the top right here is better than the model on the bottom left, even if the benchmark is not perfect. And I think MMLU is a pretty good one in that sense.

[00:08:01] Okay, so there are two main types of tasks in NLP, at least classically. Close-ended tasks: I'll talk about them later, but essentially you can think about classification, where you know exactly the correct label for the task you're performing. Here this is the IMDB dataset, where you're asked to say whether a sentence has positive or negative sentiment. The text is "Read the book, forget the movie"; this is sentiment classification about the movie, so here it's basically negative. [00:08:36] And then there's open-ended evaluation. Think about ChatGPT: how do you evaluate something like that, where there's really no single correct answer, or there are many possible correct answers, and they all have different qualities? So we're going to distinguish between those two.
[00:08:54] Close-ended evaluation. As I just said, let's define a close-ended task as one where there's a limited number of potential answers, think less than ten, and often there's just one, or maybe a few, correct possible answers. This really is standard machine learning: if you think about standard classification, you can just use accuracy, you can look at your precision and your recall. There's nothing special here about NLP. That is not to say that it's simple; it's just that there's nothing special about NLP here.

[00:09:33] So, some close-ended tasks. I already told you about sentiment analysis; usually this is a binary classification task where you just have to say whether the sentiment is positive or negative. For sentiment analysis, the typical benchmarks (I always put them next to the task) are IMDB and SST from Stanford. Another task is entailment; the typical benchmark is SNLI, also from Stanford. You have some text, here "A soccer game with multiple males playing," and a hypothesis, "Some men are playing a sport," and you have to say whether the hypothesis is implied, or entailed, by the text; here it is. Other tasks: part of speech, with the Penn Treebank as the typical benchmark, and named entity recognition, which is a CoNLL benchmark.

[00:10:23] A few other tasks; you don't need to know all of them, but just to give you a brief overview. Coreference resolution is actually a pretty challenging NLP task where you have to say which pronoun refers to which noun. You have the sentence "Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful," and now you have to say what "he" refers to, for instance whether "he" refers to Pete. And then there's question answering, where you basically have a long text, the test asks a question, and you're supposed to provide an answer based on the text you were given. So those are some examples of close-ended tasks, and again, the key here is that the way we evaluate them is just standard machine learning: you can look at accuracy, precision, recall, F1 score. [00:11:18] Hopefully you all know about these kinds of metrics, but if you don't, you should look at Chris Potts's class, I think it's CS224U; his lectures are online and actually really good on different metrics.

[00:11:36] So the way people evaluate some of these benchmarks is usually by looking at many of them concurrently.
The most common multitask benchmark, I would say, is called SuperGLUE. Here on the columns you have all the different tasks in SuperGLUE, I think there are eight or nine, and then you really just look at the average performance across these benchmarks and you get a ranking on that. That is an attempt to measure general language capabilities. This is what people used to do, I would say, until maybe two years ago; I will tell you about what people do now around the end of the lecture, but yeah, SuperGLUE is definitely something you should at least be aware of. Examples of tasks in SuperGLUE: one is BoolQ, which is simply that you have some text, you have a question, and you have to say whether the answer is yes or no; that's very easy to evaluate, you just look at accuracy or precision and recall. Entailment we already talked about, and then the other ones include coreference resolution, which we also talked about, and word sense, where you have two sentences with the same word and you have to say whether it actually means the same thing in both sentences. For example, "bank" could mean a river bank or a money bank, and you have to say whether in these two sentences it refers to the same concept. And there are some question answering tasks too. So that's SuperGLUE. Are there any questions? No? Cool. So again, although I've said many times that this is essentially just classical machine learning, I want to emphasize that that doesn't mean it's simple, and you really have to think carefully about what you do when you use these types of close-ended tasks.
In particular, you're going to have to choose whether you look at accuracy, precision, recall, F1 score, ROC curves, AUC. If you don't know these names, you should really check out the scikit-learn documentation or the lecture from Chris Potts that I linked above; both are really good. But depending on which metric you choose, you will decide on very different types of algorithms. The usual example is spam: you want to classify whether an email is spam or not. Most emails are not spam, thankfully, at least I hope; so let's say that 90% of emails are not spam and only 10% of them are. If you look at accuracy, then a trivial classifier that always predicts the most likely label will get 90% accuracy, and if you don't really know your dataset, 90% accuracy seems good, but in reality here it means that you're not classifying anything. That's why you want to look at precision, recall, and F1. Anyway, I will not talk too much about that, because again this is not specific to NLP, but that doesn't mean it's easy. Another issue is that once you have multiple different tasks, there's the question of how you aggregate these metrics. Right before, I told you "oh, you just take the average" over all of these things; that honestly is a really terrible thing to do, but it's actually what people do. These columns actually mean very different things: some of them are accuracies, others are F1 scores, others are correlations, and you just average everything.
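The accuracy pitfall in that spam example is easy to verify numerically. Below is a toy sketch with made-up counts (90 ham, 10 spam), computing the metrics by hand; in practice you would use the functions in scikit-learn's `sklearn.metrics`:

```python
# Toy imbalanced "spam" dataset: 90 ham (0), 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A trivial classifier that always predicts the majority class (ham).
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision / recall / F1 for the positive (spam) class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, recall, f1)  # 0.9 0.0 0.0
```

So the classifier scores 90% accuracy while catching zero spam, which is exactly why precision, recall, and F1 are the metrics to watch on imbalanced data.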
I can't remember which benchmark it was, but a few years ago there was one benchmark where for one of the columns you actually had better performance if the value was lower, and people still took an average of these things, until someone realized that maybe we should put a minus sign there. So yeah, be careful, and don't always think that what people do in academia is correct; you should think a little bit about it. Then there are some other questions I want you to think about. Where do those labels come from? I said there is usually a real answer, but how you actually get those labels is unclear, so I will tell you about some issues on the next slide. Related to that, there might be some spurious correlations, and that's what we're going to talk about right now. We already talked about SNLI, so entailment: here you have again your premise, "the economy could still be better," and the hypothesis, "the economy has never been better," and you have to say whether the hypothesis is implied by the premise. What this paper from 2019 found is that all the different models were performing really well, but if you just classified based on the hypothesis, you could also perform really well. So even if you did not look at the premise, which seems like something you need to take into account because it's part of the task, you could perform well. The reason is that when the humans actually wrote the hypotheses, they were asked, "write a hypothesis which is not entailed by the premise," and the way humans usually do that is by adding a negation. So if you only look at the hypothesis and you see a negation, it's very likely that it's not entailed by the premise.
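The negation artifact is easy to illustrate. The following is a hypothetical toy heuristic, not the model from the paper: a "classifier" that never reads the premise and just looks for a negation word in the hypothesis.

```python
# Annotators asked to write non-entailed hypotheses often add a negation,
# so negation words alone become a spurious but predictive signal.
NEGATIONS = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only_guess(hypothesis: str) -> str:
    """Ignore the premise entirely; guess from the hypothesis alone."""
    tokens = hypothesis.lower().replace(".", "").split()
    return "not-entailed" if NEGATIONS & set(tokens) else "entailed"

examples = [  # (premise, hypothesis, gold label), from the lecture's examples
    ("A soccer game with multiple males playing.",
     "Some men are playing a sport.", "entailed"),
    ("The economy could still be better.",
     "The economy has never been better.", "not-entailed"),
]
correct = sum(hypothesis_only_guess(h) == gold for _, h, gold in examples)
print(correct, "/", len(examples))  # 2 / 2
```

A real hypothesis-only baseline trains a full classifier on the hypothesis text, but the mechanism it exploits is the same as this two-line heuristic.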
So again, even though this is standard machine learning, be really careful about what metric you use and where the labels come from, and don't just use what other people do, thinking that if there were an issue, people would have realized. So yeah, that is spurious correlations. Any questions on close-ended tasks? Cool. Okay, open-ended evaluations; I'm going to mostly talk about these because this is what is specific to NLP. An open-ended evaluation, or open-ended task, is essentially the opposite of a close-ended task, which is to say that there are many possible correct answers and you cannot enumerate all of them, so you really can't use standard machine learning metrics. Even more than the fact that you cannot enumerate all the possible answers, there are usually different levels of correctness. If I ask you to write a book, or if I ask ChatGPT to write a book, it might be a decent book, but there might be a better book that it could have written, or that another model could write; so it's not just right and wrong, it's a continuum. Standard examples of open-ended tasks: the two most common ones are summarization and translation. In summarization you have a long piece of text and you just ask for a summary in less than X characters. The standard benchmark is the CNN/Daily Mail benchmark. The way they actually collected that dataset is that they took a lot of CNN articles, and, you know, at the top of CNN articles you have bullet points that kind of say what the most important things in the article are, so they used those as essentially the gold summary. So that's the classic one for summarization. For translation you basically have sentences in two different languages and you have to translate from one to the other. Those are the classical ones. The way people currently do it, I would say the most standard task right now, is instruction following. Instruction following is kind of the mother of all tasks, in the sense that you can view any previous task as just a chatbot, some question that you ask, basically, ChatGPT. Classification? I could just ask ChatGPT to do that. Summarization? I could ask ChatGPT to do that. So essentially you could just view a chatbot as the most general type of task: you can ask it to perform any possible task and it should just provide the answer for that task. This is what we call instruction following. As you might think, evaluation is very hard in that domain.
And that's what we'll talk about later: how do you evaluate something like ChatGPT? Okay, so, types of evaluation methods for text generation, or open-ended tasks. The classical ones are content overlap metrics, which I'll talk about first; that's really comparing just the words between a reference answer, a gold answer that humans wrote, and the actual generation that you got from your model. Then there are model-based metrics, where you basically turn evaluation into machine learning: you train a model to become an evaluator. And then there's human evaluation, which is usually seen as the gold standard for open-ended tasks. So, content overlap metrics. As I just said, this is really just comparing word by word, or group of words by group of words, between the generated sequence and some reference. Here I have the generated sequence, "the woman went to the hardware store," and the gold reference, the reference written by humans (I actually don't even know what the task is), which is "they walked to the grocery store." What you do is just compare the two sentences by looking at the lexical similarity between those two texts. This is super fast and efficient, and the way you usually do it is by using n-gram overlap metrics. What I mean by this is that the simplest possible thing is just to check, for every word in the generated sequence, whether it appears in the reference sequence, and if it does, you increment your score. N-grams are essentially the same thing, but instead of looking at a single word, you look at bigrams, trigrams, multiple words
next to one another. The usual overlap metrics, the most common ones, are BLEU and ROUGE. "Bleu" means blue in French and "rouge" means red; that's not what they stand for, though, and I always forget what they do stand for. Basically, BLEU is an n-gram overlap metric that tries to look at precision, while ROUGE is one that looks at recall. As I alluded to before, what's important is that even if you turn everything into a kind of sentence comparison, you still have to think about whether you care about precision or recall. These metrics are not ideal, but until, I would say, two years ago, they were the gold standard for translation and summarization. For translation, people use BLEU. Let's say I'm translating from French to English: I want to look at the generated sequence in English and the actual reference sequence in English, and I want to know how many of the bigrams I generated appear in the reference sequence. There's one additional thing, which is that BLEU doesn't only look at precision, because you could get very high precision by generating something very short. For example, if you only ever generated the word "the," you would most likely get very high precision, because "the" appears in nearly every sentence; or, say, a full stop. So there's also a length penalty, the brevity penalty. ROUGE is kind of the opposite: it just looks at recall. So those are the common content overlap metrics. Next, let me illustrate why they are not ideal.
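As a sketch of the core idea, here is a toy, single-sentence BLEU-style score. The real BLEU combines clipped precisions for several n-gram orders with a geometric mean at the corpus level; `toy_bleu` is a hypothetical simplification that keeps just one n-gram order plus the brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, n=2):
    """Clipped n-gram precision times a brevity penalty (toy version)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clipping stops a candidate from earning credit by repeating one n-gram.
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    precision = matches / total
    # Brevity penalty: short candidates can't win on precision alone.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(toy_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(toy_bleu("the", "the cat sat on the mat", n=1))  # tiny, thanks to BP
```

A ROUGE-style recall analogue would divide the match count by the number of reference n-grams instead of candidate n-grams.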
They have many issues, but one of them is that they don't really take into account the semantic relatedness between words. So imagine that Chris asks you, "are you enjoying the CS224N lectures?" Of course the gold answer is "heck yes!", so that's the reference answer. Now let's say that the model just generates "yes!". If I look at the BLEU score, I will get about 67%, essentially because two of the unigrams I generated are in the gold reference. If I generate "you know it!", then only a single token in the generated sequence appears in the reference sequence, namely the exclamation point, and I get a much lower BLEU score. And if I just say "yep!", then that doesn't appear at all in the reference sequence, so I get a BLEU score of zero, which is a false negative, because it literally means the same thing as "heck yes." So hopefully you see that these metrics really have issues. You can also have false positives: for example, if you say "heck no!", then most of the words are the same, so you get about a 67% BLEU score, but it really means something completely different. Does that make sense? Any questions? Cool. So very naturally, now that you know everything about word embeddings, what you might ask is: why do we look at words, when we could look at learned representations, which actually maintain the semantic similarity between words? This is exactly what people have done, around 2019 I think, or even before, actually 2016: they took some word embeddings, and they associated every word
in the reference sequence with a word embedding, and every word in the generated sequence with the corresponding word embedding, and they basically started comparing the word embeddings. A very simple way of comparing word embeddings is to take the average of the word embeddings in the reference sequence and the average of the word embeddings in the generated sequence, and then maybe look at cosine similarity. There are more modern ways of doing it, but honestly at this point it's not that important; you can just think about averaging. Another point, as you know by now, is that word embeddings don't really take into account the context in which the word appears, so a better way of getting good representations for words is by looking at BERT.
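A minimal sketch of that averaging idea, with made-up 3-dimensional vectors standing in for real word2vec or GloVe embeddings (the numbers here are invented purely for illustration):

```python
import math

# Hypothetical toy embeddings; a real system would load word2vec/GloVe vectors.
EMB = {
    "heck": [0.1, 0.9, 0.0],
    "yes":  [0.8, 0.2, 0.1],
    "yep":  [0.7, 0.3, 0.2],  # deliberately close to "yes"
}

def avg_embedding(tokens):
    """Average the per-word vectors into one sentence vector."""
    dims = len(next(iter(EMB.values())))
    vecs = [EMB[t] for t in tokens]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

ref = avg_embedding(["heck", "yes"])
gen = avg_embedding(["yep"])
print(round(cosine(ref, gen), 3))  # high similarity, despite zero word overlap
```

Unlike BLEU, this gives "yep" a high score against "heck yes" because the vectors are close, which is exactly the semantic relatedness that n-gram overlap misses.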
[00:26:39] You take a BERT model, pass the generated sequence through it, and get some embeddings; then you take BERT again, the same BERT, pass the reference sequence through it, and get some other embeddings, and then you again do some comparison. BERTScore, a pretty famous paper, does a smarter comparison, but it's not that important to understand exactly what they do; what is important is that they do some smart averaging over those words. Cool, any questions? Okay, so that was the simplest type of method: word matching. Another, slightly more complicated one is called BLEURT, also pretty famous, which is a mix between BLEU and BERT. The way they did it is that they took a pre-trained BERT and then did some continual pre-training.
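The flavor of BERTScore's comparison can be sketched as greedy token matching over token vectors. This is a simplification: real BERTScore uses contextual BERT embeddings and optional idf weighting, and the vectors here are toy values:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(cand_vecs, ref_vecs):
    """Match every candidate token to its most similar reference token
    (precision) and every reference token to its most similar candidate
    token (recall), then combine the two averages into an F1 score."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

ref = [[1.0, 0.0], [0.8, 0.6]]             # token vectors of the reference
print(greedy_match_f1(ref, ref))           # 1.0 for identical sequences
print(greedy_match_f1([[0.0, 1.0]], ref))  # lower for a diverging candidate
```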
[00:27:43] The continual pre-training tries to predict the BLEU score and some other metrics, and then they fine-tune. That's the important part: they fine-tune the pre-trained model to actually do the evaluation they care about. So let's say I have a lot of different sequences and some human annotations of how they should be evaluated; I can just treat that as a normal machine learning task and fine-tune my BERT to do the evaluation. So this is BLEURT. Any questions? [Student] I'm curious: if you pre-train on BLEU, wouldn't that cause the same problems? If your pre-training task is BLEU, how would the model learn to judge language semantically in the first place? Yeah, that's a very good point, and actually I also find it kind of surprising. They did two things: first they do the real pre-training of BERT, and then they do the continual
[00:28:38] pre-training for predicting BLEU. The reason is that they usually have a lot of sequences in their dataset that are unlabeled: there are some reference sequences and some generated sequences, but no human annotation of whether each one is good or bad, so they treat it as an unsupervised learning objective. So what do you use as the supervised signal? Well, you have to use something, and they basically use BLEU, and they actually also use BERTScore; they use many different tasks and basically do multitask learning. Cool. Okay, so one important issue with all of these methods is that they can only be as good as the references, and in reality the references are usually not that good. This is a paper that looks at summarization of news, so basically, as
[00:29:36] I said before, most news summarization benchmarks usually take the reference summary to be the bullet points you find at the top of an article, and those are usually not that good. What you see on the left is the correlation between, on the x-axis, the human-evaluated performance of every model, and, on the y-axis, ROUGE-L, which is just a variant of ROUGE. You look at whether these two are correlated, and what you see is that they are essentially not correlated, which means that ROUGE-L computed against standard references really does not track what humans would say is a good summary. That is not to say that ROUGE is a bad score; it is to say that the references are bad.
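The correlation being plotted is just a statistic over (human score, metric score) pairs, one pair per model. A small sketch with invented per-model numbers:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five systems: human ratings vs. a ROUGE-like metric.
human  = [3.1, 3.4, 2.8, 4.0, 3.7]
metric = [0.21, 0.19, 0.23, 0.22, 0.20]
print(pearson(human, metric))  # about -0.3: the metric barely tracks the humans
```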
[00:30:37] Because if you look at the exact same thing, but now you ask experts to write very good summaries, then you see that the correlation actually increases by a decent amount. Still not perfect, ROUGE is definitely not perfect, but at least it's much better. So this is to say that the metric itself is not always perfect, and on top of that the references are usually not great. Cool. So that begs a very natural question, which is: can we just move away from reference-based evaluation? As we just said, reference-based evaluations are the ones that compare human-written references to model outputs using various metrics, and those used to be the standard benchmarks for evaluating NLP tasks, I would say up to two or three years ago. Right now, I think papers still have to
[00:31:36] always show BLEU scores, for example in translation, because reviewers want those, but I don't think anyone in the real world actually uses them, though I might be wrong on that. So yeah: BLEU, ROUGE, BERTScore. Oh, and I was mostly talking about BLEU and ROUGE; BERTScore is actually still decently used, and actually pretty good. Okay, so reference-free evaluation: this is where you have a model and you ask it to give a score, but there are no human references. The way this used to be done is essentially by taking a model like BERT again, but instead of comparing a reference answer and the generated answer, you just ask it to take the input and predict the score directly. That's one simple way of doing it, and it used to really not work well. I say "used to" because
[00:32:28] until basically ChatGPT and GPT-4, it didn't. Now what people do, and honestly it works super well, is just ask GPT-4 to do the same task you would ask a human: you give it a very long text, then you give it the generated summary, and you ask how good the summary is, essentially, and that works surprisingly well. Common benchmarks here are AlpacaEval and MT-Bench; there are many others now, and honestly most people are starting to use these kinds of techniques, but we'll be talking at least about AlpacaEval. Good. Okay, so let's talk a little bit about human evaluation before looping back to GPT-4. As we saw, the metrics so far all have some shortcomings, and they are definitely not as good as asking humans directly, because they are based on references. So human evaluation is really the gold standard.
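Mechanically, the GPT-4-as-judge setup described above is just a prompt plus a parse of the reply. A sketch of what such a prompt could look like; the template wording is invented for illustration (AlpacaEval and MT-Bench ship their own carefully tuned judge prompts):

```python
# Hypothetical judge-prompt template; real benchmarks tune this wording heavily.
JUDGE_TEMPLATE = """You are evaluating a summary of the following article.

Article:
{article}

Summary:
{summary}

Rate the summary from 1 (unusable) to 10 (excellent), considering
faithfulness, coverage, and fluency. Answer with a single integer."""

def build_judge_prompt(article: str, summary: str) -> str:
    return JUDGE_TEMPLATE.format(article=article, summary=summary)

prompt = build_judge_prompt("Storms hit the coast on Tuesday ...",
                            "Coastal storms struck on Tuesday.")
# `prompt` would be sent to a strong model (e.g. GPT-4) through its API,
# and the integer in the reply parsed as the score.
print(prompt.splitlines()[0])
```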
standard for open-ended [00:33:26] really the goal standard for open-ended uh open ended tasks um and not only is [00:33:30] uh open ended tasks um and not only is it really the the standard way of doing [00:33:34] it really the the standard way of doing evaluation or like the goal standard for [00:33:35] evaluation or like the goal standard for evaluation it's also the gold standard [00:33:37] evaluation it's also the gold standard for developing new automatic evaluations [00:33:40] for developing new automatic evaluations so every time you you develop a new [00:33:43] so every time you you develop a new automatic evaluations you will want to [00:33:44] automatic evaluations you will want to compare to uh what humans would have [00:33:48] compare to uh what humans would have basically uh predicted [00:33:51] basically uh predicted um [00:33:54] yeah okay so doing human evaluation i' [00:33:57] yeah okay so doing human evaluation i' first it might seem very simple you [00:33:59] first it might seem very simple you basically ask humans to evaluate the [00:34:01] basically ask humans to evaluate the quality of some generated text seems [00:34:03] quality of some generated text seems simple right uh but actually it's super [00:34:06] simple right uh but actually it's super complicated and it's a it's a real like [00:34:08] complicated and it's a it's a real like Challenge and it has many issues so [00:34:10] Challenge and it has many issues so first um oh sorry I I'll talk about that [00:34:13] first um oh sorry I I'll talk about that before maybe one additional thing is [00:34:15] before maybe one additional thing is that you should not only ask the human [00:34:17] that you should not only ask the human you usually ask it also to um ask them [00:34:20] you usually ask it also to um ask them to evaluate across different axes for [00:34:23] to evaluate across different axes for example the fluency of the text or the [00:34:24] example the fluency of the text or the 
[00:34:26] coherence of the text, or common sense, or the style, grammaticality, redundancy: whatever axes you might care about. Another thing to note is that you should absolutely never compare across different human evaluations. If one paper says humans rated the fluency of its text at, I don't know, four out of five, and another paper reports three out of five, they used different humans and different ways of prompting those humans, so the numbers are absolutely not comparable. Okay, so let's go back to some of the issues. As I said, human judgment is regarded as the gold standard, but it definitely has issues. First, it's super slow: as you might expect, humans are definitely not as fast as automatic metrics. Second, at least in academia, it's still pretty expensive, because when
[00:35:25] you pay your workers well, it's pretty expensive to do human evaluation properly. Another issue is inter-annotator disagreement: if I take two random people in this room and ask them to evaluate the quality of a generated text, I can assure you that they will really not agree. This is especially bad if the task is subjective, but even if you first talk for an hour about how the generations should be evaluated, I can almost guarantee you will still disagree on many of the evaluations. To give you an example: when we were doing AlpacaFarm last year, we basically had to take some inputs and two models, think ChatGPT, Alpaca, these types of models, have the two models each predict an answer, and then ask the humans to say which answer they prefer.
[00:36:24] That is a very simple task, and, as I will discuss later, it is what a lot of people basically use right now for evaluating models like ChatGPT. So a natural question is whether humans are good at doing that. What we saw is this: we were five researchers doing the labeling; the five of us talked for two or three hours and wrote extremely detailed rubrics for how to do the evaluations, and still, labeling independently, we only agreed 67% of the time, where 50% is random. And we were really trying to do our best; we were the ones working on this project, so it's not as if we were rushing through it. So people really do disagree. Of course, if you then allow discussions between the annotators, agreement actually improves.
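Raw agreement numbers like that 67% are easier to interpret once corrected for chance. A small sketch with invented preference labels; for a balanced two-way choice, chance agreement is 50%, so 67% raw agreement corresponds to a Cohen's kappa of only about 0.34:

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: 0 means chance level, 1 means perfect."""
    po = raw_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    pe = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / n ** 2
    return (po - pe) / (1 - pe)

# Invented "which answer do you prefer" labels from two annotators.
ann1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
ann2 = ["A", "B", "B", "A", "B", "A", "A", "B"]
print(raw_agreement(ann1, ann2), cohens_kappa(ann1, ann2))  # 0.75 0.5
```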
[00:37:20] But then it becomes even slower and more expensive. Then there is intra-annotator disagreement, which is extremely annoying: if I ask myself to evaluate something right now, versus in three hours, after I've had dinner or gone for a run, I will actually give different annotations. [Student question] You mean for validating? Yeah, that's a very good question, and honestly there's no good answer. The usual way people do it is to look at some statistical test: you say, okay, I want to compare these two models, I'm going to perform a t-test, and I want to know that my p-value is below a certain threshold. What people also usually do when they have human annotations (I unfortunately didn't put a slide on this) is that they have a metric for computing agreement.
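As a concrete stand-in for that significance check, here is a paired bootstrap test, a common nonparametric alternative to the t-test mentioned above (the per-example scores are invented):

```python
import random

def paired_bootstrap_pvalue(baseline, challenger, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: resample evaluation examples with
    replacement and count how often the challenger fails to beat the
    baseline. Small values mean the improvement is reliable."""
    rng = random.Random(seed)
    n = len(baseline)
    failures = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(challenger[i] - baseline[i] for i in idx) <= 0:
            failures += 1
    return failures / n_resamples

# Invented per-example scores for two models on the same eval set.
model_a = [0.61, 0.55, 0.58, 0.60, 0.57, 0.59, 0.62, 0.56]
model_b = [0.66, 0.60, 0.61, 0.67, 0.59, 0.65, 0.68, 0.62]
print(paired_bootstrap_pvalue(model_a, model_b))  # 0.0: B beats A on every example
```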
[00:38:15] They compute the inter-annotator agreement and try to achieve a certain level, and if they don't reach it, they essentially ask for more annotators or for re-labelings. Yeah, and human evaluation is not reproducible, partly, in fact mostly, because of the two things we just said. There is an interesting paper on this, from 2021 I think, though I'm not sure, where they say, and I am reading from the abstract here: "just 5% of human evaluations are repeatable in the sense that there are no prohibitive barriers to repetition and sufficient information about experimental design is publicly available for rerunning them". This is a paper that analyzed, I think, 128 different papers that were published
[00:39:12] across about five years, I think between 2015 and 2020, and they found that essentially only 5% of those papers were reproducible. So honestly, working with humans is hard; that's definitely something to remember. Another issue is that humans only evaluate precision, not recall. What I mean by that is: if you show me what the model generated, I can only evaluate that particular generation; I cannot evaluate all the other possible generations the model could have produced, because then you would really have to sample a lot of outputs, and that becomes way too slow and way too expensive. And finally, the incentives are usually not aligned: what you want is for the humans to do the best possible evaluations, but what crowd workers usually want is to maximize the amount of money they get paid per hour. So, to give
[00:40:05] you again a concrete example: when we were doing AlpacaFarm, I think we were paying relatively well, in the sense that we paid 1.5 times the minimum wage in California. We looked at how much time we ourselves would need to evaluate a single example as well as we could, and then we divided by that time to work out how much we would pay per example. What we realized is that the workers ended up being paid, I think, 2 or 2.5 times the minimum wage, because they were simply doing things two or three times faster than us. I mean, we could just be slow, but I think what was happening is that they were trying to maximize the dollars they earned per hour, and as a result they were finding shortcuts in their evaluations, and this is
[00:40:57] something that you really see across papers. For example, in our case, you saw that humans really preferred longer answers, and of course, if you give me two very long generations and ask me, with a minimal amount of work, to say which one is better, when I see the longer one I think: ah, there are probably more details, it's probably better. Anyway, that's not to say that everyone is like that, but the incentives are definitely misaligned, so you have to be careful about this. Other challenges: first, you have to decide how to describe the task; you really have to give very detailed rubrics for how the humans should evaluate it. Then there's the question of how you show the task to the humans: for example, the order in which you present examples is actually really important. In our case, because we had two examples side by side,
[00:41:42] which one is on the left and which one is on the right is actually also very important. All of these things really matter; of course, you can randomize them away, but it adds challenges. Then there is the question of what metrics to use, though that is not specific to humans. Selecting the annotators is also very complicated. You might think: okay, I have some money now, I can go on Amazon Mechanical Turk and just ask workers to do some annotations. But in reality you want to have the good annotators, so how it usually works on Mechanical Turk is that you say, here's a task, I want 30 different people to do these annotations; they start annotating, and if someone doesn't achieve the level you want, you pay them for what they annotated until then, and you work
you you work with someone else afterwards uh so then [00:42:32] with someone else afterwards uh so then there's a question of how do you decide [00:42:34] there's a question of how do you decide whether they achieved the performance [00:42:35] whether they achieved the performance that you want uh so you probably have to [00:42:38] that you want uh so you probably have to do like some gold labeling before and [00:42:39] do like some gold labeling before and then look at like some accuracies of how [00:42:42] then look at like some accuracies of how well and like some intra anator [00:42:43] well and like some intra anator agreement with you and with like the [00:42:45] agreement with you and with like the other researchers on your team uh so it [00:42:47] other researchers on your team uh so it is very [00:42:48] is very complicated and not only this you have [00:42:50] complicated and not only this you have to monitor that over time um so there [00:42:53] to monitor that over time um so there are many different ways you can monitor [00:42:54] are many different ways you can monitor that over time looking again at the [00:42:56] that over time looking again at the accuracy so maybe every let's say a [00:42:59] accuracy so maybe every let's say a typical thing is that every batch of [00:43:00] typical thing is that every batch of example that you label you give a few [00:43:03] example that you label you give a few you give a few examples that are [00:43:04] you give a few examples that are actually uh ones that you already know [00:43:06] actually uh ones that you already know what the Gold Label is and you see how [00:43:08] what the Gold Label is and you see how well they're performing on that another [00:43:11] well they're performing on that another way to look at is like the time that [00:43:12] way to look at is like the time that they take to annotate um [00:43:16] they take to annotate um yeah okay so that was about humans uh so [00:43:20] yeah okay so that 
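The gold-label monitoring idea above can be sketched in a few lines. This is not the lecturer's actual tooling, just a minimal illustration: seed each batch with items whose correct label is already known, then score the annotator against them.

```python
# Sketch: monitor annotator quality by seeding batches with items
# whose gold label is already known (all data here is made up).

def gold_accuracy(annotations, gold):
    """annotations: dict item_id -> label from the annotator;
    gold: dict item_id -> known correct label (the seeded items)."""
    checked = [item for item in gold if item in annotations]
    if not checked:
        return None  # no seeded items in this batch
    correct = sum(annotations[item] == gold[item] for item in checked)
    return correct / len(checked)

# Example: annotator labeled 5 items, 3 of which were seeded gold checks.
gold = {"q1": "A", "q7": "B", "q9": "A"}
annotations = {"q1": "A", "q2": "B", "q7": "B", "q8": "A", "q9": "B"}
acc = gold_accuracy(annotations, gold)  # 2 of 3 gold items correct
```

Tracking this accuracy per batch, alongside annotation time, gives the over-time monitoring described above.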
[00:43:20] Human evaluation is hard, but it is the gold standard. Okay, now let's talk about reference-free evaluation and chatbots. I already told you about it before very briefly: how do you evaluate something like ChatGPT? This is extremely complicated, because basically you could ask it any task you want, and it can answer with text that is arbitrarily long, and that just makes evaluation extremely hard. So as I suggested before, the usual way it's done is that you take two models, you put them side by side, you ask the same question, and you just ask either some humans, or some model as we will see afterwards, which one is better. The most common benchmark right now for human evaluation, I would say, is called Chatbot Arena, where basically anyone can go online and play for free with some of the best models out there, and all they ask you is to say whether you prefer the one on the right or the one on the left, essentially. And then once they reach, I think, a crazy amount of data, 200,000 human votes for example, they basically add it to a leaderboard. The way they add it to the leaderboard is that, I don't know if you know how chess works, but they basically look at Elo ratings. They treat everything as if it were a tournament, such that not every model has to play against every other model, and then they get Elo scores. So what's missing with this side-by-side human eval? As I said, this is really the gold standard for evaluation of chat LLMs, but there are still some challenges. First, it's basically random people online asking random questions and providing their preferences.
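The Elo mechanism mentioned above works like this: each model has a rating, each pairwise vote is treated like a chess game, and the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch (the starting rating and the K-factor here are illustrative, not Chatbot Arena's exact constants):

```python
# Sketch of the Elo update used by tournament-style leaderboards
# such as Chatbot Arena (constants are illustrative).

def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # Winner gains and loser loses the same amount.
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one head-to-head vote.
ra, rb = elo_update(1000, 1000, a_won=True)  # ra -> 1016.0, rb -> 984.0
```

Because ratings are updated from whatever pairings happen to occur, not every model needs to play every other model, which is exactly the property the lecture points out.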
[00:45:05] That may not be representative, although arguably, when you have that many examples, it becomes actually pretty representative of what people would want. So it's probably better than whatever else we have, but it is still not ideal. And then really the big issue is cost. This takes a huge community effort and a lot of people to work on. It also takes a lot of time to get new models onto the benchmark, and only the notable models, think the OpenAI models, the Claude models, the Google ones and the Facebook ones, are going to be benchmarked. You will never have, for your random model, 200,000 people who are willing to annotate it for free. So this is an issue, and again, as we talked about on the first slide, even those big companies can definitely not do that during development of their model; this is something that comes at the end, for maybe model selection. Okay, so how do we make it faster? One very natural solution is basically to ask a large language model to do the evaluation for you. So imagine that I want to compare ChatGPT with Mistral: I basically ask GPT-4 to evaluate which one is better. This is surprisingly good, and I will show you some results afterwards. Some common versions are AlpacaEval and MT-Bench, probably the two most common ones. So when we started doing that, that's the problem I told you about, we started around last year, and we found that using GPT-4 for evaluation, at least if you look at the prices now, would be 100 times faster and 100 times cheaper than if you use human evaluations. But, and this is very surprising, the agreement with humans is actually higher than humans' agreement with themselves.
[00:46:57] What I mean by that is this, this is what we found: say I have a pool of four humans, and I take out one human and look at the agreement between that human's preferences and the mode of the preferences of the three others, and I do that in a leave-one-out fashion. That agreement will be lower than if I ask the model to predict the mode of the humans' preferences. So in some ways, models are more highly correlated with humans than humans themselves, which is very surprising, and I will tell you a little bit more about it in two seconds. When we did that, we actually used it for collecting preferences for RLHF; that's what we call RLAIF, as I think Archit told you about last week.
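The leave-one-out comparison just described can be sketched concretely. The data below is made up; the point is the two quantities being compared: each held-out human against the mode of the remaining humans, versus an LLM judge against the mode of all humans.

```python
# Sketch of the leave-one-out human agreement vs. LLM-judge agreement
# comparison described above (toy data, illustrative only).
from collections import Counter
from statistics import mean

def mode(labels):
    return Counter(labels).most_common(1)[0][0]

def loo_human_agreement(votes_per_item):
    # votes_per_item: list of per-item lists of human preference labels.
    scores = []
    for votes in votes_per_item:
        for i, held_out in enumerate(votes):
            rest = votes[:i] + votes[i + 1:]
            scores.append(held_out == mode(rest))
    return mean(scores)

def judge_agreement(votes_per_item, judge_labels):
    # Agreement of a (deterministic) judge with the human mode per item.
    return mean(j == mode(v) for v, j in zip(votes_per_item, judge_labels))

votes = [["A", "A", "B", "A"], ["B", "B", "A", "B"]]
judge = ["A", "B"]                      # the LLM judge's preferences
human = loo_human_agreement(votes)      # 0.75 on this toy data
model = judge_agreement(votes, judge)   # 1.0: consistent judge matches mode
```

A consistent judge can beat the leave-one-out human score simply because individual humans disagree with their own majority, which is the variance point the lecture makes next.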
[00:47:53] So, going back to this surprising result that models are more highly correlated with humans than humans themselves: the reason is that humans actually have high inter-annotator disagreement, essentially high variance. Models will always be very consistent, or maybe not perfectly, there's still some stochasticity, but essentially they will always predict the same label, so they have very little variance. So here, what you see on this plot: on the, sorry, x-axis we estimated the variance, and you see that the human has a variance of around 31 or 33. If you look at the red point, this is basically if you just ask GPT-4 to do the evaluations: even though the bias is still pretty high, and bias by definition for humans is zero, while for GPT-4 it is around 32%, the variance is much lower than for humans. This is why you can see that agreement is actually sometimes higher: it's really because there is no variance, or very little variance, in LLMs. Yeah, does that make sense? [Student: so the agreement of LLMs is higher than a human's?] Sorry, it means the internal consistency is higher, exactly. Which is actually a good sign, because that makes it much easier for research; the bad sign is that the bias is still high. Yeah. Okay, so, things to be careful with when you work, and this is both with humans and with LLMs: there will be some spurious correlations. We already talked about spurious correlations, but you will see a lot of those. One very common example is length: as I told you before, if you ask crowd workers which examples they prefer, they are highly biased towards longer outputs. So here the blue is humans, it's around, I think, 70% preference for longer outputs.
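The length bias just quoted (roughly 70% preference for the longer output) is easy to measure on your own preference data. A minimal sketch, with made-up data:

```python
# Sketch: estimate how often annotators (human or LLM) prefer the
# longer of two outputs, the length bias discussed above. Toy data.

def longer_preferred_rate(pairs):
    """pairs: list of (output_a, output_b, preferred), preferred in {'a','b'}.
    Pairs with equal lengths are skipped."""
    hits, total = 0, 0
    for a, b, preferred in pairs:
        if len(a) == len(b):
            continue
        longer = "a" if len(a) > len(b) else "b"
        hits += (preferred == longer)
        total += 1
    return hits / total if total else None

pairs = [
    ("short answer", "a much longer, more detailed answer", "b"),
    ("another long and detailed reply here", "ok", "a"),
    ("tiny", "a somewhat longer output", "a"),
]
rate = longer_preferred_rate(pairs)  # 2/3: longer output preferred twice
```

A rate far above 0.5 on data where longer is not actually better is the spurious correlation the lecture warns about.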
[00:49:46] And models show around the same bias. Another example is preference for lists: usually, if you see lists in an output, models prefer those examples, and humans prefer those examples too. Another bias, or spurious correlation, is position. I told you how which one you put on the left and which one you put on the right matters when you ask humans to label; there's the same thing with models, but this is usually pretty easy to control for: you just randomize both. Another issue is GPT-4 self-bias. Very naturally, you might wonder: if I ask GPT-4 to evaluate itself, it will probably be biased, it will prefer itself over other models. This is true, but less so than what you might think, as I will tell you about later. Okay, so AlpacaEval. Wait, until what time do I have? [You have 30 minutes.] Oh, thanks.
[00:50:42] Great. Um, okay, AlpacaEval. So AlpacaEval is the benchmark that we developed when we were working on Alpaca. As I told you before, one thing which is very important is what you use for development, so basically for hyperparameter tuning. What we did is that we basically did not trust many of the benchmarks out there at that point for instruction following, so we just developed a very small benchmark for ourselves, and this is what we were using for hyperparameter tuning; and then it kind of became its own thing. So, AlpacaEval in a few numbers: it has very high correlation with Chatbot Arena. If you look at the correlation between the ranking in Chatbot Arena and in AlpacaEval, it's 98%, so very high, and it takes around 3 minutes and $10 to evaluate a model. As for the way it works, I think I already mentioned it.
[00:51:33] Basically, you take an instruction, you generate an output from one model and from another model that you're comparing it to, and you ask GPT-4 to give the probability that it prefers the model you're evaluating over the baseline you're comparing to. Then you do some reweighting, and the reason you do some reweighting is that these models, as I said, are very biased towards longer outputs, so you want to reweight such that, if it's a longer output, you give it a slightly lower preference. And then you average across your entire dataset and you get a win rate. So that's how it works. Any questions? Cool. Um, so, system-level correlation. Here, what you see on the x-axis is basically AlpacaEval, I mean a slight transform of it, but essentially the AlpacaEval scores, and on the y-axis is Chatbot Arena, which is the gold standard, and you see that things are relatively highly correlated. On the lower plot you see basically the correlation between different benchmarks and Chatbot Arena, and you see that MT-Bench and AlpacaEval, which are the two that use LLMs for evaluation, are relatively highly correlated with Chatbot Arena, and MMLU, which is the automated one that doesn't use an LLM, is also very highly correlated. Um, so I told you very briefly about the fact that we had to do some reweighting. I'm not going to tell you how we do it, but I want to tell you why we do it. One of the issues that we realized a little bit too late is that if you take something like GPT-4 and you just prompt it to be much more detailed, to basically provide much more detailed answers, its win rate, so its performance on your benchmark, changes a lot.
[00:53:27] It goes from 50% to 64.3, so that's this one, 64.3. If you ask it to be more concise, it decreases to 22.9, and that really doesn't fit our mental model of what benchmarks should be doing: if I just tweak the prompt a little bit, I don't want my model to completely change its ranking. So that's why we have to do some reweighting, and you see that after the reweighting, the performance after you ask the model to be more verbose is very close to the performance without any prompt tuning. Cool. So, I told you slightly, or very briefly, before about self-bias. I do want to say that I'm pretty surprised about this result, but actually self-bias exists and is not as high as you might think. So here you see on the rows the different models that you're evaluating, and on the columns you see who is evaluating, which model you are using for evaluation. And you actually see that regardless of the model that you evaluate with, the ranking will be the same. So even though it's true that, if I look at Mistral evaluated by Mistral, it gives itself a much higher accuracy, it still prefers Claude and GPT-4. So it's not as bad as what you might think; it's still bad, though. Cool. Okay, so that leads me to talking about current evaluation of LLMs. I would say there are three main ways that people currently evaluate LLMs. The first one is perplexity, which is essentially just looking at training losses or validation losses. The second one is basically averaging over everything, which is actually surprisingly more common than what you might think. And the third one is this Arena-like style, where you basically have side-by-side comparisons between models.
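The first of those three, perplexity, is just the exponential of the average per-token negative log-likelihood on held-out text, i.e. a monotone transform of the validation loss. A minimal sketch:

```python
# Perplexity as the exponential of the average per-token
# negative log-likelihood under the model.
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities the model assigned
    to each token of the held-out text."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token has perplexity 4.
lps = [math.log(0.25)] * 10
ppl = perplexity(lps)  # 4.0
```

This is why pretrained-model releases can report it directly from their evaluation loss, while for fine-tuned chat models the predicted likelihoods are, as noted below, poorly calibrated for this purpose.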
[00:55:30] You either use humans or use models to do the evaluation. And usually how it works is that for pretrained models, say when Llama 4 comes out or when GPT-5 comes out, they basically mostly show perplexity and averages over everything; and the fine-tuned models usually tend to show averages over everything and Arena-like performance under Arena-like setups. The reason why is that for models that are fine-tuned, the log-likelihood that they predict is not calibrated for your dataset. So what do I mean by averaging over everything? I would say the two most common benchmarks that basically look at everything are HELM and the Hugging Face Open LLM Leaderboard. It's really just a collection of a lot of different automatically evaluated benchmarks, and you evaluate across all of them. So what are some of the common benchmarks that we use? One is measuring math performance, so GSM8K, that's a pretty common one, basically grade-school math. MMLU is multiple-choice question answering on, like, math, science, history. LegalBench is on the legal side. Then you have MedQA, and I believe this is for HELM: MedQA is on medical licensing exams. So you basically ask many, many different questions that you can automatically evaluate, and you hope that by taking averages it will say how well your model performs. That's kind of the newer version of SuperGLUE, I would say. One dataset, or one benchmark, that I want to highlight, which is probably the most widely used and the one that people believe the most, is MMLU: Massive Multitask Language Understanding.
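Scoring an MMLU-style benchmark is mechanically simple, since every question is multiple choice: it reduces to exact-match accuracy over predicted option letters. A sketch (in practice, reliably extracting a clean letter from the model's free-form output is the harder part, which is one reason it's "more complicated than you might think"):

```python
# Sketch of MMLU-style scoring: exact-match accuracy over the
# predicted option letters of multiple-choice questions.

def multiple_choice_accuracy(predictions, answers):
    """predictions, answers: equal-length lists of option letters."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

preds = ["A", "C", "B", "D"]
golds = ["A", "B", "B", "D"]
acc = multiple_choice_accuracy(preds, golds)  # 0.75
```

The benchmark then averages this accuracy across its 57 task categories to produce the single headline number.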
understanding um so this is I I think maybe Archard [00:57:22] this is I I think maybe Archard mentioned it last week but this is [00:57:25] mentioned it last week but this is basically um uh multiple choice uh [00:57:29] basically um uh multiple choice uh questions on 57 different tasks you so [00:57:32] questions on 57 different tasks you so you have tasks like formal logic [00:57:34] you have tasks like formal logic conceptual physics econometrics and and [00:57:36] conceptual physics econometrics and and uh these type of tasks so here's an [00:57:38] uh these type of tasks so here's an example um what is true for type 1 a [00:57:42] example um what is true for type 1 a supernova uh this type occurs in binary [00:57:45] supernova uh this type occurs in binary system this type occurs in young [00:57:46] system this type occurs in young galaxies and you basically have to say [00:57:48] galaxies and you basically have to say which answer so that seems very simp I [00:57:50] which answer so that seems very simp I mean the task is not simple but the way [00:57:52] mean the task is not simple but the way you evaluate seems simple uh and then [00:57:54] you evaluate seems simple uh and then like high school biology in a population [00:57:55] like high school biology in a population of Gira an environmental and then you [00:57:58] of Gira an environmental and then you this is an example of directional [00:58:00] this is an example of directional selection um so that seems simple but [00:58:03] selection um so that seems simple but actually it's it's also more complicated [00:58:05] actually it's it's also more complicated than what you might think [00:58:08] than what you might think um and I think I will tell [00:58:11] um and I think I will tell you okay I will tell you about it later [00:58:14] you okay I will tell you about it later um but that's that's mo one of the most [00:58:16] um but that's that's mo one of the most common probably the most common [00:58:18] 
And it's what people actually look at: for example, when Mark Zuckerberg said that Llama 3 was out, he talked about MMLU scores, which I find kind of crazy, but yeah.

Other capabilities that people look at: coding. Coding is a very common one that people evaluate on, for a few different reasons. One, because if a model performs well on code, it usually also performs well on reasoning, which is actually pretty cool, so it's highly correlated with things that people care about. Two, a lot of us are coders, so we like to have better models to help us code. And three, it's actually pretty easy to evaluate, because you can write test cases: you basically ask the model to generate code, or functions to do something, and then you just run the tests and see whether it succeeds or not.

Yes? [Student:] "Sorry, going back to the previous evaluations: some of them were short answer. Multiple choice makes sense, but if it's short-answer QA, how would you say something is correct as an automatic metric, thinking specifically of the top one?" Huh, I actually don't know; I should check, sorry. So I don't know specifically for this one, but HotpotQA and BeerQA are other QA datasets, and they look at F1, and then they also have an exact match, which is pretty punitive, because if you say "President Reagan" and the answer is "President Ronald Reagan," it will penalize you. But anyway, they use an exact match on that. Cool, thanks. Okay, so we were at coding.
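The exact-match versus F1 distinction just mentioned can be sketched with the standard SQuAD-style token-level metrics. A minimal illustration (the normalization here is deliberately simplified; real evaluation scripts also strip articles and punctuation):

```python
# SQuAD-style short-answer metrics: exact match (EM) is all-or-nothing,
# while token-level F1 gives partial credit for overlapping words.
from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# EM punishes the missing middle name; F1 still gives partial credit.
print(exact_match("President Reagan", "President Ronald Reagan"))  # 0.0
print(token_f1("President Reagan", "President Ronald Reagan"))     # 0.8
```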
Another one that people are starting to look at is agents. I think Shikhar is going to give a lecture on it, so I'm not going to talk too much about it, but one cool thing that LLMs can do right now is basically call APIs and then take actions in the real world, essentially, or take control of your computer. (You should not give it control of your computer.) So a natural question is how you evaluate these types of things, and this is a real challenge. The biggest challenge is that if, for example, I really wanted to evaluate how good a model is at coding, or how good it is at doing things in my terminal, I would need to give it access to my terminal, and I really don't want to give my LLM access to my terminal. So you really need sandbox environments. For the specific case of the terminal it's pretty easy to sandbox, but once you want to evaluate a model that, I don't know, pings people on Slack or writes things in your emails, then you have to write an entire sandbox environment for all the applications that you want your LLMs to have access to. So this is actually really complicated, and something that people really have to deal with in the real world. At least we have to, because right now it's still not in production.

Okay, the last part, or the penultimate one: perplexities. One thing which is very surprising, at least the first time you see it, is that the performance you get during pre-training is extremely highly correlated with performance on basically any downstream task, at least for the current types of LLMs. What I mean by this is that your training performance, just predicting the next word, is extremely highly correlated with downstream performance.
So on this plot, the x-axis is essentially perplexity, and the y-axis is just the average over many different tasks. What you see is that models that do well on perplexity will actually have high average scores. As a result, a lot of people, while they're developing, end up just looking at perplexities, and they trust it enough that they don't do the downstream evaluations. I would not recommend doing that, but if you have to have something quick and dirty, it usually works pretty well. One thing to be careful with, though, is that perplexities are not comparable across different datasets, so you really have to be careful about which perplexities you're looking at. And two, it will depend on the tokenizer: if you take, say, Llama 3 and compare it to Gemini, even on the same dataset it's going to give different scores, and it's not comparable.

Yes? [Student question, inaudible.] The easy answer, I mean it's not the only answer, but the easy answer, is that if the vocabulary changes, the size of the vocabulary changes, then clearly the upper bound is different. [Student follow-up about sequences, partly inaudible.] But I'm not talking about that. I'm talking about the fact that, just think about it: if you have a vocabulary size of one, then I always have to predict the same thing. Basically, your entropy is upper-bounded by the log of the cardinality of your vocabulary, so you're going to depend on that.

Cool.
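To see why the tokenizer matters: perplexity is the exponentiated average negative log-likelihood per token, and a maximally uncertain model over a vocabulary of size |V| has perplexity exactly |V|, so the achievable range itself depends on the vocabulary. A toy sketch (the vocabulary sizes are illustrative, not any particular model's):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A maximally uncertain model puts probability 1/|V| on every token, so its
# entropy is log|V| and its perplexity is exactly |V|: the scale of the
# number is tied to the tokenizer's vocabulary size.
for vocab_size in (32_000, 128_000):
    uniform = [math.log(1 / vocab_size)] * 10
    print(round(perplexity(uniform)))  # 32000, then 128000
```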
And the last one is the Arena. As I already told you, you basically compare different models, you make them fight against each other essentially, and you have Elo ratings at the end. A more general way of saying it: you really just let the users decide. And that works pretty well too.

Okay, issues and challenges with current evaluations. First, consistency issues. If you look at question answering, sorry, multiple-choice questions: you see on the top left and top right that if you just change A/B/C/D to random symbols, the generations you get are actually going to be different, and then the rankings between different models will be different. So even things that are very simple, like multiple choice, like selecting out of four choices, will be very dependent on exactly how you format these choices.
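Going back to the Arena for a second: the Elo-style pairwise updates it is built on can be sketched as follows (K = 32 and the 400-point scale are the usual chess conventions; the real leaderboard fits ratings more carefully):

```python
# Minimal Elo updates of the kind Arena-style leaderboards are based on:
# each pairwise "battle" moves the winner up and the loser down, by more
# when the result was an upset.

def elo_update(r_winner, r_loser, k=32.0):
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))  # P(winner wins)
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

a, b = 1000.0, 1000.0
for _ in range(3):  # model A wins three user votes in a row
    a, b = elo_update(a, b)
print(a > b)  # True: A has pulled ahead of B
```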
And one real example, which is what I was alluding to before, is MMLU. MMLU seems really simple to evaluate: you just ask which one of the four choices the model prefers. But actually, for a very long time, I think for nearly one year, there were three main implementations of MMLU, and people were comparing between those three having no idea that they gave different scores. The two main differences were: one, people used different prompts, and that clearly will give different answers; but two, they were using different ways of sampling to get the actual most likely prediction. One of them, for example, said: I have the four choices, and to get the most likely answer (let's say the correct answer is D) I will just look at the most likely answer out of A, B, C, D, even if, say, "zygote" was another answer that had a higher likelihood. I will not look at it, because I will basically do constrained decoding, and if I do constrained decoding here, I will say that the correct answer is D. But if I actually just look at the most likely token overall, I will not get the correct answer. So those were two different implementations. And a third implementation, which seems really different, is that instead of generating the correct token, which is basically the letter A, B, C, or D, you look at the likelihood, after this question, that the model would generate each full answer. So you look at the log-likelihood, or essentially the perplexity, of predicting each answer text. And that gives very different answers.
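To make the three scoring schemes concrete, here is an illustrative sketch on invented log-probabilities (the numbers and the "zygote" distractor token are made up for the example; real implementations score actual model outputs):

```python
# Three ways to score the same MMLU question, on toy (invented) numbers.
# Suppose the model's next-token log-probs after the prompt are:
next_token_logprobs = {"A": -3.0, "B": -2.5, "C": -4.0, "D": -1.8, " zygote": -1.2}

# 1) Constrained decoding: argmax over the four letters only.
constrained = max("ABCD", key=lambda t: next_token_logprobs[t])

# 2) Unconstrained: argmax over the whole vocabulary; here " zygote" wins,
#    so the model is marked wrong even though D beats A, B, and C.
unconstrained = max(next_token_logprobs, key=next_token_logprobs.get)

# 3) Answer-text likelihood: score the log-likelihood of each full answer
#    string given the question (toy numbers again) and pick the highest.
answer_text_logprobs = {"A": -12.0, "B": -9.5, "C": -11.0, "D": -10.2}
by_likelihood = max(answer_text_logprobs, key=answer_text_logprobs.get)

print(constrained, unconstrained, by_likelihood)  # three different "answers"
```

On these toy numbers the three implementations return D, " zygote", and B respectively: same model, three different scores.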
So if you look at the top right, you see that Llama 65B's MMLU score on HELM was 63.7, and with the original implementation 63.6, but on the Harness, which is the one that Hugging Face actually uses, it was 48.8. So that's a huge difference. [Student:] "What are HELM, Harness, and original? Do they match these three schemes?" Yeah, I can't remember which one does which, but each of them does something different. Actually, that's not true anymore: the middle column changed what they were doing, so they started matching the other two, but at that time they didn't. I'm not sure which one, but my guess would be that they did the last one; I'm not sure, though. Okay, questions? Cool.

Another issue: contamination. So here you have Horace He; if you don't follow him on Twitter, you should. He basically said that he was looking at code benchmarks, and on pre-2021 Codeforces questions GPT-4 was getting 10 out of 10, but on problems after 2021, more recent problems, it was getting zero out of 10, which seems very, very strange. So that strongly points to the fact that it was contaminated: the model was probably pre-trained on that data, or the Codeforces dataset was probably in the pre-training data. And of course, if you essentially do training on your test set, then you're going to perform really well. Susan Zhang, also someone to follow, said something similar about Phi-1.5, which is a model from Microsoft.

So what is challenging here is that with closed models there are actually two things that are challenging. One is that these are pre-trained on so much data that, even if we had access to the data, it would be hard to actually know whether they were pre-trained on your test set.
But two, those are all closed-source models, so you don't even have access to the dataset, and you have no idea if they were pre-trained on that data.

Overfitting issues: that's related, but can be slightly different. Here you see how much time it took for standard datasets to achieve, in quotes, "human-level performance," and what you see is that for the recent ones, where you really have this pre-training, in less than six months you reach human-level performance. We don't really know if it's because of contamination, or if it's simply that a lot of people are developing and doing hyperparameter tuning on these test sets. We don't know why, but it's clearly an issue with overfitting.

So how do you alleviate that? One: you can have private test sets. There's a paper from, I think, two weeks ago that presented GSM1k, which is the same thing as the GSM8K we saw before, the math dataset, but they basically regenerate, or resample, or recollect the dataset. Then they look at how well different models perform on both GSM1k and GSM8K, and what you see is that, at least for the open-source models, they perform much worse on the new dataset than on the one that people were able to tune on. This is not true, though, for Claude and GPT-4.

Another one is DynaBench, or just dynamic test sets. Ideally, every X number of days you would have new instructions, or new inputs to the models, and your dataset would basically be dynamic. That's essentially also what Chatbot Arena does, so that definitely helps.

Another way of alleviating contamination is to try to estimate, or to look at, whether the models were actually trained on your test set. One very simple way of doing it, which I think actually works relatively well, is just looking at the probability of different answers: if your model is really sure about a certain answer, then it was probably trained on that answer. Another one, which is also really cool, is looking at the order of your test set: if a model was trained or pre-trained on the test set, then most likely it thinks that example two comes after example one. So if you switch example one and example two and you see drops in log-likelihood, then most likely the model was actually pre-trained on that dataset.
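The order-swap check can be sketched like this. The `seq_logprob` function is a toy stand-in for a model's sequence log-likelihood, rigged here to have "memorized" the test set in its original order; with a real model you would score the concatenated examples both ways:

```python
# Toy sketch of the order-swap contamination test: if a model's likelihood
# drops sharply when two adjacent test examples are swapped, it has probably
# seen the test set in its original order during pre-training.

MEMORIZED = "example_1 example_2 example_3"  # pretend pre-training saw this

def seq_logprob(text):
    # Hypothetical stand-in for a model's sequence log-likelihood: the
    # "memorized" order is very likely, any other order much less so.
    return -1.0 if text == MEMORIZED else -20.0

original = "example_1 example_2 example_3"
swapped = "example_2 example_1 example_3"  # swap examples 1 and 2

drop = seq_logprob(original) - seq_logprob(swapped)
print(drop > 5.0)  # True: a large drop suggests contamination
```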
Cool, any questions here? Okay. So another issue is that there's really a monoculture of NLP benchmarking. What I mean by this is mostly the fact that we all just look at English. This is a paper from 2021 or 2022, I think: they look at ACL 2021, which is probably the most common conference in NLP, and at the best papers, the oral papers. They saw that out of the 461 papers, 70% of them only look at English and 40% of them only look at accuracy, so essentially just performance. There are very few papers that look at multilinguality, or even at efficiency, interpretability, or fairness. And there's a similar paper that analyzes another conference from 2008, with essentially the same finding, so unfortunately it doesn't seem to improve over time.

The thing is, there are actually a lot of benchmarks for multilinguality. I'll just highlight a few here: MEGA, GlobalBench, XTREME. Those have at least 30 or 40 languages and many, many different tasks. So it's not that we don't have the benchmarks; it's that there are, unfortunately, no incentives in academia to actually evaluate on those benchmarks. So if you have the chance, use those benchmarks.

Another issue is that we reduce everything to a single metric. I already told you that the way we aggregate metrics is usually kind of broken in some of these super-benchmarks, but also, we only look at performance, and in the real world we really care about computational efficiency too; we also care about biases and about many other aspects, and most of these benchmarks don't consider those.
same weight um so this is definitely unfair [01:13:13] weight um so this is definitely unfair for like minoritized groups but more [01:13:15] for like minoritized groups but more than this I think if for example if you [01:13:19] than this I think if for example if you think about like um agents where maybe [01:13:21] think about like um agents where maybe one example will be like how well it [01:13:23] one example will be like how well it performs on um [01:13:26] performs on um I don't know writing codee that will [01:13:27] I don't know writing codee that will actually be put in production versus [01:13:29] actually be put in production versus just like [01:13:31] just like uh like answering your daily question [01:13:34] uh like answering your daily question about like where I don't know where to [01:13:36] about like where I don't know where to buy the best burger um like the value [01:13:40] buy the best burger um like the value that you will get out of these examples [01:13:41] that you will get out of these examples are very different and we right now when [01:13:43] are very different and we right now when we evaluate stuff we don't actually [01:13:44] we evaluate stuff we don't actually consider that so that's I think a real a [01:13:46] consider that so that's I think a real a real issue um and also we basically we [01:13:49] real issue um and also we basically we don't take into account that different [01:13:50] don't take into account that different people have different [01:13:52] people have different preferences um so a few outs one [01:13:56] preferences um so a few outs one considering computational efficiency so [01:13:58] considering computational efficiency so ml puff has a great Benchmark uh where [01:14:00] ml puff has a great Benchmark uh where basically instead of trying to maximize [01:14:02] basically instead of trying to maximize the performance on a certain Benchmark [01:14:05] the performance on a certain Benchmark they say I want to 
achieve that performance in the least amount of time. So now you basically consider both accuracy and speed, either for training or for inference. [01:14:18] For biases, DiscrimEval is a good dataset from Anthropic, where basically they have some templates: they ask questions like whether someone should keep their insurance or not, and they have templates where they change the race or the gender of the person in the template, and they see how the decisions made by the model change. And, unfortunately but unsurprisingly, you will see that some groups are discriminated against much more than others. [01:14:58] There are other biases in our evaluations too. I already told you a bit about the multilingual issues, but honestly, this issue with English is much more prevalent than you would think.
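A DiscrimEval-style probe can be sketched like this; the template wording and the `model_decide` hook are hypothetical stand-ins for illustration, not the actual dataset or API:

```python
from itertools import product

# Hypothetical decision template in the spirit of the templates described above.
TEMPLATE = ("Should the company renew the insurance policy of a "
            "{race} {gender} customer? Answer yes or no.")

def probe_decisions(model_decide, races, genders):
    """Ask the same question for every demographic variant and collect the decisions."""
    return {(r, g): model_decide(TEMPLATE.format(race=r, gender=g))
            for r, g in product(races, genders)}

def yes_rate(decisions):
    """Fraction of variants answered 'yes'; an unbiased model should be flat across groups."""
    return sum(d == "yes" for d in decisions.values()) / len(decisions)

# Usage with a constant stand-in model; a real evaluation would call an LLM here.
decisions = probe_decisions(lambda prompt: "yes", ["A", "B"], ["woman", "man"])
print(yes_rate(decisions))  # 1.0 for this constant stand-in
```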
Take BLEU and ROUGE: they really assume that you basically have access to words, that you know how to tokenize and how to get words. I used to work with Thai and Vietnamese: with Vietnamese you have spaces in between words, and in Thai you have no spaces between words, so you have no idea how to run BLEU or ROUGE. Really, it's much more than just the data: all our algorithms are really focused on English, or at least Western languages. [01:15:35] Then there are biased LLM-based evaluations. One thing I told you about is that it's really cool that you can now use essentially GPT-4 for doing labeling, but that also means that, given that GPT-4 is very consistent, if it has some biases, then most of the NLP community will have these biases, scaled up, essentially.
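The whitespace assumption mentioned a moment ago is easy to demonstrate with a toy unigram-precision score (a simplified stand-in for BLEU/ROUGE, not their real implementations):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision over whitespace tokens: the assumption BLEU and ROUGE bake in."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(n, ref[tok]) for tok, n in cand.items())
    return matched / max(sum(cand.values()), 1)

# Space-separated text gets sensible partial credit...
print(unigram_precision("toi an pho", "toi an com"))  # 2/3: two of three tokens match

# ...but unsegmented text (as in Thai) is one giant "word": all or nothing.
print(unigram_precision("Ieatnoodles", "Ieatrice"))   # 0.0
```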
[01:16:06] One benchmark tries to look at whose opinions LLMs reflect by default. This is actually pretty cool work that looks at the output distribution of LLMs on public opinion surveys, just trying to understand which groups' opinions LLMs reflect. They find that when you only do pre-training, the models actually do relatively well: they are not too optimized toward a single group. But after fine-tuning, this is in red, you basically see that the models really start being optimized for certain preferences, which is unsurprising, because that's how we actually train the models. And typically these models mostly answer as if they were from, I mean, white and Southeast Asian respondents. I think that observation is actually pretty interesting; I think it's probably because a
lot of these models, the human data that was used for supervised fine-tuning and for RLHF was actually labeled by people in Southeast Asia, which would explain why these models have these types of views, and usually by highly educated people too. [01:17:26] Okay, so this is the main challenge, the challenge of all challenges. We saw that there are many challenges in evaluation, at least in academic benchmarking, but the biggest one is that there are really no incentives for us to move to anything else. There's an actually pretty interesting paper that looks at machine translation papers, many papers from 2019 to 2020 in machine translation, and they found that 82% of the papers evaluated only BLEU scores. And as we said, BLEU scores have many, many issues, and we know that there are many better metrics, but still, people are not
incentivized to look at anything else; actually, reviewers will usually ask you to show performance on BLEU scores. So it's not even just that you're incentivized not to look at something else: you're also incentivized to continue. And it kind of makes sense, because you want to be able to compare to methods from two or three years ago, but it also means that it's hard for the academic field to change to other benchmarks. This is really specific to academia, though: in reality, if you know that your metric is bad, just switch. [01:18:37] Okay, evaluation takeaways. First, I mentioned that there are different types of evaluation, and different desired properties for the different types of evaluation. Then I talked about close-ended tasks and how you evaluate those: the fact that it's basically standard machine learning,
but that you have to think carefully, even though it's standard machine learning, about how you evaluate them. Then there are open-ended tasks, where you typically look at content overlap metrics, so things like BLEU and ROUGE, and BERTScore. And then you have chatbot evaluations, which are extremely difficult, but which people have started doing using essentially LLM-based evaluations. Then we talked about challenges: one of them being consistency, another contamination, and the third one biases. [01:19:32] In reality, honestly, the best evaluation is just to check your outputs. I think too many people just believe numbers; in reality, never just believe numbers. I remember when we initially did Alpaca: we kind of believed our AlpacaEval, but once we actually played with it, that's when we were like, okay, this thing is actually, I mean, at that
time, good; now it would be a pretty bad model, but at that time we were like, okay, this thing is actually pretty good, we should do something about it, even though on maybe standard academic benchmarks it was pretty bad. So yeah, don't rely on numbers. And I'm happy, what time is it, to take any other questions that you may have. [01:20:18] Yes? Question: there's this whole issue of bias which we're really trying to deal with, but we're sweeping under the rug here. So if we have a problem where we're dealing with a very specialized domain, and we try to run reference-free evals using, let's say, GPT-4: is it considered bad practice to be checking a subset of these GPT-4 evals, ranking them ourselves, and then inserting ourselves and our bias
into this process, by actually looking at many, many data points? So, just to make sure I understand your question: you're saying that if we try to look at the answers ourselves, we might be incorporating some biases there? Yes, but we should look at the answers to make sure that GPT-4 isn't being biased when it looks at the answers; there's this tension here, and I don't know what the, because in a controlled scientific experiment you would blind yourself to looking at these answers. How do you deal with this? Yeah, that's a good question; I actually don't quite know. But one thing: I actually feel less concerned about the biases of a single person. My issue with the GPT-4 biases is that they are the same across every model, so things really scale up and it becomes a monoculture, and I think that's much worse than
if everyone incorporates a little bit of the biases that they have, in their own direction. I'm not saying that's the best answer, but I think it's slightly better than just going with whatever they have. Yeah? Following up on that: how does one avoid a situation where one is trying to solve a problem with a model, and one evaluates it with GPT-4, and then one starts to look at it and say, okay, is this good, and then one goes, okay, this is great, and everyone else in the world, and GPT-4, thinks it's a terrible, terrible model, and it's just some academic pressuring themselves into publishing something that doesn't actually work? How does the field structurally avoid situations like that? Well, I think that's one reason why they
want standardized benchmarks, and why every reviewer actually wants standardized benchmarks: because at least, even though everyone knows the benchmarks are wrong, they understand how they are wrong. So I think that's one perspective. Another thing, which doesn't completely answer your question but which I think could be a potential solution: the way I view GPT-4 is as something that is really good at performing what I want it to perform. Right now the thing is, I'm not very specific about what I want it to perform, and as a result it will basically come in with its own biases that come from its pre-training data or fine-tuning data. A potentially better way of doing it is that I could write exactly what I want. Right now, when we do the prompting to GPT-4, I basically ask a simple question, like: how good is the summary, out of five?
But a much better way would probably be writing a very detailed rubric of everything that has to be in the answer for it to be a good answer. If you think about it, this is exactly what professors do when they evaluate for a class: they basically say, okay, Yan is a TA, but I cannot trust him blindly, so what I will do is write a very detailed rubric, and I trust that he can apply that rubric. I think that's also how we should be thinking about GPT-4, and this is not how we currently do it. [01:24:13] Any other questions?

================================================================================ LECTURE 013 ================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 12 - Efficient Training, Shikhar Murty
Source: https://www.youtube.com/watch?v=UVX7SYGCKkA
---
Transcript

[00:00:06] Okay, cool, let's just get started. Welcome, everyone, to lecture 12. So far we've learned a lot about how we convert words into vectors, and how we convert sentences
into vectors, and how we basically take actions in the real world using that, like classifying documents. We learned about Transformers, and we learned about pre-training. Today is going to be a little bit different: I'm going to be talking about how you can train large models on GPUs, and a few basics about how these ML systems work. It has nothing to do with natural language at all, but hopefully it's going to be useful for final projects. [00:00:51] I'm going to spend some time on mixed-precision training, some time on multi-GPU training with DDP and FSDP, and hopefully by the end of the lecture these terms will make sense, and some time on parameter-efficient fine-tuning. But before we get into the lecture, just some announcements. Proposal grades are going to be coming out shortly, hopefully by
the end of the day. Thank you so much for all the hard work; I know it's getting a little bit crammed with a lot of deadlines for assignment 4 and the project proposal, so thank you so much for all your hard work. The other thing is the project milestone: the details should be out shortly, if not already out, on the website. It's worth 5% of the overall grade, it's due 12 days from now, and it's a maximum of two pages. Really, the way to think about the milestone is to use it as a forcing function to get work done for your final project. [00:01:57] With that out of the way, let's jump into the material. I'm going to start by thinking about how parameters and gradients, and generally numbers, are represented in computers, and I promise it's going to be relevant to deep learning pretty soon. So let's start with floating point. How many
people here are familiar with this cartoon depiction of fp32? Okay, so some of you. So let's recap how floating-point numbers are represented in computers. First, fp32: that's 32 bits, so the memory requirement is 4 bytes. If you're thinking about neural networks, then for every single neural net parameter you need four bytes of GPU memory. The way to convert this cartoon into a real number is something like this: the first bit there is the sign, then the stuff in green represents the range, and the stuff in blue represents the precision. So for fp32, you can represent a pretty large range, and it's fairly precise.
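That cartoon can be made concrete: an fp32 value is 1 sign bit, 8 exponent bits (the green range field), and 23 fraction bits (the blue precision field), which plain Python can pull apart with `struct`:

```python
import struct

def fp32_fields(x):
    """Split a float's IEEE fp32 encoding into its (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit: the sign
    exponent = (bits >> 23) & 0xFF  # 8 bits: the "green" range field, biased by 127
    fraction = bits & 0x7FFFFF      # 23 bits: the "blue" precision field
    return sign, exponent, fraction

print(fp32_fields(1.0))   # (0, 127, 0): biased exponent 127 encodes 2**0
print(fp32_fields(-2.0))  # (1, 128, 0): sign bit set, exponent one higher
```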
The larger the green part is, the more numbers you can represent, which means smaller numbers and also larger numbers; and the more stuff in blue we have, the greater the precision in representing actual numbers. [00:03:27] Another popular data type, which takes half the memory of fp32, is fp16. The way we reduce memory is that we reduce the stuff in green, so there's less dynamic range, and also the stuff in blue, which means less precision. But the good thing is that we save memory: we slash the memory requirement in half. [00:03:59] So let's think of a scenario where you're trying to train a big neural network, and your model parameters and gradients are represented in fp32. You start training and suddenly you get an out-of-memory CUDA error. So, just based on what we've
seen so far, one possible solution is to cast everything into fp16, and if you do that, you reduce memory usage by half. [00:04:26] So let's work through some possible problems with doing something like that. Like I said, because there's less stuff in green, there's going to be less range, and that means a lot of very small numbers will get converted to zero, and a lot of really large numbers will overflow. And there's also less precision, because you have fewer bits in blue, which means you're going to get rounding errors; for example, 1 + 0.0001 gets rounded back to 1 in half precision. I have a little screenshot of how you can test various properties of data types. Basically, the things to look at are the epsilon and the smallest normal. The epsilon is the smallest number such that, if you add it to one, you don't lose any
precision; if you add a number smaller than epsilon to one, it just gets rounded back down to one. And the smallest normal is the smallest normal number that can be represented in fp16; anything smaller than that goes straight to zero. [00:05:33] For neural network training, if a lot of small numbers get rounded down to zero, that's actually not good. Here's a diagram that I took from an Nvidia blog post, showing some gradients during the course of training: more than half of these gradients would literally just get set to zero in fp16, which is kind of a problem, and that has to do with the range of fp16. The second problem is precision: we have less precision, so our updates are not going to be precise.
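These properties can be checked without a GPU: Python's `struct` supports the IEEE half-precision format, so a round-trip through it shows the epsilon and underflow behavior just described (in PyTorch, `torch.finfo(torch.float16)` reports the same numbers):

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE fp16, i.e. what casting a value to half does."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

eps = 2.0 ** -10            # fp16 epsilon: the gap between 1.0 and the next fp16 value
print(to_fp16(1.0 + 1e-4))  # 1.0: an increment below eps/2 is rounded away
print(to_fp16(1.0 + eps))   # 1.0009765625: exactly representable, no loss
print(to_fp16(1e-8))        # 0.0: far below fp16's smallest number, flushed to zero
```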
[00:06:21] Okay, so here's one possible solution: we're going to use fp16, but we're also going to use fp32. That's the high-level idea. What we're going to do is maintain a copy of the model in fp32, and let's call those the master weights. Then you get a little bit of data and run a forward pass, and you run that forward pass by converting from fp32 into fp16. Then you run a backward pass and get your gradient in fp16; so everything so far has happened in fp16. Then you take your gradients, upcast them into fp32, and update your master weights, and once you've updated your master weights, you copy them into the fp16 version of the neural network. So this seems like a reasonable scheme: I'm using fp16 on my GPU, but I have the full 32-bit precision also lying around somewhere, so I can have more precise updates.
[00:07:21] Okay, can someone tell me why this is still problematic? Any guesses? Yeah? "One issue would be that it's really slow, because you have to copy the 32-bit versions from the GPU into some..." Yeah, so that's a good point. You can often overlap I/O with forward and backward passes, so practically this is not a problem, but potentially, if your network is very, very small, this would be a problem. Yeah? "Gradients are usually fairly small, like individual gradients are usually fairly small, and when you copy the fp16-computed gradients into fp32 you may be sending your network somewhere you don't want it to go." So yeah, that's pretty much the right answer. So let's go back to this diagram that we had; this shows gradients in the backward pass.
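To make the master-weights argument concrete, here is a toy sketch with a hypothetical one-weight model, using `struct` round-trips to stand in for fp16 storage and a plain Python float to stand in for the fp32 master copy; the 4e-4 update size is invented for illustration.

```python
import struct

def fp16(x: float) -> float:
    """Simulate storing a value in IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

update = 0.0004           # hypothetical step, smaller than half of fp16's epsilon near 1.0
master = 1.0              # high-precision master copy (Python float stands in for fp32)
fp16_only = fp16(1.0)     # the same weight kept only in fp16

for _ in range(100):
    master += update                      # accumulates precisely
    fp16_only = fp16(fp16_only + update)  # rounds straight back to 1.0 every step

print(round(master, 6))   # 1.04  -> all 100 updates survived
print(fp16_only)          # 1.0   -> every update was lost
```

This is exactly the failure the master-weights scheme fixes: updates happen in high precision, and only the result is cast back down for the next forward pass.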
[00:08:19] I said that we're going to compute all our gradients in fp16, and what's going to happen? Most of them will just get converted to zero, which is something that we really would like to avoid. So here's a possible solution: you get your batch of data, you run your forward pass in fp16, you get your loss, and you scale the loss by some large value, let's say 100, let's say 1,000, and then you compute gradients. Now you've scaled your gradient by a large number, so everything that we had on the left-hand side of this red line just gets shifted to the right, and hopefully there's less stuff that will get rounded down to zero.
[00:09:12] Okay, so then you compute your gradient in fp16, copy it into fp32, divide it by the scaling factor, and then update your master weights. So this will solve both of the problems that we talked about, and this is basically what we call mixed-precision training. And it's relatively simple to implement this in PyTorch: all you have to do is instantiate this GradScaler object, and then within the context of this autocast you run your forward and backward passes, and then scale down your gradient and update your model parameters. But this seems a little complex: we have to deal with scaling the loss and then scaling it back down. What if you multiplied it by 10,000 and that leads to NaNs? Then you have to update your scaler.
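Here is the underflow-and-rescue mechanic in isolation, again simulating fp16 storage with `struct`; the gradient value and the 1024 scale factor are made-up illustrative numbers, not PyTorch defaults.

```python
import struct

def fp16(x: float) -> float:
    """Simulate storing a value in IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

SCALE = 1024.0        # illustrative loss-scale factor (chosen for this example)
true_grad = 2e-8      # a tiny gradient, below fp16's smallest subnormal

# Without scaling, the fp16 gradient underflows to zero:
print(fp16(true_grad))               # 0.0

# With scaling: scaling the loss scales the gradient too, so it survives
# fp16 storage; then upcast to fp32 and divide the scale back out.
scaled = fp16(true_grad * SCALE)     # representable in fp16's subnormal range
recovered = scaled / SCALE           # ~2e-8, close to the true gradient
print(recovered)
```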
[00:10:16] In the next iteration you multiply by 1,000, and you have to keep adjusting to the network dynamics. So we'd like to not do gradient scaling; can we do something better? Okay, so the reason why we have to do the scaling: just recall the role of the bits in green, which tell you the dynamic range of the data type. We needed scaling because fp16 has a much smaller range compared to fp32, and because of that, fp16 cannot represent very small numbers. So how do we solve this? Any ideas? [00:11:03] Yeah, so here's the problem: in fp16, because you have fewer bits for the exponent, you can't represent very small numbers, so if you have something that's smaller than about 6e-5, it gets rounded down to zero, and that's because of the dynamic range of fp16. So how do you solve that?
[00:11:27] "Sacrifice precision, so have more green?" Absolutely, yeah, that's the right answer. So what we're going to do is sacrifice precision, and that's the idea behind bfloat16, which stands for brain float 16. You're going to have exactly the same number of bits for representing the range, eight bits, so it has the same dynamic range as fp32 but a lot less precision, and it turns out that this is okay for neural network training. And now, if you use bfloat16, you don't need to use grad scalers anymore; it's as simple as wrapping your model's forward pass and backward pass within the right context. The one caveat about bfloat16 is that it's not available on all GPUs: you need the more recent NVIDIA architectures, Ampere and newer, which the H100s, the A100, and the A6000 have.
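The trade-off falls straight out of the bit layouts. A small sketch that derives range and precision from exponent and mantissa widths alone (the `layout` helper is our own illustration, not a library API):

```python
# Derive a float format's range and precision from its bit layout alone.
def layout(exp_bits: int, mant_bits: int) -> dict:
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "eps": 2.0 ** -mant_bits,                        # precision ("blue" bits)
        "smallest_normal": 2.0 ** (1 - bias),            # low end of range ("green" bits)
        "max": (2.0 - 2.0 ** -mant_bits) * 2.0 ** bias,  # high end of range
    }

half   = layout(exp_bits=5, mant_bits=10)   # fp16
brain  = layout(exp_bits=8, mant_bits=7)    # bfloat16
single = layout(exp_bits=8, mant_bits=23)   # fp32

print(half["smallest_normal"])    # 6.103515625e-05: why tiny fp16 gradients vanish
print(brain["smallest_normal"])   # ~1.18e-38: identical to fp32's, same dynamic range
print(brain["eps"])               # 0.0078125: much coarser than fp16's 2**-10
```

Same eight exponent bits as fp32, hence the same dynamic range; the precision sacrifice all lands in the mantissa.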
[00:12:28] But if you have an older GPU, then you might not be able to utilize bfloat16. "Sorry, can you explain why there's less precision but the same number of bits? ...oh, never mind, sorry." Okay, so here are some results: someone fine-tuned DistilBERT for sentiment classification on a single A100. At the very top is float64, which is a really rich 64-bit representation of floating-point numbers: it takes about 25 minutes and you get a pretty high accuracy, but it also takes a lot more memory. And all the way down, we're using mixed-precision training with bfloat16, and now we have reduced training time by roughly a third, with more or less the same accuracy, actually a little bit better, because there's some regularizing effect from the half-precision representation, and a lot less memory.
[00:13:37] Okay, and the reason we see speedups for training is because matrix multiplies tend to be faster when you are multiplying in half precision. Okay, so before we move on, are there any questions about this? Okay, cool. So let's keep going, and let's change the setting: now we have more than one GPU, we have multiple GPUs, and we want to train a network over all of the multiple GPUs that we have. So let's start with some basics. Here's a cartoon showing, basically, a model and an optimizer receiving some data from a dataset, and let's work through what's stored in GPU VRAM. This is going to be somewhat of a lie, and I will point out what my lie is soon, but just to keep things simple: we have the neural net parameters, and let's say we're doing mixed-precision training, so they're stored in fp16.
[00:14:46] And then we have an optimizer. You know, when I first saw this a few years back, I was very surprised to see that optimizers also need memory, but if you're using something like Adam, then you need to store the Adam momentum term and the Adam variance, and every time you get a gradient you have to update the Adam momentum and variance; that's what you use for updating your parameters. And because you're using mixed-precision training, these have to be represented in fp32. Okay, so that's what the picture looks like if you have a single GPU. Now let's say we have multiple GPUs, and what we'd like to do is first divide our dataset: let's say we have four GPUs, so we'll divide our dataset into four parts, and we'll maintain a synchronized copy of the model, and every model receives its own slice of the dataset.
[00:15:43] Okay, so in the beginning we have a synchronized model and everyone has their own copy. We run a forward pass; this forward pass receives different data points, so every model is going to have different activations, and correspondingly every model is going to have different gradients. So you run a backward pass, and every model has a different gradient because it saw different data points, and then we're going to run a synchronization step. What synchronization is going to do is communicate gradients between different workers. So I'm going to introduce the first MPI primitive of this lecture, and that primitive is called the all-reduce operation. What all-reduce does is take four pieces of information, in this example on four different GPUs, merge everything together, and then distribute it to all of the GPUs.
[00:16:43] And the communication overhead of doing that is two bytes per parameter, because remember, we have fp16 gradients: two bytes per gradient, and that needs to be communicated, so the overhead is two bytes per parameter. Okay, so that's the all-reduce operation. And then once gradients have been communicated (they have to be communicated by gathering on one worker and then distributing the cumulative gradient), at that point every optimizer has the full gradient, and the optimizer can update the model so that you maintain synchronization. Okay, so that's the basic setup; that's known as distributed data parallel. That's good, but it turns out that it has really poor memory scaling, so let's go through the math for how much memory is needed.
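A single-process toy of what all-reduce computes (with summation as the reduction), using Python lists to stand in for four GPUs; real backends such as NCCL implement this with ring or tree communication, but the result is the same.

```python
# Toy all-reduce over 4 simulated "GPUs": merge (sum) every GPU's local
# gradient vector, then hand the merged result back to every GPU.
def all_reduce(per_gpu_grads):
    n = len(per_gpu_grads[0])
    total = [sum(g[i] for g in per_gpu_grads) for i in range(n)]
    return [list(total) for _ in per_gpu_grads]   # everyone gets the full sum

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # one row per GPU
print(all_reduce(grads))   # every GPU ends up with [16.0, 20.0]
```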
[00:17:45] So we have the model parameters in fp16, because we're doing mixed-precision training, and then for the gradient, we also have the gradient in fp16, so two bytes for the gradient. And then we have the stuff in green: let's say we're doing Adam, so we need to store the master weights (well, we need to store those regardless of whether we're doing Adam or not), and then we need to store the momentum and the variance. So that's 12 extra bytes per parameter, and this needs to be stored on every single GPU. And so the question is, can we do better than this? Now things are going to get a little bit more tricky, so if you have questions, just stop me, and we can go from there. The way we're going to improve our memory scaling is with a set of techniques that are together known as ZeRO.
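The bookkeeping above works out like this, a sketch of the per-parameter byte count under the mixed-precision-plus-Adam setup just described; the 7B-parameter model at the end is a hypothetical size for scale.

```python
# Per-parameter training state under mixed-precision DDP with Adam,
# as described above; every GPU stores all of it.
params_fp16   = 2   # model weights (fp16)
grads_fp16    = 2   # gradients (fp16)
master_fp32   = 4   # fp32 master weights
momentum_fp32 = 4   # Adam first moment (fp32)
variance_fp32 = 4   # Adam second moment (fp32)

bytes_per_param = (params_fp16 + grads_fp16
                   + master_fp32 + momentum_fp32 + variance_fp32)
print(bytes_per_param)                 # 16 bytes per parameter

# For a hypothetical 7B-parameter model, that's 112 GB of state per GPU
# (before activations), which is why this scaling is called poor:
print(bytes_per_param * 7e9 / 1e9)     # 112.0
```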
[00:18:43] ZeRO stands for Zero Redundancy Optimizer; this was a set of techniques released by Microsoft as part of their DeepSpeed project. And the idea is going to be that, instead of having every GPU contain all of this state (by the state I mean the stuff in blue, the stuff in orange, and the stuff in green), you're going to shard it, so that not every GPU has all of the parameters or all of the gradient, but by communication they can synchronize. So that's pretty much what the sketch for this is going to look like. So let's look at stage one; ZeRO has multiple stages, stage one, two, and three. In stage one we're going to shard the stuff in green, which was the optimizer state.
[00:19:39] And the way we're going to shard and still maintain synchronization is something like this: every GPU has the full set of parameters in fp16, and every GPU has its gradient for its data, but it only has a sharded copy of the full optimizer state. And the other requirement is that every GPU is responsible for updating the parameters corresponding to its own shard. So if you go step by step, this is what it looks like: every GPU has its own data, and every GPU gets a gradient on its subset of the data. Then we perform a reduce-scatter; this is the second MPI operation of the lecture (we've done all-reduce, and this second one is called reduce-scatter). What a reduce-scatter does is: every GPU has the full gradient on its data, and you want to communicate each chunk of that gradient to the GPU that owns that chunk.
[00:20:42] So let's say you're GPU 0 and you've computed the full gradient for all the parameters: you want to communicate the chunk for GPU 1 to GPU 1, and the same for GPUs 2 and 3. So from the full gradient, you communicate just the bits that a different worker wants to that worker, and every GPU has to do that; that's called a reduce-scatter. And then, once every worker gets the gradient corresponding to its shard, it's going to update its parameters, and once they have updated their shard, they're going to perform an all-gather. What that means is: let's say you have a neural network with just eight parameters, two parameters on each GPU. At the end of this, each GPU has updated its subset of parameters, and then they do an all-gather to maintain synchronization, so every GPU gets the full set of updated parameters.
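The eight-parameter example can be walked through in a toy single-process sketch of one ZeRO stage-1 style step, with four simulated GPUs and plain SGD at lr = 1 standing in for Adam to keep it short; all the numbers are invented.

```python
# Toy ZeRO stage-1 step: 4 simulated GPUs, 8 parameters, 2-parameter shards.
NUM_GPUS, SHARD = 4, 2
params = [float(i) for i in range(NUM_GPUS * SHARD)]   # synchronized on every GPU
per_gpu_grads = [[float(g)] * len(params) for g in range(NUM_GPUS)]  # per-GPU grads

# Reduce-scatter: GPU k receives the summed gradient for its own shard only.
shard_grads = [
    [sum(g[i] for g in per_gpu_grads) for i in range(k * SHARD, (k + 1) * SHARD)]
    for k in range(NUM_GPUS)
]

# Local update: GPU k updates only the parameters in its shard (SGD, lr = 1).
shards = [
    [params[k * SHARD + j] - shard_grads[k][j] for j in range(SHARD)]
    for k in range(NUM_GPUS)
]

# All-gather: every GPU reassembles the full updated parameter vector.
updated = [p for shard in shards for p in shard]
print(updated)   # [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0]
```

Each GPU only ever held optimizer-side state for its own two parameters, yet everyone ends the step with the same fully updated model.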
[00:21:44] Yeah? "If every GPU is maintaining all of this, and you're not merging them together, what makes this more efficient?" Um, sorry, could you repeat your question? "Can you go over why this is better than the previous approach?" Right, so what we're going to do is shard the optimizer state. So let's say, in a running example, we have a neural network with eight parameters. Earlier, we needed the optimizer state for all of the eight parameters on every GPU; now every GPU has to maintain optimizer state for only two parameters. So after the reduce-scatters are done, you have the full gradient corresponding to just two parameters, so the optimizer state is just for those two parameters, and the model is going to update only two parameters using the partial optimizer state.
[00:22:48] But you have to have the entire set of parameters to run, so you'll eventually get the rest of the parameters back. So you have the entire set of parameters, you have all the stuff in blue, and you have the full gradient for your subset, but you don't have the full optimizer state, so what you can do is update only the parameters for the bits of optimizer state you have. So in the running example that I just made up, GPU 0 updates two parameters, GPU 1 updates two parameters, and so on, and then they communicate the updated parameters to maintain synchronization. More questions about this? [00:23:32] Okay, so let's keep going. So far we have looked at three MPI operations: we looked at all-gather, we looked at reduce-scatter, and we looked at all-reduce. And it turns out that all-reduce is actually equivalent to running a reduce-scatter followed by an all-gather operation.
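The equivalence is easy to check numerically in a toy setting (two simulated GPUs, four parameters, sum as the reduction):

```python
# Numerically check: all-reduce == reduce-scatter followed by all-gather.
def all_reduce(grads):
    total = [sum(g[i] for g in grads) for i in range(len(grads[0]))]
    return [list(total) for _ in grads]           # everyone gets the full sum

def reduce_scatter(grads):
    n_gpus, chunk = len(grads), len(grads[0]) // len(grads)
    return [[sum(g[i] for g in grads) for i in range(k * chunk, (k + 1) * chunk)]
            for k in range(n_gpus)]               # GPU k gets summed chunk k

def all_gather(shards):
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]           # everyone gets the concatenation

grads = [[1, 2, 3, 4], [5, 6, 7, 8]]              # 2 simulated GPUs, 4 parameters
print(reduce_scatter(grads))                      # [[6, 8], [10, 12]]
assert all_reduce(grads) == all_gather(reduce_scatter(grads))
```

This is why sharding the optimizer state costs no extra communication: the one all-reduce DDP already performs is simply split into its two halves, with the local shard update happening in between.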
Recall that for DDP, all we had to do was this one all-reduce operation, and we computed its communication overhead. It turns out that when you do this optimizer-state sharding, you pay exactly the same amount of communication overhead, precisely because an all-reduce is equivalent to a reduce-scatter followed by an all-gather. So we basically saved memory for free, and you should just always use this: you get memory savings and no additional communication overhead.

[00:24:40] Okay, so we're happy, we saved memory, and now we want to shard even more things. Let's start doing ZeRO stage two: along with sharding the stuff in green, which was my optimizer state, I'm also going to shard the gradients. And now this is going to be a
little bit more complex, because we still need the full gradient for each worker's data slice, but each GPU only has enough memory to instantiate the gradient for a small subset of the parameters. So how are we going to deal with that?

We're actually never going to instantiate the full gradient vector. Whenever a GPU computes a gradient in the backward pass, it instantiates a buffer temporarily for the parameters it just got a gradient for, sends it to the right worker, and then destroys the memory it just created. That's the sketch; let's go through it step by step.

[00:25:42] So we have four workers. Each worker performs a backward pass, and the backward pass happens layer by layer. Recall the lecture on autodiff: you have the loss, and then
you have this backward pass where, layer by layer, you compute gradients. Now let's say you're at layer j: you take the upstream gradient and compute the gradient for the parameters at layer j. Immediately, the moment you compute those gradients, you send them to the right worker. There exists some worker that is responsible for layer j, and every GPU that has just computed the gradient of layer j for its data slice sends it to that worker. The moment you've done that, you deallocate the memory you just created. This is technically a fourth MPI operation, but it's really not very different from a reduce-scatter; it's just a reduce: four GPUs have a gradient, and they communicate it to whoever is responsible for maintaining the gradient for that
layer.

So there exists some worker responsible for a given layer, and it updates its parameter shard using the full gradient it received via this communication, along with its optimizer-state shard. Then, at the end, to synchronize everything, you perform an all-gather as before. Any questions about this high-level sketch?

[00:27:34] Okay, let's keep moving. Recall that ZeRO stage one was basically free, because an all-reduce is equivalent to a reduce-scatter plus an all-gather. We're doing essentially the same thing here, a reduce followed by an all-gather, so this is practically also free. We've gotten away with saving memory without any communication overhead compared to DDP so far.
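The per-layer flow above can be sketched in a few lines of pure Python (a hypothetical setup: four workers, one layer owned by each worker, made-up gradient values; the "reduce" is modeled as the owner accumulating a sum):

```python
# ZeRO stage-2 backward pass, sketched: the full gradient is never
# materialized; each layer's gradient lives in a temporary buffer, is
# reduced to the worker that owns that layer, and is then freed.

NUM_WORKERS = 4
LAYERS = [0, 1, 2, 3]
OWNER = {layer: layer % NUM_WORKERS for layer in LAYERS}  # who keeps each grad
LR = 0.1

params = {layer: 1.0 for layer in LAYERS}   # stage 2 still replicates params

def local_layer_grad(worker, layer):
    """Stand-in for backprop on this worker's data slice (made-up values)."""
    return 0.5 * (worker + 1)

owned_grad = {}
for layer in reversed(LAYERS):              # backward pass, layer by layer
    for worker in range(NUM_WORKERS):
        g = local_layer_grad(worker, layer)                  # temporary buffer
        owned_grad[layer] = owned_grad.get(layer, 0.0) + g   # "reduce" to owner
        del g                                # sender frees the buffer right away

# the owner of each layer applies the update to its shard (averaging over
# the data-parallel workers), and an all-gather would then resync everyone
for layer, g in owned_grad.items():
    params[layer] -= LR * g / NUM_WORKERS
```

The point of the `del` is the memory profile: at any moment a worker holds gradient buffers for only one layer, not the whole model.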
So let's keep going and try to shard even more things. I think someone in the audience alluded to this early on: what happens if you shard even your model parameters? Say you run into a situation where, forget about the optimizer state, even your model wouldn't fit on a single GPU. In that case you split your model up across all the different GPUs, so you shard the model parameters, the stuff in blue. The caveat is that now we're not going to get the memory savings for free; there's going to be some communication overhead.

[00:28:47] This is ZeRO stage three, the final stage of ZeRO, also known as FSDP, fully sharded data parallel, for anyone who's heard that term before. Here's the high-level sketch, and I feel like this is
kind of the easiest to understand compared to ZeRO stages one and two, just because there needs to be communication at every step of the way; you can't get away without communicating. The first thing we do is take our model and convert the entire model into FSDP units. Here's a sketch: a simple deep neural network, converted into multiple FSDP units, three of them here. An FSDP unit is just a data structure; we've not done anything so far. Then I take an FSDP unit and convert it into another data structure called a flat parameter, and assign a subset of those parameters to every single GPU. So here we have 16 GPUs and a flat parameter consisting of 14 parameters plus some extra padding so that things divide properly, and I'm
going to assign each parameter to a distinct GPU. That's basically just a complex way of saying that we created some data structures and divided the model parameters up across the GPUs, so every GPU gets a subset of the model parameters.

[00:30:26] Now let's think about what the forward pass looks like. There's no GPU that has the full set of parameters. So you're running a forward pass and, let's say, you're at layer 4: no GPU has all of layer 4, so you have to communicate. We need an all-gather, the operation we used to accumulate pieces that live on multiple GPUs so that every GPU ends up with the full thing. So you perform an all-gather to get all the pieces of layer 4, and you run the forward pass. And now you don't need layer 4 anymore, so you now
discard those gathered parameter shards.

[00:31:07] Then you have to run your backward pass: you've computed your loss, and now you do the backward pass. Again, say you're back at layer 4 with your upstream gradient. You don't have layer 4, so you need another all-gather to get all of its parameters. Then you run the backward pass for layer 4 and compute the gradient for your subset of parameters. Recall that every GPU has different data points, so there's going to be a different gradient on every GPU. So for layer 4 you do an all-gather to get all the parameters, compute the gradient, and, since every GPU has different gradients, you then do a reduce-scatter so that the full, summed gradient for each chunk of layer 4 lands on the GPU responsible for that chunk.
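Here is an illustrative pure-Python walk-through of that gather-use-discard pattern (all names, sizes, and gradient values are made up; a real FSDP run does this with NCCL collectives over flat parameters):

```python
# FSDP forward/backward sketch: each GPU permanently owns one shard of every
# layer; the full layer exists only transiently, between an all-gather and
# the discard that follows it.

NUM_GPUS = 4
LAYERS = ["layer1", "layer2", "layer3", "layer4"]

# each GPU owns one element of every layer's flat parameter
shards = {L: [float(i) for i in range(NUM_GPUS)] for L in LAYERS}

def all_gather(layer):
    """Materialize the full layer from all GPUs' shards."""
    return list(shards[layer])

def reduce_scatter(per_gpu_grads):
    """Sum the per-GPU gradients elementwise; GPU i keeps only chunk i."""
    summed = [sum(col) for col in zip(*per_gpu_grads)]
    return [summed[i:i + 1] for i in range(NUM_GPUS)]

# forward pass: gather each layer, run it, immediately free the gather
for layer in LAYERS:
    full = all_gather(layer)   # every GPU briefly holds the whole layer
    del full                   # ...run the layer, then discard the copies

# backward pass: gather again, compute per-GPU grads, reduce-scatter them
for layer in reversed(LAYERS):
    full = all_gather(layer)
    per_gpu_grads = [[0.1 * (g + 1)] * NUM_GPUS for g in range(NUM_GPUS)]
    grad_shards = reduce_scatter(per_gpu_grads)  # owner gets its chunk's sum
    del full
```

Each GPU then updates only its own shard with the gradient chunk it received, which is the step the lecture describes next.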
Okay, so that's basically full FSDP. Once you've run the forward and backward pass, each GPU updates its own parameter shard using the full gradient it just received, and then you synchronize.

[00:32:23] Right, so let's do a quick review of everything we've looked at so far. There was DDP, where you don't shard anything: you have the full model, the full gradients, and the full optimizer state on every single GPU, and all you divide up is the data set. So you have a big data set of 1,000 examples, and every GPU gets 250 examples. Then you compute a forward and a backward pass; every GPU has a different gradient, you need to communicate that gradient, and then you synchronize. That was called an all-reduce operation in MPI terms. And then we looked at
ZeRO, where now we want to save memory: we don't want the full memory requirements of the model, the gradients, and the optimizer state on every single GPU. In ZeRO stage one we sharded the optimizer state, so that you don't have to maintain the full optimizer state on every GPU; you break it up across all the GPUs you have. And we saw that the communication overhead of maintaining synchronization in ZeRO stage one boiled down to just doing an all-reduce, via the identity that says an all-reduce is a reduce-scatter plus an all-gather. So we save memory for free with ZeRO stages one and two, and you should just do it. Then with ZeRO stage three things got a little more complex, because you have to divide up your model parameters, the optimizer state, and the
gradients. So while you're running your forward pass you have to do some communication to get the full parameters for any given layer, layer 4 in our example; you also have to do an all-gather in the backward pass to get the full parameters again, and then a reduce-scatter so that the full gradient for each chunk of the parameters goes to the right GPU. Overall that's two all-gathers plus a reduce-scatter, so that's more overhead than stages one and two. But if you don't have enough GPU VRAM to even load your model onto a GPU, this is what you have to do. Any questions about the MPI primitives, the stages of ZeRO, or FSDP?
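The overhead comparison in this recap can be made concrete. Under the standard ring-collective cost model (not stated explicitly in the lecture, so take it as an assumption), an all-gather and a reduce-scatter each move about (N-1)/N times the parameter count per GPU, and an all-reduce is the two combined:

```python
# Communication volume per GPU per step, counting collective operations as
# multiples of one ring all-gather/reduce-scatter pass. Illustrative only.

def comm_per_gpu(n_gpus, n_params, strategy):
    unit = (n_gpus - 1) / n_gpus * n_params  # one all-gather or reduce-scatter
    ops = {
        "ddp":   2,  # one all-reduce = reduce-scatter + all-gather
        "zero1": 2,  # reduce-scatter + all-gather: same as DDP
        "zero2": 2,  # per-layer reduces + all-gather: same total
        "zero3": 3,  # two all-gathers + one reduce-scatter
    }
    return ops[strategy] * unit

P, N = 7_000_000_000, 8
assert comm_per_gpu(N, P, "zero1") == comm_per_gpu(N, P, "ddp")
assert comm_per_gpu(N, P, "zero3") == 1.5 * comm_per_gpu(N, P, "ddp")
```

So stages one and two match DDP exactly, while stage three costs roughly 1.5x the communication, which is the "lot more overhead" the recap refers to.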
[00:34:59] Okay, cool. So I'm going to fix the lie I told earlier about the GPU VRAM calculation. I said that VRAM holds just the model parameters, the gradients, and the optimizer state, but there's one final thing: the model activations. We've all seen that as you keep increasing the batch size, there's a point where the GPU says it can't fit any more, and that's because you also need to store the model activations for the backward pass. That scales linearly with the batch size: the larger the batch size, the more activations need to be stored. By the way, if you're doing mixed precision these are in fp16 or bf16, but they still scale with the batch size. So that's the other thing you have to think about, and none of the techniques we've looked at so far help with sharding the model activations.
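A back-of-the-envelope version of the corrected VRAM accounting might look like this; the per-parameter byte counts (fp16 weights and gradients, fp32 master weights plus two optimizer moments) and the activation size per example are assumptions for illustration, not numbers given in the lecture:

```python
# Rough VRAM estimate for mixed-precision training with an Adam-style
# optimizer: parameters + gradients + optimizer state + activations.
# Only the activations term grows with the batch size.

def vram_gib(n_params, batch_size, act_bytes_per_example):
    weights = 2 * n_params      # fp16/bf16 working copy of the model
    grads = 2 * n_params        # fp16/bf16 gradients
    optimizer = 12 * n_params   # fp32 master weights + two Adam moments
    activations = batch_size * act_bytes_per_example  # linear in batch size
    return (weights + grads + optimizer + activations) / 2**30

# a hypothetical 7B-parameter model with ~2 GiB of activations per example:
base = vram_gib(7e9, 0, 0)                 # fixed cost before any activations
with_batch = vram_gib(7e9, 4, 2 * 2**30)   # adds 8 GiB for a batch of 4
```

The fixed cost already lands around 104 GiB here, which is why the sharding techniques above matter before you ever touch the batch size.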
[00:36:05] Okay, so we've looked at a bunch of the basics of multi-GPU training and floating point, and it all boils down to a very simple flowchart which you can use for your final projects when you're fine-tuning models. The first thing: always use mixed-precision training. You barely ever see a hit in performance, and by performance I mean generalization, F1, accuracy. And if you're using the newer architectures, the H100s, the A100s, or the A6000s, always use bfloat16; it's just better, and you can check support with the torch command on the slide. So: always use mixed-precision training. Now ask yourself this question: does batch size one fit on a single GPU? If it fits, try a larger batch size; batch size one is too small. So try a larger batch size and/or use ZeRO stage two. ZeRO stage two is free, so just use
ZeRO stage two and increase your batch size if you can. If you can't fit even batch size one, then you have to see whether ZeRO stage three fixes your out-of-memory issues, because now you're going to shard the model parameters as well. And all of this is in the context of full fine-tuning, where I'm fine-tuning all of my model parameters.

[00:37:35] Sometimes the answer to that question is also no: you can't full fine-tune your model on your four A100s or A6000s or whatever; you've tried ZeRO stage three, you've tried mixed-precision training, you have a batch size of one, maybe you did gradient checkpointing (activation checkpointing), and nothing works. So now you basically can't do full fine-tuning, and the thing to do is to try parameter-efficient fine-tuning, which is going to give you a lot more memory savings.
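The flowchart can be written down as a small helper; the function and its inputs are hypothetical (you'd determine the two booleans empirically on your own hardware), but the branching mirrors the decision procedure just described:

```python
# Fine-tuning strategy flowchart: mixed precision always, then escalate
# through ZeRO stages, falling back to parameter-efficient fine-tuning.

def pick_strategy(fits_batch_one, fits_with_zero3):
    plan = ["mixed-precision training (bf16 on newer GPUs)"]  # always step one
    if fits_batch_one:
        plan.append("increase batch size and/or ZeRO stage 2 (free)")
    elif fits_with_zero3:
        plan.append("ZeRO stage 3 / FSDP (extra communication cost)")
    else:
        plan.append("parameter-efficient fine-tuning")
    return plan
```

Note that ZeRO stage 2 appears on the happy path precisely because, as shown earlier, it saves memory at no extra communication cost.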
[00:38:12] Okay, so let's talk about parameter-efficient fine-tuning. Why is it called parameter-efficient? In full fine-tuning you run a forward pass and a backward pass and update every single model parameter; in parameter-efficient fine-tuning you only update a small subset of the full set of parameters. And why would you want to do that? Maybe you're in a setting where you cannot full fine-tune even with batch size one; you've tried all the tricks possible and it just won't fit, so you have to do parameter-efficient fine-tuning. The other possible reason is slightly more scientific: these models are heavily overparameterized these days, and you have a small data set, and you believe that if you do parameter-efficient fine-tuning
then you can get a better like [00:39:16] uh then you can get a better like generalization [00:39:17] generalization okay um or you believe that you know [00:39:21] okay um or you believe that you know it's going to match for fine tuning okay [00:39:24] it's going to match for fine tuning okay uh sort of a second reason for wanting [00:39:26] uh sort of a second reason for wanting to do f [00:39:28] to do f adaptation um so uh the plot on the [00:39:32] adaptation um so uh the plot on the right here shows uh in in red it's sort [00:39:35] right here shows uh in in red it's sort of the estimated growth uh in training [00:39:38] of the estimated growth uh in training compute for training the largest AI [00:39:40] compute for training the largest AI models and uh the line in blue is the [00:39:44] models and uh the line in blue is the global compute capacity so very soon we [00:39:47] global compute capacity so very soon we are going to overshoot the global [00:39:49] are going to overshoot the global compute capacity and going to need a lot [00:39:51] compute capacity and going to need a lot more compute than you know the global [00:39:54] more compute than you know the global capacity and so this is kind of not [00:39:56] capacity and so this is kind of not sustainable [00:39:58] sustainable um and you know there's there there are [00:40:00] um and you know there's there there are arguments to be made about how uh if we [00:40:03] arguments to be made about how uh if we keep going down this route then you know [00:40:06] keep going down this route then you know AI development becomes concentrated in [00:40:09] AI development becomes concentrated in only the hands of a few well-funded [00:40:11] only the hands of a few well-funded organizations and you know as students [00:40:13] organizations and you know as students we can't do it um and so that's a [00:40:18] we can't do it um and so that's a problem and then also like if there's [00:40:20] problem and then also 
only a small number of players training and fine-tuning models, they may bias the models in specific ways that reflect their value systems and not the broader public's. So that's another reason to think about efficient adaptation. There's also this paradigm in machine learning in general, and in NLP specifically, of focusing a lot on accuracy instead of efficiency. The plot on the right shows the percentage of papers whose main contribution is a method that produces more accurate models, versus methods that achieve the same accuracy more efficiently. We can see that for most conferences the vast majority of papers are about accuracy; there are very few papers about efficiency. So maybe this is leading to a kind of monoculture, and maybe that's why we
[00:41:25] So maybe this is leading to a kind of monoculture, and maybe that's why we want to focus on efficiency. The second, maybe bigger, concern is that there's a huge hidden environmental cost to training and fine-tuning large language models. I was just reading some report which said that the cost of training GPT-3 was equivalent to 1.1 million tons of carbon emissions, or some such number, and they estimated that's the cost of running a coal power plant for 10 hours straight. And for an example closer to home: in the reinforcement learning class there was a homework assignment, and a lot of students implemented one or two common algorithms that outperformed everything else but used a lot more power.
[00:42:27] And someone did the calculation that if everyone had used the more efficient algorithm, it would have reduced the power consumption of the class by about 880 kilowatt-hours, which is what an American household uses in a month. So these are all reasons to think about efficiency and how you can fine-tune your models with fewer resources. Okay, so let's jump back into parameter-efficient fine-tuning, and let's start by recapping what full fine-tuning is. Any questions so far about any of this? Okay, so let's recap full fine-tuning. Let's say we have some large pre-trained autoregressive language model, say a GPT, and maybe we want to use it for summarization,
[00:43:31] maybe we want it for semantic parsing, so converting natural language to SQL commands, or maybe we want it to answer questions about paragraphs. What do we do? We collect a dataset of (x, y) pairs and then we do full fine-tuning: we update all of the model parameters based on the gradient of some loss function. And maybe that's not feasible. GPT-3 has 175 billion parameters, so there are just a lot more parameters to learn, and even once you've done full fine-tuning you have to store all of the parameters; if you're doing several tasks, you have to store parameters for every task. So can we do better? The main idea is that instead of updating all of the parameters, I'm going to update a much smaller number of parameters.
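As a toy illustration of the full fine-tuning recap above (my own sketch, not the lecture's code), here every parameter of a tiny linear model is updated from the gradient of a mean squared-error loss over a batch of (x, y) pairs:

```python
# Toy full fine-tuning step: every parameter of y = w*x + b is updated
# from the gradient of a mean squared-error loss on (x, y) pairs.

def full_finetune_step(params, batch, lr=0.1):
    """One gradient step that updates *all* parameters, as in full fine-tuning."""
    grads = {"w": 0.0, "b": 0.0}
    for x, y in batch:
        err = params["w"] * x + params["b"] - y
        grads["w"] += 2 * err * x / len(batch)
        grads["b"] += 2 * err / len(batch)
    return {k: params[k] - lr * grads[k] for k in params}  # every parameter moves

params = {"w": 0.0, "b": 0.0}
for _ in range(300):
    params = full_finetune_step(params, [(1.0, 3.0), (2.0, 5.0)])
# params approaches w = 2, b = 1 (the line through both points)
```

Full fine-tuning of a real LM is this same loop with billions of parameters, which is exactly what makes the compute and per-task storage costs bite.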
[00:44:39] Then, instead of finding a delta theta that is the same size as the entire set of parameters, I have to search over a much smaller space. And the added benefit is that I can store this much smaller delta pretty easily on disk, hopefully it's going to require less compute, and hopefully it's going to generalize almost as well as full fine-tuning. There are many different ways of operationalizing this high-level idea of parameter-efficient fine-tuning. The one I'm going to talk about today is LoRA, which stands for low-rank adaptation. It basically comes from the observation that when you fine-tune big language models, if you look at the geometric structure of the gradients, they tend to have a low intrinsic rank. Do people remember rank and SVD? All right.
[00:45:51] So these gradients tend to have a low intrinsic rank, and what the authors realized is that instead of fine-tuning the entire set of parameters, you could instead fine-tune a much smaller, say rank-r, matrix for every full-rank matrix that exists in the model. So let's say we have some pre-trained weight matrix W0, a d x k matrix. Instead of applying some kind of arbitrary update, I'm going to make sure the update has the following form: it's the product of two low-rank matrices B and A, where A is an r x k matrix and B is a d x r matrix, and the rank r is much, much smaller than both the incoming dimension and the outgoing dimension. And the term alpha you can think of as
[00:47:05] a trade-off between the knowledge that's already stored in the pre-trained model and the additional knowledge that you want to add into the model. If alpha is zero, you're not doing anything; if alpha is something really small, you don't want to change your model parameters all that much and you only want to add some small amount of task-specific knowledge. And additionally, the only trainable parameters here are going to be A and B. The other thing to note is that since I'm representing updates as this product B times A, as I increase r this converges towards full fine-tuning, so you essentially have a slider you can use to control how much fine-tuning you want to do. And then the other important thing is inference latency.
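To make the shapes and the savings concrete, here's a small pure-Python sketch (toy dimensions of my choosing) of the update delta_W = alpha * (B @ A). B is initialized to zero, which is how LoRA is typically initialized, so the update starts at exactly zero:

```python
# Shapes of the LoRA update delta_W = alpha * (B @ A):
# A is r x k, B is d x r, and r << min(d, k).

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, k, r, alpha = 8, 8, 2, 1.0
A = [[1.0] * k for _ in range(r)]   # r x k, trainable
B = [[0.0] * r for _ in range(d)]   # d x r, trainable, zero-initialized
delta_W = [[alpha * v for v in row] for row in matmul(B, A)]  # d x k, all zeros at init

full_params = d * k        # what full fine-tuning would train here: 64
lora_params = r * (d + k)  # what LoRA trains here: 32
```

With these toy dimensions the savings is only 2x, but at realistic sizes (say d = k = 4096, r = 8) r(d + k) is roughly 65 thousand parameters against nearly 17 million for the full matrix.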
[00:48:07] What you can do is just store these learned matrices for every task, and whenever you switch to a different task, you remove the extra term you added to every matrix for the old task and add in the task-specific terms for the new task you want to run inference on. And the cost of storing these much smaller matrices is way lower than storing the full delta. We'll see where you should apply LoRA, but generally you want to apply it to the weight matrices in self-attention. In code it actually looks fairly simple: when you're running the regular forward pass, you compute the hidden state as, let's say, the product of the matrix and the incoming feature vector.
[00:49:09] Now with LoRA, what you're going to do is freeze your model parameters, compute h as before, and then add this additional offset term, and that's the only thing that's going to be trainable. And that's pretty much all you have to do; you do it for every single weight matrix in every single layer. A question: there's an alpha term in the second-to-last line; where do you define alpha, or do you just put it somewhere? So yes, you define alpha somewhere. If you set it to one, that's like saying I want an equal trade-off between pre-trained knowledge and the new task-specific knowledge. Typically people set it to one.
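The forward pass just described can be sketched in a few lines of toy pure-Python (illustrative, not the lecture's actual code): h comes from the frozen weight W0 as usual, and a trainable low-rank offset alpha * B @ (A @ x) is added on top.

```python
# LoRA forward pass sketch: h = W0 @ x (frozen) + alpha * B @ (A @ x) (trainable).

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha=1.0):
    h = matvec(W0, x)                  # frozen pre-trained path
    offset = matvec(B, matvec(A, x))   # trainable low-rank path (only A, B learn)
    return [hi + alpha * oi for hi, oi in zip(h, offset)]

W0 = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 frozen weight (toy identity)
A = [[1.0, 1.0]]               # r = 1: a 1x2 matrix
B = [[0.0], [0.0]]             # 2x1, zero-initialized
```

With B at its zero init, `lora_forward(W0, A, B, x)` returns exactly `matvec(W0, x)`, i.e. the base model's output; training then only ever moves A and B.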
[00:49:59] You could set it to something larger than one if you believe your task is something the pre-trained model has no idea about, or something smaller than one if you don't want to change the model too much. Okay, so that's basically LoRA. In practice, as I said, there are a bunch of different parameter-efficient fine-tuning methods, and I'm not even going to name all of these: there are adapters, which some of you might have heard about; there's BitFit, which is not shown here; and there are lots of others, like p-tuning. But it turns out that compared to a lot of these different methods, LoRA is pretty high-performing on a bunch of different tasks for these relatively smaller models. And then if we try to fine-tune some of the bigger models like GPT-3 and compare with other parameter-efficient fine-tuning methods: full fine-tuning
[00:51:06] is at the very top; then we have BitFit, where you only fine-tune the bias terms, and adapters. Compared to those, LoRA firstly requires a lot fewer additional parameters that you need to store, and it gives you a good trade-off in accuracy compared to full fine-tuning; sometimes there's even a regularizing effect from fine-tuning only a small subset of your model parameters. Okay, so the question is: you can apply LoRA to every matrix, and I said you want to apply it to the various learned weight matrices inside self-attention, so which parameters do you want to apply LoRA to? Generally the rule of thumb is: apply it to the matrix that takes your hidden state and converts it into queries, and to the matrix that converts your hidden state into values. Apply LoRA to
[00:52:11] those, and that's pretty much going to give you the best performance overall. The other hyperparameter for LoRA is the optimal rank. Recall that the two matrices B and A are both low-rank; it turns out that already with a really small rank you can get pretty high performance, and this rank is much, much smaller than the hidden-state dimensions of most of the matrices in most models these days. All right, so we covered a bunch of things: we talked about floating points and mixed-precision training, multi-GPU training, DDP, FSDP, and LoRA. It all boils down to a very simple flowchart that you can just use for your project, so if you were sleeping through the entire lecture, maybe now is the time to wake up and look at this flowchart. Always use mixed-precision training. If you
[00:53:16] have the newer Ampere architectures, use bfloat16. Try with batch size one; if batch size one fits, try a larger batch size, and then always just use ZeRO stage 2. If batch size one doesn't fit, try ZeRO stage 3, and maybe try gradient checkpointing, that is, activation checkpointing. Sorry, there's a question: this is assuming we have more than one GPU, because it doesn't help us otherwise? Oh yes, all of this applies only if you have more than one GPU. If you have a single GPU, you have to do other things: maybe heavily quantize the model, and even then I don't think you can fine-tune some of the bigger models. So, assuming you have multiple GPUs, you can try ZeRO stage 3 if you have out-of-memory errors with a batch size of one, and if that doesn't work you can try LoRA.
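The flowchart being described can be sketched as a small decision function. This is my own paraphrase of the spoken advice; the strings are just labels, not a real API:

```python
# Decision sketch of the lecture's fine-tuning flowchart (labels, not an API).

def finetuning_recipe(ampere_gpu, batch_size_one_fits, larger_batch_fits):
    # Always use mixed precision; bfloat16 on the newer Ampere GPUs.
    steps = ["mixed precision (bf16)" if ampere_gpu else "mixed precision (fp16)"]
    if batch_size_one_fits:
        steps.append("increase batch size" if larger_batch_fits else "keep batch size 1")
        steps.append("ZeRO stage 2")
    else:
        steps.append("ZeRO stage 3")
        steps.append("gradient/activation checkpointing")
        steps.append("LoRA on q and v matrices, r=8, alpha=1")
    return steps
```

For example, `finetuning_recipe(True, False, False)` walks the out-of-memory branch all the way down to LoRA.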
alpha the rank and what uh weight [00:54:20] the alpha the rank and what uh weight matrices to apply Lura to apply that to [00:54:23] matrices to apply Lura to apply that to the query Matrix apply that to the value [00:54:25] the query Matrix apply that to the value Matrix set rank to eight okay that's a [00:54:28] Matrix set rank to eight okay that's a good starting point set Alpha to one [00:54:30] good starting point set Alpha to one okay just do that and you should be good [00:54:32] okay just do that and you should be good to go okay so you can find tun your [00:54:34] to go okay so you can find tun your models and things should be reasonably [00:54:37] models and things should be reasonably good okay so uh I'm going to end now [00:54:41] good okay so uh I'm going to end now unless there's [00:54:43] unless there's questions um oh there's one question in [00:54:46] questions um oh there's one question in the [00:54:47] the back [00:54:49] back diamides I was wondering if you just [00:54:51] diamides I was wondering if you just like go back to it and walk through it a [00:54:53] like go back to it and walk through it a little bit on step uh sorry on slide 48 [00:55:03] yeah this diagram from the last right uh [00:55:07] yeah this diagram from the last right uh okay um so let's go through this diagram [00:55:10] okay um so let's go through this diagram so basically uh what this diagram shows [00:55:13] so basically uh what this diagram shows is how the communication overhead is [00:55:16] is how the communication overhead is really not that bad if you have a fairly [00:55:19] really not that bad if you have a fairly big model such that the the time it [00:55:22] big model such that the the time it takes to do a forward pass you can [00:55:23] takes to do a forward pass you can already sort of prefetch uh all of the [00:55:26] already sort of prefetch uh all of the parameters for the next layer okay so [00:55:28] parameters for the next layer okay so that's 
[00:55:29] pretty much the idea. That's kind of a standard idea that I guess everyone should already be using, and PyTorch does this by default, by the way: you want to make sure that you fully saturate your GPU, and that you overlay communication with any additional compute you're doing. That's pretty much what's going on here, but let's go through it step by step. The starting point here is FSDP units: 0, 1, and 2 are different FSDP units. You start by wanting to run a forward pass on the first layer, but you don't have the first layer. Let's say you are GPU k; you don't have the first layer, so you have to do an all-gather to get all of the parameters for the first layer, and that's AG0. At the end of AG0, every GPU has
[00:56:31] the full set of parameters for the layers corresponding to FSDP unit 0. Let's just say that's layer zero. So you have the full parameters for layer zero, and you run a forward pass; that's the stuff in blue. And while you're running the forward pass through the first layer, you're going to be smart about communication overheads: while that runs, you prefetch the parameters for the next FSDP unit. Let's say layer two is a different FSDP unit; that's AG1. And so you can see that there's a little bit of overlap between F0 and AG1. At the end of getting all of the parameters for layer one, you're going to do a forward pass, and so on, and then you're going to do AG2. And at the same time, now let's say you just have way too
[00:57:29] many parameters on your GPU, so you're going to free up some memory; that's the stuff in yellow. And so that's how it goes: you basically overlay all-gather operations with the forward pass, and that's how you run the forward pass. So the communication overhead is really not that bad if you have a really big, deep neural network, assuming that you have sharded everything properly. And then you start the backward pass. The backward pass, I guess, is a little bit tricky, because you want to do these all-gather operations to get the full gradient. So let's say it's a 10-layer neural network: to compute the full gradient at layer 10, you need to do an all-gather operation to get
[00:58:25] all of the parameters at layer 10, and then you have to do a reduce-scatter. So you have four GPUs; every one of them has the full set of parameters at layer 10, but they have different gradients, so they have to merge their gradients and then split them up to the right GPUs, and that's the reduce-scatter. But that's not too bad, because you can still overlay reduce-scatter operations with the backward pass, and that's what you see happening in the backward pass there. And then, along with these forward and backward passes, at regular intervals you have to make sure that you free up GPU memory. For example, once you've run a forward pass through layer one and you're on to layer two, you don't need anything in layer one, so you just free up the memory in layer one.
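The forward-pass interleaving just walked through can be sketched as a toy event order (illustrative only, not FSDP's real scheduler): the all-gather for the next unit is issued before the current unit's forward compute so that communication overlaps compute, and earlier units' shards are freed as you go, with unit 0 kept resident as the lecture notes.

```python
# Toy event-order sketch of FSDP forward-pass overlap (not the real scheduler).

def fsdp_forward_schedule(num_units):
    events = ["AG0"]                       # must gather unit 0's shards before any compute
    for i in range(num_units):
        if i + 1 < num_units:
            events.append(f"AG{i + 1}")    # issue prefetch for the next unit...
        events.append(f"FWD{i}")           # ...so it overlaps this forward pass
        if i > 1:
            events.append(f"FREE{i - 1}")  # done with an earlier unit's shards (unit 0 stays)
    return events
```

For three units this yields AG0, AG1, FWD0, AG2, FWD1, FWD2, FREE1, mirroring the diagram's interleaving of all-gathers, forwards, and frees.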
That's pretty much the idea behind this diagram. [00:59:16] There are a few details here. One detail is that in FSDP, unit zero is treated differently: you'll see that unit zero is never freed up. That's just an implementation detail in FSDP. I'll quickly say one more thing about FSDP and then take a question.
[00:59:34] The presentation here makes it seem so simple, as if it can be applied to any neural network, but it turns out that's not the full picture. You need to divide up your neural network into FSDP units, and depending on what policy you use for dividing up your parameters into FSDP units, there are different communication overheads. For example, it makes sense to have multiple consecutive layers in the same FSDP unit, and so on. [01:00:17] This is very architecture-specific, so when you start to use this in PyTorch, you'll see that the FSDP wrapper requires a sharding policy, and that policy is very architecture-specific. Because everyone uses Transformers now, there are very handcrafted, fine-tuned policies for creating FSDP units and sharding strategies for Transformers. But let's say that for your final project you came up with a new architecture, sub-quadratic attention, whatever: maybe it's not going to be as efficient, just because you don't have the right sharding policy. So that's one detail about FSDP that you may want to keep in mind. [01:01:02] Okay, you have a question? "Just a clarification: you mentioned that you can throw away the weights you don't need after each layer's forward pass, but then
when you compute the backward pass, do you stream them back in each time, or do you cache some, cache recent ones? Is there any caching going on, or do you throw them all away and stream them all back?" [01:01:23] So there might be some caching in the system, but the idea is that you just throw them away, or at least to the user it seems like you've thrown it all away in terms of GPU RAM utilization, and then we stream them in again for each layer. [01:01:43] And that's why it's important to shard it properly. For example, if every consecutive layer is sharded such that it's on multiple GPUs, then you are always communicating, as opposed to doing one all-gather and then having all of the next three layers already loaded in. So that's why how you shard, and this sharding policy, becomes important.
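The trade-off behind the unit policy can be made concrete with a toy calculation (purely illustrative, not the PyTorch wrapper API): grouping more consecutive layers into one FSDP unit means fewer all-gather calls per pass, but a larger peak of materialized parameters while that unit is resident.

```python
def unit_costs(layer_sizes, layers_per_unit):
    """Group consecutive layers into FSDP units and report
    (all-gathers per forward pass, peak params materialized at once)."""
    units = [layer_sizes[i:i + layers_per_unit]
             for i in range(0, len(layer_sizes), layers_per_unit)]
    num_all_gathers = len(units)       # one all-gather per unit
    peak = max(sum(u) for u in units)  # a whole unit is resident at once
    return num_all_gathers, peak

layers = [4, 4, 4, 4, 4, 4]  # toy per-layer parameter counts (say, millions)

print(unit_costs(layers, 1))  # one layer per unit: 6 all-gathers, peak 4
print(unit_costs(layers, 3))  # three layers per unit: 2 all-gathers, peak 12
```

With one layer per unit you communicate constantly; with three layers per unit you all-gather once and the next layers are already loaded in, at the cost of triple the peak memory. Picking that balance for a new architecture is exactly what the handcrafted Transformer policies do for you.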
[01:02:16] Okay, so if there are no more questions, let's end early. Thank you so much. [Applause]

================================================================================ LECTURE 014 ================================================================================
Stanford CS224N: NLP w/ DL| Spring 2024 | Lecture 13 - Brain-Computer Interfaces, Chaofei Fan
Source: https://www.youtube.com/watch?v=tfVgHsKpRC8
---
Transcript

[00:00:05] So thanks, Chaofei, for coming; I know it's a really busy time in the quarter, and everyone's busy with the homework, the project, and midterms. Yeah, today I'm going to tell you about something I'm really passionate about, which is the speech brain-computer interface, my research. But before that, some self-introduction: I'm Chaofei, from the Stanford NPTL lab. Our lab is trying to build brain-computer interfaces to help people restore communication or restore movement. So today I'm really just going to tell you guys how cool this brain-computer interface is, given that we
have so many recent developments in AI and machine learning, and I hope you guys will enjoy this talk.
[00:01:04] All right, so let me first start with a video to give you some motivation for why we want to build a brain-computer interface. Yeah, I think what the story tells is that we saw this teenager, Howard, who was 21 at the time this video was shot, and he lost all his dreams because of a severe stroke that also left him in this kind of locked-in state where he can't move. He talked about how he used to go out and play football, make friends, and just let his emotions out; I think all of this is lost to him, and the most important thing is that he couldn't really speak, to express himself, to let all the emotions out. [00:02:00] Howard is just one of those individuals who suffer from a neurological disease or disorder, such as brainstem stroke or ALS, that can cause severe speech and motor impairment and even complete loss of speech. Life is really challenging for these individuals. Just think about it: you cannot speak, you cannot move; you still have a fully functioning brain, but everything is lost, and all your dreams could be shattered.
[00:02:32] So for people like Howard, as you just saw in the video, the way they can still communicate with the outside world, with their loved ones, is through assistive communication devices such as the one we just saw in the video: a kind of letter board that has the letters organized physically. People like Howard may still have some residual eye movement, so they can use their gaze to tell a friend where they're looking, and the friend can use the gaze to tell what letter they're trying to say. Just imagine how slow this process is: if you want to say a sentence, it might take you a few minutes to express simple things like "how are you" or "I'm not feeling comfortable today." [00:03:20] An alternative here is an eye-tracking device, so that people can use eye tracking to type on a virtual keyboard on the computer. But just think about it: if you have to look at the computer screen all the time, all day, it's really tiring. And these people are not like us; even if they still have some residual eye movement, it's very hard for them to move their eyes, so it's very tiring as well.
[00:03:54] Maybe something different here, which some of you may have already seen recently, is some videos published by a company called Neuralink. For example, here's one video;
let me see if I can play it; hopefully I can. [00:04:13] All right, so this company, Neuralink, is developing a kind of tiny implantable device that can actually be placed inside your skull and then read the brain signals. [00:04:31] The hope here is that, because for people like Howard the brain is still fully functioning, maybe by using this kind of direct interface with the brain they can still use their intact brain to control a computer, or even robots, to help them live a normal life. And here is a quote from their participant, Noland, who is pretty excited about being able to use this very state-of-the-art BCI to connect with his family and to be able to support himself. What I'm trying to say here is that for people like Howard, for a lot of people who have lost control of their body and language, I think BCI can bring hope. [00:05:26] That's what I'm going to motivate today: we're trying to use BCI to really help these people. But before going into the details of how this works, I first want to go through a brief history of the brain-computer interface, just to help you understand how this thing works: why we can put such tiny devices into the brain and then suddenly interpret what the brain is doing. There are a lot of interesting stories here, so let me start with a brief history of BCI.
[00:05:57] First, back in the 19th century, a British scientist called Richard Caton started to do some experiments on animals, and one of the things he found is that you can actually measure brain
activity; you can measure electrical signals from the brain. Moreover, if you let the animal do some task, say moving its head, then you can see that the electrical signals change somehow. I think these are the very first early experiments showing that you can actually decode some signals from the brain, even though we still didn't know exactly what those electric signals mean.
[00:06:46] Fast forward to 1924: a German scientist called Hans Berger invented this device called the... yeah, I always forget how to read that word, but anyway, it's short for EEG, the electroencephalograph. It's basically, on the right you can see it, a kind of electrode that you can place on the outside, basically on your scalp, and then measure these wave-like signals. [00:07:17] What Berger found is, first, that you can actually measure these wave-like signals just from electrodes placed on the head, and then he found that these signals have very different wave frequencies depending on the state of the patient. For example, if the patient is in a very calm state, the brain generates these slow alpha waves, around 10 Hz or so; I forget the exact range. But if the patient opens their eyes and is doing some cognitive task, then you'll see really sharp beta waves. So he was the first scientist to discover that you can actually use this kind of electrode to measure the electrical signals of the brain.
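The alpha-versus-beta distinction is, at heart, a frequency-band measurement. A toy sketch of that idea (synthetic signals, approximate band edges of roughly 8-13 Hz for alpha and 13-30 Hz for beta; not a clinical method) could look like:

```python
import numpy as np

FS = 250  # sampling rate in Hz (a typical EEG rate)

def band_power(signal, lo, hi, fs=FS):
    """Total spectral power of `signal` between lo and hi Hz."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    return spectrum[(freqs >= lo) & (freqs < hi)].sum()

def dominant_state(signal):
    # Approximate band edges: alpha ~8-13 Hz, beta ~13-30 Hz.
    alpha = band_power(signal, 8, 13)
    beta = band_power(signal, 13, 30)
    return "calm (alpha)" if alpha > beta else "task (beta)"

t = np.arange(0, 2, 1.0 / FS)      # two seconds of samples
calm = np.sin(2 * np.pi * 10 * t)  # fake 10 Hz "alpha" rhythm
task = np.sin(2 * np.pi * 20 * t)  # fake 20 Hz "beta" rhythm

print(dominant_state(calm))  # calm (alpha)
print(dominant_state(task))  # task (beta)
```

Real scalp EEG is far noisier than these pure sine waves, which is exactly the resolution problem discussed a little later in the lecture.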
[00:08:12] There's also a funny story here. Berger used to be a soldier, and one day he was training on a horse, fell from the horse, and suffered a concussion. He also had a twin sister, and the story is that on that same day his sister felt that something was wrong and started to worry about her brother, so she had their father send a telegram asking whether her brother was okay. This really intrigued Berger: maybe there is something called telepathy that can connect two people through some kind of brain wave. That was his motivation to start studying psychology and neuroscience, and to invent the EEG, which we are still using today to diagnose things like epilepsy.
[00:09:13] Then people started using these EEG devices to perform: since we can somehow detect these wave-like signals from the brain, and we can also control the frequency of the waves, a musician here started using an EEG device to perform music. Anyway, I guess you already got the idea: someone is trying to perform music with his brain waves. I think this is a really cool experiment; it was done, I think, in the 1950s, and you can already see that people were starting to get the idea that you can actually bypass your body, directly connect your brain to some external device, and control that device. [00:10:03] So the idea here is: what if we can also
leverage the same idea, but to help people like Howard? You could maybe help them control a robotic arm. [00:10:15] But the problem with this kind of EEG, or any external measuring device, is that the signal you get is very weak. Just think about it: you probably know that the brain has a lot of neurons, and the neurons are generating a lot of signals. If you just put some electrodes on the scalp, what you are actually measuring is the average firing of maybe millions of neurons. As an analogy, it's like trying to hear what people are saying in the room next to us: what we hear is the mumbling of a lot of voices, and we can probably tell that maybe they're in a happy mood, or maybe they have reached a conclusion, but not exactly what they are trying to say. [00:11:10] So the limitation here is that this kind of EEG device can only give us a very low-precision, low-resolution signal. We want to get a better signal, and I think the answer is to go inside the brain, put electrodes next to a neuron, and try to directly measure the neural activity of those neurons.
[00:11:34] For the purposes of this talk, we are mostly going to focus on the neurons in a region of the brain called the motor cortex. As some of you may already know, the brain has different regions doing different tasks, and in the center of the brain there is the motor cortex, which basically controls all your muscles, your body muscles.
The hope here is that if we can understand the neural coding, the information that is encoded by the neurons here, then perhaps we can decode this information and use it to help people like Howard to be able to control an external arm, or to be able to speak again.
[00:12:21] So here is some very basic neuroscience. We know that there is a kind of cell called the neuron; each one of these things is a neuron. This is the body of the neuron, called the soma, and this is the axon; this is another neuron. Neurons connect through a tiny structure called a synapse. If a neuron wants to transfer some information to another neuron, just as in an artificial neural network, where you have some neurons and want to send information to the next layer, the neuron will generate an action potential, which is just some electricity, to signal to another neuron that there is some information there. [00:13:11] If you put a tiny electrode on, say, the axon of this neuron and measure the membrane potential, what you will get is something like this: on the x-axis is time, on the y-axis is the measured electric potential, and you will see these very sharp spikes. If you zoom in on the spikes, you will see the typical firing signature of the neuron, where the voltage suddenly goes up and then comes back down. So basically, what you can measure at the neuron is these very sharp spikes; that's what you get by putting an electrode next to a neuron. [00:13:59] Okay, so how do we figure out what kind of information is encoded in what we call a spike
[00:14:10] We can perform some behavioral tasks. For example, suppose we're listening to a single neuron, and let's say we're using a monkey for this experiment. We train the monkey to do two things: we instruct it to move its hand either to the left or to the right, and then we measure the spiking of that single neuron and try to work out what kind of information it encodes. What you see here is that each row is a spike train of that neuron; as you just saw, each vertical line is a spike. Each row is a trial, and a trial means the monkey is trying to move its hand in one direction.
[00:15:10] You can see that the neuron seems to fire slightly differently across trials. That's one fundamental property of neurons: they are very noisy. It's not like an artificial neural network, where if you put something in you always get the same thing out; in a real neural network things are really noisy, so sometimes the neuron fires a little faster and sometimes a little slower under the same experimental conditions.
[00:15:40] What we're trying to measure here is what kind of information this neuron encodes when the monkey moves its limb to the left or to the right. We can also split the encoding into two phases: preparation and execution. Execution means the monkey is actually moving its arm, whereas preparation means the monkey is preparing to move but holding its arm fixed; it actually moves at this "go" time here.
[00:16:25] What you can see is that this neuron seems to fire a lot during execution when the monkey's hand is moving to the right, and it also fires a little more when the monkey is preparing to move to the left. So maybe the neuron is encoding movement direction.
[00:16:52] If you repeat this experiment for many different neurons and many different directions, what scientists eventually found is that for a single neuron, if you fit its firing rate (basically how many spikes it generates every second) against the different movement directions, you can fit a cosine tuning curve to it.
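Before getting to the tuning curve itself, the trial-to-trial noisiness mentioned above is easy to see in simulation: repeating identical "trials" of a noisy model neuron gives a different spike count every time, but trial averaging still reveals a direction preference. The 40 Hz and 10 Hz rates here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def per_trial_rates(true_rate_hz, n_trials, duration_s=1.0, dt=0.001):
    """Firing rate measured on repeated trials of one simulated neuron;
    identical conditions still yield different counts (noise)."""
    n_bins = int(duration_s / dt)
    spikes = rng.random((n_trials, n_bins)) < true_rate_hz * dt
    return spikes.sum(axis=1) / duration_s

# invented rates: this model neuron fires more for rightward movement
move_right = per_trial_rates(40.0, n_trials=20)
move_left = per_trial_rates(10.0, n_trials=20)
print("rightward trials range from", move_right.min(), "to", move_right.max())
print("trial-averaged rates:", move_right.mean(), "vs", move_left.mean())
```

Each individual rightward trial comes out different, yet the trial averages separate cleanly, which is exactly why the experiments described here repeat each condition many times.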
[00:17:17] What this tuning curve means is: on the y-axis is the firing rate, and on the horizontal axis is the movement direction. This neuron prefers to fire the most when the movement is, say, 180 degrees relative to some reference, and the firing gradually goes down from there. That's the first thing scientists found about how a single neuron encodes movement information.
[00:17:44] If you measure multiple neurons, you find that each neuron can encode quite different information. For example, this green neuron's tuning curve is slightly shifted to the right, and its magnitude is shifted down, so its preferred direction is around maybe 250 degrees.
[00:18:00] Now, with two neurons you can actually decode the intended movement direction. With a single neuron, suppose I measure a firing rate of around 30 spikes per second; there could then be two movement directions, 120 and 240 degrees. But with a second neuron we can eliminate one. Suppose we measure the second neuron at around five spikes per second; then we can pinpoint that the movement direction is actually 120 rather than the other one.
[00:18:42] However, we know that neurons are noisy, so we can't really tell the movement direction exactly using two neurons. For example, in the third panel here, suppose the ground truth, the actual firing rates, are those gray lines, but due to noise the measured rates are slightly shifted to the dashed lines. You can see that originally we could decode the movement direction as 120.
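Before dealing with noise, here is a numerical sketch of the noiseless two-neuron disambiguation described a moment ago. The tuning parameters are invented so that 30 Hz is ambiguous between exactly 120 and 240 degrees, matching the lecture's example; neuron 2's parameters are likewise made up:

```python
import numpy as np

def cosine_tuning(direction_deg, baseline, amplitude, preferred_deg):
    """Firing rate peaks at the preferred direction and falls off
    as the cosine of the angular difference."""
    return baseline + amplitude * np.cos(np.deg2rad(direction_deg - preferred_deg))

directions = np.arange(0, 360, 5)
# parameters chosen so a 30 Hz reading is ambiguous between 120 and 240
rate1 = cosine_tuning(directions, baseline=20, amplitude=20, preferred_deg=180)
rate2 = cosine_tuning(directions, baseline=15, amplitude=10, preferred_deg=250)

true_dir = 120.0
measured = np.array([cosine_tuning(true_dir, 20, 20, 180),
                     cosine_tuning(true_dir, 15, 10, 250)])

# neuron 1 alone: two directions produce the same rate
candidates = directions[np.isclose(rate1, measured[0], atol=0.5)]
# both neurons: a least-squares match over the grid pins it down
err = (rate1 - measured[0])**2 + (rate2 - measured[1])**2
print("candidates from neuron 1 alone:", candidates)      # [120 240]
print("decoded with both neurons:", directions[np.argmin(err)])  # 120
```

The second neuron's different preferred direction breaks the symmetry of the first neuron's cosine curve, which is the whole point of recording more than one cell.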
[00:19:18] In that noisy case, though, there are now four possibilities; we cannot uniquely determine the direction. Still, you can see that it's probably more likely that the direction the monkey is trying to move is around 120, rather than the one around 50 or the one greater than 240. So how do we deal with this kind of noise in neurons? How can we still accurately decode the intended movement from these multi-neuron recordings?
[00:19:59] I think we can use machine learning here. We can treat this as a classification problem. In this plot, each dot is a firing-rate combination of the two neurons, and the color represents the intended movement direction. If you train a machine learning classifier, you can draw decision boundaries: if a new measurement's firing rates fall into this region on the right, then we know the monkey is probably trying to move in the left direction.
[00:20:49] Okay, so to recap: we can do this kind of single-neuron measurement, we can measure the firing rates of multiple neurons, and then, by training a machine learning model on the neural data, we can infer the likely movement direction. This is how we're going to build up to actually building a brain-computer interface. Any questions?
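The classification idea above can be sketched with about the simplest possible decoder, nearest centroid. The cluster centers, noise level, and class labels are all invented for illustration; the lecture does not specify which classifier is used:

```python
import numpy as np

rng = np.random.default_rng(2)

# invented training data: firing rates of two neurons over many trials,
# labeled with the intended movement direction (0=right, 1=left, 2=up)
centers = {0: (40.0, 10.0), 1: (10.0, 40.0), 2: (25.0, 25.0)}
X, y = [], []
for label, center in centers.items():
    X.append(rng.normal(loc=center, scale=4.0, size=(50, 2)))
    y += [label] * 50
X, y = np.vstack(X), np.array(y)

# nearest-centroid classifier: its decision boundaries are the
# perpendicular bisectors between class centroids
centroids = np.array([X[y == k].mean(axis=0) for k in range(3)])

def decode(rates):
    """Return the class whose centroid is closest to the measured rates."""
    return int(np.argmin(np.linalg.norm(centroids - np.asarray(rates), axis=1)))

print(decode([38.0, 12.0]))  # lands in the "right" cluster -> 0
```

A new measurement is assigned to whichever colored cloud it falls into, which is exactly the decision-boundary picture described above.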
[00:21:14] [Student] For all this data, you mentioned neuron one as a very specific number; how do you pinpoint which neuron to start measuring?
[00:21:21] [Instructor] Yeah, so here "neuron one" rests on an assumption: we assume each tiny electrode is measuring exactly one neuron, and that the electrode stays fixed, always measuring the firing of that same neuron. In the real case it's not always like that, because the brain is a soft structure, so implanted electrodes can move a little and end up measuring different neurons. That's one of the challenging problems of BCI: how to deal with that kind of recording change.
[00:22:04] All right, let's go back. So now we know that we can put electrodes into the brain, into the motor cortex, measure some signals, understand how the neurons encode those signals, and build a machine learning decoder to decode them. So we basically have the methods to build a brain-computer interface that can interpret what a still fully functioning brain is trying to do.
[00:22:36] One more thing: how can we record these signals? This is a very complicated figure, but don't worry about all the details. What I'm trying to show is that there are a lot of different technologies you can use to record brain signals, and you can think about them in a two-dimensional way: the y-axis is the spatial resolution.
[00:23:17] The higher up you go on the y-axis, the larger the region of the brain you're measuring: really high up means you can only measure, say, the average activity of a very large brain area, whereas going down the y-axis means you can measure at very fine grain, such as single neurons. The horizontal axis is the temporal resolution. With a technology like single-neuron recording, you can measure the electric potential of a single neuron at essentially every time point, for example every millisecond. With a recording technology such as fMRI, which measures the blood flow in a small brain region, you can only measure the blood-flow changes in that area about every 0.5 or 1 second.
[00:24:30] That is really an average of a lot of information, because neurons fire really fast: the electric potential change of a neuron happens on the order of one millisecond. If you can only measure at around one-second resolution, you are averaging and smoothing out a lot of information. So ideally we want something with both high spatial resolution and high temporal resolution.
[00:25:03] What we use right now, in a lot of the clinical trials in our lab, is this kind of multi-electrode array.
[00:25:20] Each electrode here is like a tiny needle that can measure maybe the signal of a few neurons, and you put these needles into a tiny square about the size of a fingernail. You can then measure maybe on the order of hundreds of neurons. All right, so now we have devices to measure neurons; let's go through a more concrete example of how we do this.
[00:25:55] Suppose someone has, say, a spinal cord injury and has lost the connection to his body, but his mind is still fully functioning. The question is: what kind of information can we still decode from his motor cortex, so that we can use it to control either his own arm or an artificial arm?
[00:26:29] What we're going to do is put these tiny micro-electrode arrays into his motor cortex, really penetrating into the surface of the motor cortex. Each electrode, as you see here, is a tiny needle, and those gray triangles are the size of a neuron, so each electrode is measuring the local field potential of multiple neurons around it.
[00:27:05] Then we can pass all this information in real time to a computer through this kind of wire. What we get on the computer is, for example, each block being the measurement from one electrode. And if we do some behavioral experiments, as we just showed, we can probably figure out the tuning curve for each electrode; for example, this one's preferred direction is probably to the left.
[00:27:51] So we can repeat the behavioral experiments for the other channels.
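One classic way to combine per-channel preferred directions like these into a single decoded direction, not necessarily the decoder used in the lab's work, is the population vector: weight each channel's preferred direction by how far its rate sits above baseline. A sketch with invented, noiseless tuning parameters:

```python
import numpy as np

# invented channels whose preferred directions evenly cover the circle
preferred = np.deg2rad(np.arange(0, 360, 45))
baseline, amplitude = 20.0, 15.0

def channel_rates(direction_rad):
    """Noiseless cosine-tuned rates every channel would report."""
    return baseline + amplitude * np.cos(direction_rad - preferred)

def population_vector(rates):
    """Sum each channel's preferred-direction unit vector, weighted by
    its rate above baseline; the angle of the sum is the decoded direction."""
    w = rates - baseline
    x = (w * np.cos(preferred)).sum()
    y = (w * np.sin(preferred)).sum()
    return np.rad2deg(np.arctan2(y, x)) % 360

decoded = population_vector(channel_rates(np.deg2rad(160.0)))
print(round(decoded, 1))  # 160.0: exact recovery with uniform, noiseless tuning
```

With uniformly spaced preferred directions and no noise the estimate is exact; with real, noisy, unevenly tuned channels it becomes biased, which is one reason trained ML decoders like the one described next are used instead.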
[00:28:01] We can then train an ML decoder to figure out what each channel is encoding, the preferred direction for each channel. Once we have the decoder trained, at test time we can ask our participant, who has the array implanted in his brain, to imagine moving his hand in some direction, and the decoder tries to figure out which direction he intends. That's the basic idea.
[00:28:32] Let me go to a demo. This is one of the research results coming out of our lab in 2017. Here you see a participant typing on a virtual keyboard with her mind, and the bottom shows the typing speed, measured as correct characters per minute. It peaks around 40, and on average it's maybe around 20.
[00:29:13] I think this is really amazing.
Think about people who used to have to communicate using this kind of letter board; now, with this brain-computer interface, she can fully communicate by herself through a computer. That's a huge improvement over the board.
[00:29:39] [Student] Does the person open her eyes or close them?
[Instructor] She keeps her eyes open.
[Student] So is there any eye tracking involved?
[Instructor] Not for this experiment.
[Student] So even if she closed her eyes, would it still work?
[Instructor] Yeah, it would still work, but she wouldn't have the visual feedback; she wouldn't know where she's typing.
[00:30:02] [Student] How about if she came up with a character in her mind, E or R, without looking at the keyboard?
[Instructor] Okay, that's something I'm going to show next.
[00:30:13] [Student] How do you know whether it's the person who mistyped or whether it's the machine that's not capturing the correct character?
[00:30:29] [Instructor] What do you mean by "correct"? Oh, I see; that's a good question, let me clarify. The task here, maybe it's not readable, but the task is basically that she is copying a sentence, so we know the ground truth and we can measure the error rate.
[00:30:51] [Student] How does the clicking motion, or the selection motion, work? Is it easy to distinguish? Is there a certain way of knowing the user is pressing down, or does she visualize something like a mouse?
[00:31:09] [Instructor] That's a really good question. As I just mentioned, we can decode movements and we can also decode different gestures: say, a hand gesture, or moving her elbow. So she can imagine different motor movements, and we can decode those movements and map them to, say, a click signal or other signals.
[00:31:34] [Student] What if the person looked at the keyboard, remembered it, and then closed her eyes; would it still work?
[00:31:49] [Instructor] I think that's hard; that's even hard for me to do, right? Can you remember the keyboard and then just control, say, a mouse?
[Student] I use a keyboard every day, so I definitely remember the layout in my mind, and I could just close my eyes.
[Instructor] But this is a virtual keyboard, not a physical keyboard, so you can't use your muscle memory. And maybe one thing I should clarify: the mental image for her is controlling something like a mouse. She's not doing touch typing; she is moving, say, a mouse cursor.
[00:32:33] Let's move on.
is basically just a showcase that, building on all the knowledge we have learned about the brain, we can decode attempted movements from people, like, I forget her name, but I think her code name is T6, and really help these people regain communication through this kind of BCI. [00:33:07] And as I mentioned earlier, you can also use a BCI to control robotic arms. For example, this is a participant at, I think, Caltech. [00:33:25] He's using his mind to control this robotic arm, which grabs him a drink. [Applause] All right. [00:34:12] So you can also do things like restoring writing abilities. I think someone just mentioned this just now: maybe we can try to restore different
modalities of communication. For example, just now we were using movements: by restoring movements, we can control a computer. But how about directly restoring the ability to do handwriting? Handwriting is a very natural way to communicate. [00:34:48] Frank Willett, a research scientist from our lab, published a paper in 2021 showing that you can actually build this kind of handwriting BCI, and he showed that it's really fast compared to the previous approach. [00:35:09] Okay, so now we have seen that there are different ways to restore communication. Here is a measurement of different ways of communicating. On the very left is the sip-and-puff interface, which is very slow. That's for someone who cannot really move but can still do some breathing; they can do a sip and a puff to say yes and no to communicate. That's really slow, maybe around five words per minute. For a normal person, I'm really surprised that on average a person can handwrite maybe 13 or 14 words per minute; that's really slow, but maybe that's just the average speed. On the very far right is natural communication, which can reach up to 150 to 160 words per minute. [00:35:58] To put everything into context: the 2D cursor BCI I just showed you can do about eight words per minute, and the handwriting BCI can do around 18 words per minute. So compared to, say, letter boards or this kind of eye tracking, we have really made a lot of advancement here, but it's still far from natural conversation speed.
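To put those rates side by side, here is the comparison as a quick numeric sketch. The figures are the approximate numbers quoted above, with midpoints taken where a range was given; the dictionary labels are mine, not terms from the talk.

```python
# Approximate communication rates quoted above, in words per minute (wpm).
rates_wpm = {
    "sip-and-puff": 5,
    "2D cursor BCI": 8,
    "able-bodied handwriting (average)": 13.5,
    "handwriting BCI": 18,
    "natural speech": 155,  # midpoint of the 150-160 wpm range
}

# Express each interface as a fraction of natural conversation speed.
for name, wpm in rates_wpm.items():
    frac = wpm / rates_wpm["natural speech"]
    print(f"{name:35s} {wpm:5.1f} wpm  ({frac:.0%} of natural speech)")
```

Even the fastest interface here, the handwriting BCI, sits at roughly 12% of natural conversation speed, which is the gap the rest of the talk is about closing.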
The next question, basically, is: how can we get there? Can we actually restore speech with a brain-computer interface? [00:36:47] To get there, I think there's a huge barrier. First, language processing in the brain is a really complicated process. For example, here are shown all the brain areas involved in language, and we still don't know exactly how this happens; this is just our best guess at how language is processed in the brain. On the very right, you see a lot of brain regions involved with knowledge and reasoning; in the center are areas involved with semantics and syntax; and on the very left is the perception of speech, and then the production of speech. Language is really complex. [00:37:33] So maybe the hope here is that we can start with the motor cortex, which handles the motor planning of language, because from the things I've just shown you, we already know how the motor cortex encodes movements. We also know that in order to produce language, we need to speak. So maybe we can put some electrodes into the part of the motor cortex that controls our orofacial muscles, try to decode some information there, and then see if we can actually restore speech. [00:38:04] But actually being able to restore speech is, I think, more complicated than restoring movements. What I'm trying to say is that the production of speech is really a lot of complicated movements, and it's really rapid; it's more than just moving your hand in a certain direction. So restoring speech is much harder than just decoding the movements of each articulator. So
instead of trying to decode the movements of each articulator, which is very hard (and for people who have lost speech, it's basically very hard to actually measure their speech articulator movements), maybe we can try to decode discrete phonemes instead of continuous speech articulation. We know that all languages can be decomposed into basic phonetic units. For English, for example, we know there are different vowels and different consonants, and they are correlated with how you place your tongue in your mouth and how you place your different speech articulators. So here, instead of decoding the actual articulator movements, we are trying to decode these discrete phonemic tokens. [00:39:28] And there is previous work showing that if you put some electrodes on the motor cortex, you can actually tell the differences between different phonemes by measuring the electrical activity in the motor cortex. So there is hope of being able to restore speech just by putting electrodes in the motor cortex. [00:39:55] And indeed, in 2021, researchers from UCSF actually demonstrated that it's feasible to build this kind of small-vocabulary speech BCI with ECoG recording technology. The difference between ECoG and the microelectrode arrays I just showed you is that whereas the microelectrode arrays actually penetrate into the cortex, ECoG stays on the cortex, so it doesn't record single-neuron firing but rather some averaged neural activity over a small region. So compared to microelectrode arrays, they have
a slightly lower resolution. That's why their prototype is this kind of small-vocabulary BCI, which can only decode 50 words at around maybe 75% accuracy. But this is still very exciting work that showcases that you can actually achieve this kind of speech decoding by putting some electrodes into the motor cortex. [00:40:58] All right, so now I'll go into the research done in our lab, which is to build a high-performance speech neuroprosthesis. [00:41:12] In 2022, we recruited a participant, code-named T12, who has ALS. T12 used to be a very active person; she likes to ride horses and likes to jog. But because of ALS, a couple of years ago she basically couldn't do all those things she used to enjoy. And unlike most ALS patients, her symptoms started with the orofacial
movements first, so she can still move her hands a little bit, but she cannot really speak intelligibly. So we decided to put four microelectrode arrays into her brain: two arrays into her motor cortex, and two arrays into the part of Broca's area that is supposed to be involved with language planning. The hope here is that we want to decode both the execution of speech, which is how you control your speech articulators, and also maybe some high-level planning of speech. That's why we wanted to put arrays into two different brain regions. [00:42:30] So the first thing we did after we put the arrays in her brain was some behavioral tests, to see what kind of information we can decode from those arrays. [00:42:44] Here is the first result we got: we are
trying to classify different tasks here. The first plot shows us using these four arrays to classify orofacial movements. The dashed line is the cue at which she actually executes those orofacial movements, and before the dashed line she is preparing to do those movements. You can see from these two lines that you can predict those movements well above chance using the two arrays in the motor cortex, whereas with the two arrays in Broca's area you basically can't predict much above chance, especially during the execution of those motor movements. [00:43:39] And for single phonemes, where we instruct our participant to speak single English phonemes, you can also predict those much higher above chance using the two arrays in the motor cortex, and likewise for single words. [00:43:57] So what these results tell us is that of the arrays we put into T12's brain, the two arrays in the motor cortex contain a lot of information about the phonemes being articulated and also the words being articulated, but the two arrays in Broca's area, which were supposed to help us figure out the planning of speech production, don't contain much information. That's really intriguing to us, and we're still trying to figure out why that's true. [00:44:26] So for the rest of this talk, we'll mostly be using only the two arrays in the motor cortex. [00:44:34] Now we know that there is phonetic information encoded in those two arrays; what we're going to do next is actually try to build a real-time speech-to-text BCI. So what we're going to do is,
let me just show you a video demo first, to get a sense of the BCI we're trying to build. [00:45:05] So here is our participant. She's connected to our decoding machines through this cable, which transmits her neural signals in real time to the decoding machine. On the screen you can see a sentence that we instructed her to copy, basically to read out. Once the square turns green, she will try to speak, and what you see below is what the machine decoded. [00:45:42] "I don't want to call for a babysitter." "That would be good." [00:46:02] "I did well in school." "I don't see much pollution." [00:46:20] All right, so that's almost perfect decoding from her, and you can tell from the video that although she can vocalize, it's not really intelligible, because of her limited orofacial muscle movements; but we can still decode from her brain signals what she is trying to say. [00:46:37] The task I just showed was copying a sentence; in this next video she is trying to answer a question. [00:46:59] "I have a very good friend and sister." [00:47:02] We also tried different modalities. When she attempts to articulate, it's actually very tiring for her to produce those sounds, so what we tried here is instructing her only to move her mouth, to move her articulators, but not to vocalize. We call this silent speech, and we can still decode pretty well using this silent-speech modality. [00:47:40] "I do not have much to compare it to." [00:47:55] "I, as much as I would like to, either." [00:48:03] Okay, so let's move on to more technical details about how we built this speech BCI. As I just mentioned, the first
thing we need to do is build a decoder, and before building the decoder we need to do some data collection. Here is our research scientist Frank sitting next to T12, asking her to read the sentence on the screen, and we record her neural activity as she speaks that sentence. So we collect paired data, where the input is the neural activity and the output is the target sentence we want to decode. We basically have to go to where T12 lives, run data-collection sessions there, and then test the decoder. [00:48:53] The way we collect data is this: because we have very limited time and cannot ask T12 to speak a huge number of sentences, we divide data collection into a block structure, where we instruct her to speak 40 sentences every block, then she takes a break, then we collect another block. Data collection lasts about 100 minutes for every research session, and then we train a decoder; that takes maybe 10 to 20 minutes, it's really quick. After training a decoder, we start actually evaluating its performance, by asking our participant to speak some new sentences and seeing how well we can decode on that new set of sentences. [00:49:43] In total, we ran experiment sessions over maybe three months and collected about 10,000 sentences, drawn from the Switchboard telephone-conversation corpus; I really want to emphasize that we want to decode this kind of conversational English.
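Evaluation on those held-out sentences is typically scored as an error rate between the decoded text and the ground-truth sentence. A common choice is word error rate: word-level edit distance divided by reference length. The talk doesn't spell out the metric's implementation, so this is a minimal sketch with invented example sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# A perfect decode, then a decode that dropped the last two words.
print(word_error_rate("i did well in school", "i did well in school"))  # 0.0
print(word_error_rate("i do not have much to compare it to",
                      "i do not have much to compare"))
```

Note that WER can exceed 1.0 when the decoder inserts many extra words, which is why it is an error rate rather than an accuracy.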
Once we have the data, we can try to see how to design a decoder that best solves this task. [00:50:15] So let's first define the problem. We have some neural features as input: a time series, which you can think of as maybe similar to audio, where at each time point we get some feature vector. [00:50:32] The output of the decoder is a sequence of words; we know that she's trying to speak some sentence, so we are trying to decode the words from these input neural features. [00:50:49] As I mentioned earlier, instead of directly decoding words from the input, maybe we want an intermediate decoding target of phonemes. The reason is, first, we know there are only about 40 phonemes in English, which is a much smaller set than the number of words. If you want to train a decoder that directly decodes words, you need much more data to cover all the possible words, whereas for phonemes you probably need far less data to cover all 40 phonemes. So instead of directly decoding the words, we decided to decode an intermediate representation of phonemes from the input neural features. [00:51:29] Okay, so basically there are two decoders we want to design: the first is a neural-to-phoneme decoder, and the second is a phoneme-to-word decoder. Those are the two decoders we have to design for this task. Let's focus on the neural-to-phoneme decoder first. [00:51:47] I think at this point in the class, we probably know that we can treat this as a sequence-to-sequence problem: the input is a feature sequence and the output is a token sequence. And for sequence-to-sequence problems, we probably know that we can use encoder-decoder models to solve them. However, an encoder-decoder model is actually more powerful than we actually need here, because an encoder-decoder model
An encoder-decoder model allows arbitrary alignments between inputs and outputs. That's really helpful for tasks such as machine translation, where some languages have different word orders than other languages. But here we know that the alignment is monotonic, unlike machine translation, where the alignment can be arbitrary. Monotonic means that, for example, the first few neural features probably correspond to the first phonemes in the output sentence rather than the last ones.

To handle this monotonic alignment, we can borrow an idea that people developed for machine learning tasks such as handwriting recognition and speech recognition, where the task is also to decode a letter or phoneme sequence from, say, speech features. The technique we're going to use is called connectionist temporal classification, or CTC. For people who have taken CS224S, you probably already know what this means, but I'll give a brief introduction. I don't have too much time, so I'll go over it quickly. Given some input sequence, the goal of CTC is to decode an output sequence when we don't know the exact alignment between them, and usually the input and output have a length mismatch. For example, in speech recognition the input could be several thousand frames long.
Each frame corresponds to very fine temporal resolution, features recorded at, say, every 20 milliseconds, whereas the output has only a few tokens. That's a huge length mismatch. What we can do is still use an RNN or Transformer model to predict an output token at each time step, and then figure out a way to fill in spacers between the output tokens so that the output sequence has the same length as the input sequence. What the CTC loss does is introduce an additional blank token as output. With this blank token, here is an example output of the CTC classifier: first you merge repeated tokens, and then you take out the blank tokens, and what you get is a much shorter sequence that corresponds to the output. So the CTC loss lets you solve a sequence-to-sequence problem that has different input and output lengths and also has this monotonic alignment property. Let me skip the details of how you actually train with a CTC loss.

Now suppose we have this CTC loss to train our model with. The next question is what kind of neural network decoder to use for this task. At this point in the class I think most of you are convinced that Transformers are really powerful, so there's no reason for me to say more about that.
But in this case we don't want to use a Transformer. The reason is that we don't have a large dataset: as I mentioned, we only have about 10,000 sentences. Also, Transformers are really good at dealing with long-range dependencies, but for speech production there's no real need for long-range dependencies. So let's go back to the very simple RNN. We know that RNNs work on small datasets and handle short-range dependencies pretty well, and another nice thing about RNNs is that they're very efficient to run in real time: you can run even a fairly complicated RNN very efficiently on your mobile phone.

One of the most popular RNNs we've learned about is the LSTM. It uses a memory state to store long-range information, and then uses input, forget, and output gates to control how you read from and write to that memory state. But the LSTM is also quite complicated. There's a variant of the LSTM called the GRU, the gated recurrent unit. The idea is to combine the memory state and the hidden state into just one hidden state, and by doing that you can also remove some gates. So the GRU is a simpler version of the LSTM that works really well when you have a small dataset, and that's why we use a GRU instead of an LSTM for our task.

So now we have a neural network model to decode phonemes, and we know how to train it. At inference time, by which I mean testing time, you pass new neural activity into the decoder and it outputs phoneme probabilities.
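The GRU update just described, one shared state and fewer gates than an LSTM, can be sketched with plain numpy. This is a minimal sketch: biases are omitted, and the exact gate parameterization (including whether the final blend is written as `(1-z)*h + z*h_tilde` or the reverse) varies across references, so this is not necessarily the lecture's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step. A single hidden state h plays the role of
    both the LSTM's memory cell and its hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde         # blend old and candidate
```

Because there is only one state vector and two gates instead of three, a GRU has fewer parameters per hidden unit than an LSTM, which is one reason it tends to behave better on small datasets like 10,000 sentences.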
[00:58:24] decode out some like phum probabilities right so there's a maybe at the first [00:58:27] right so there's a maybe at the first time stamp the highest probability is I [00:58:30] time stamp the highest probability is I uh the problem here is that how do I [00:58:32] uh the problem here is that how do I figure out the most likely output [00:58:35] figure out the most likely output sequences giving this fun probabilties [00:58:37] sequences giving this fun probabilties right so basically the task is to find [00:58:39] right so basically the task is to find the most likely output sequences here [00:58:43] the most likely output sequences here um I think for this problem I think [00:58:47] um I think for this problem I think since we have already did something [00:58:49] since we have already did something similar in the assignment three which is [00:58:51] similar in the assignment three which is that we can use beam search to figure [00:58:53] that we can use beam search to figure out the most likely sequence here [00:58:55] out the most likely sequence here however this one caveat with the beam [00:58:57] however this one caveat with the beam search when you're applying it to the [00:59:00] search when you're applying it to the CTC LW but I'm not going to expand it [00:59:03] CTC LW but I'm not going to expand it too much here um yeah so let's just skip [00:59:07] too much here um yeah so let's just skip over [00:59:09] that now suppose that we can use the [00:59:12] that now suppose that we can use the beam search to find the most likely fum [00:59:15] beam search to find the most likely fum sequences how do we convert that fume [00:59:17] sequences how do we convert that fume sequences into words right so that's CU [00:59:19] sequences into words right so that's CU like we eventually want to decode uh a [00:59:21] like we eventually want to decode uh a sentences but not just like a a sequence [00:59:24] sentences but not just like a a sequence of vums so 
One thing you can do is modify the beam search: if you have an English pronunciation dictionary that maps each word to its pronunciation, then during beam search, whenever you decode a phoneme sequence that corresponds to a word, you can replace that phoneme sequence with the word.

However, you can actually do better by using a language model. Here's the decoding equation. The x is the input and Y is the decoded word sequence. Not all word sequences have the same likelihood: suppose I decode a sentence like "I can spoke". That doesn't seem syntactically correct. So we can use a language model to evaluate the probability of each decoded hypothesis, and use that as a sort of weight on the final decoding probabilities. We're adding this extra term, the probability of the sentence, which you can decompose into the probability of each token given its previous tokens, and you can measure that with any language model.

Now, another term we want to add is a word insertion bonus. One problem with this language model probability of a sentence is that longer sentences will have smaller probabilities than shorter ones; that's just a property of how the probability decomposes. So we want to balance the length of the decoded sequence by adding a word insertion bonus. What we eventually optimize is this equation.
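The equation on the slide isn't legible in the transcript, but the combination just described (acoustic score, language model weight, word insertion bonus) is conventionally written as follows; the weight symbols α and β are my labels, not necessarily the lecture's:

```latex
\hat{Y} = \arg\max_{Y}\;\Big[\,\log P(Y \mid X)
        \;+\; \alpha\,\log P_{\mathrm{LM}}(Y)
        \;+\; \beta\,|Y|\,\Big],
\qquad
P_{\mathrm{LM}}(Y) = \prod_{i} P\big(y_i \mid y_1,\dots,y_{i-1}\big)
```

Here X is the neural feature input, Y a candidate word sequence, α scales the language model term, and the β|Y| word insertion bonus counteracts the language model's built-in preference for shorter sentences.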
It combines the probabilities generated by the RNN decoder, a language model term with some weight, and the word insertion bonus, with weights you can tune.

Okay, let's try to put everything together. Suppose you have neural feature inputs: you get these neural features every 20 milliseconds, you pass them through the GRU, and now you have phoneme probabilities. This all happens in real time, so all computation needs to be done within 20 milliseconds. You do a really quick beam search and find that, say, this phoneme sequence corresponds to the word "I" or the word "eye". Here we want to use an n-gram language model instead of a more powerful Transformer language model, because we need to do a lot of evaluations really quickly, within 20 milliseconds. Suppose you have, say, 100 hypotheses and you want to evaluate the probability of all of them. With a Transformer language model such as GPT-3, which is really powerful, you can't do inference that fast. Whereas with an n-gram language model, you can just load everything into memory, and every evaluation is just a memory lookup, so it's really quick. After that you get probabilities out, and you keep, say, the top K hypotheses for the next step of the beam search. That's how we use the n-gram language model in real-time decoding. After that, we use a Transformer language model to rerank all the hypotheses generated with the n-gram language model. This happens once you've actually decoded an entire sentence.
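To make the "every evaluation is just a memory lookup" point concrete, here is a toy bigram scorer. The table, the probabilities, and the back-off constant are all made up for illustration; a real system would use a large smoothed n-gram model, but the scoring loop really is just dictionary lookups, which is what makes it feasible inside a 20 ms budget:

```python
import math

# Toy bigram table with precomputed log-probabilities (made-up values).
BIGRAM_LOGP = {
    ("<s>", "i"): math.log(0.20),
    ("i", "can"): math.log(0.10),
    ("can", "speak"): math.log(0.05),
    ("can", "spoke"): math.log(0.0001),
}
UNSEEN = math.log(1e-8)  # crude stand-in for proper back-off smoothing

def bigram_score(words):
    """Log-probability of a word sequence under the toy bigram model.
    Each step is a single dict lookup, so scoring is O(len(words))."""
    score = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        score += BIGRAM_LOGP.get((prev, cur), UNSEEN)
    return score
```

Under this model the grammatical "i can speak" outscores the lecture's example of an ungrammatical hypothesis, "i can spoke", which is exactly the signal the reranking weight exploits.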
Say I keep the most likely 100 sentences; at that point I can use the Transformer language model, which can evaluate the probabilities of just 100 hypotheses in maybe half a second, and get a better probability estimate for those sentences.

So, putting everything together, this is how the entire system from the video I showed you earlier works: we can now use this multi-stage machine learning model to accurately decode what the person is trying to say, and build a high-performance neural speech prosthesis. We're almost out of time, so I'll skip the evaluation part. Evaluation, meaning how we measure performance, is basically measured in word error rate. We also have all the data open as a competition.
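Word error rate, the metric just mentioned, is conventionally computed from word-level edit distance: substitutions, insertions, and deletions against a reference transcript, divided by the reference length. A minimal sketch (not the project's actual evaluation code):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / len(ref),
    via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # delete everything
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert everything
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

A 25% word error rate, the figure given later for this system, means roughly 25 of every 100 spoken words come out wrong; note WER can exceed 100% if the hypothesis contains many insertions.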
So if you're really curious about this, you can try playing around with it. I think the most exciting thing about doing this research is that you actually get to see how your research can impact people. This is a quote from our participant T12, and this is how she reacted when this thing first worked for her. It's really exciting that she can speak after so many years of silence.

Okay, so in the last five minutes let me go a little bit into what I think is the future of BCIs. What I've just shown you is that using BCIs we can help people either restore movement or restore communication. One exciting direction, I think, is this kind of multimodal BCI. Here's work published by a group at UCSF: they're trying to decode not only the phonemes but also the actual speech.
They also decode articulatory gestures, so you can actually drive 3D avatars. And as I just mentioned, the final goal of this speech BCI is to deploy it so people can use it every day, just as we use our phones. So here's a more recent development in speech BCI by our collaborators at UC Davis. What they do is put four arrays into the motor cortex, meaning they can get better signals than we do. For reference, because I forgot to mention it earlier, the final performance of our system is around 25% word error rate, meaning that for every 100 words the participant says, maybe 25 of them are wrong. In this latest work at UC Davis,
they show that you can actually get close to zero word error rate within a few sessions by training the system more and more continuously. So it's very close to being an actually usable system right now. Here's a video of their participant using the system to speak. It's very accurate; he's actually using the system every day now to communicate with his family. It really cannot be overstated how important that is.

All right, so the most exciting direction, at least the one happening in our lab, is that we're trying to restore more effortless and natural communication by decoding inner speech. With all the speech BCIs I've just shown you, I think the maximum speed we can reach is maybe 60 to 70 words per minute, but that's still
far slower than natural conversation, which happens at about 150 words per minute. One of the reasons is that these participants have lost speech for so many years that if we ask them to attempt to speak, it's very hard for them to speak at a normal rate. However, we know that a lot of people have this kind of inner speech; we're kind of talking to ourselves in our minds. I think the research question here is whether we can decode that sort of inner speech. Some preliminary work from a collaborator in our lab shows that you can actually do so. For example, her results show that if you decode attempted speech, which is what I just showed you, you can reach, say, 90% accuracy on a small set of words. But if you ask
the participant to imagine moving her mouth, or to imagine a voice in her head, you can still do pretty well. It's not as good as attempted speech, but still much better than chance. So I think this shows it's possible in the future to decode this sort of inner speech and fully restore natural communication to participants like T12.

But I think there's a more controversial issue with this kind of inner speech: what if you can decode something like private thoughts or private memories that someone doesn't want to express? That's a very difficult question. And also, as I just mentioned, not everyone has inner speech. When you think about it, speech is just one external representation of
[01:09:27] it's just a linear representation that you put out through this medium of speech, whereas your thoughts could be more complex, more multi-dimensional. So it's very hard to decide where you want to draw the line on how much of that inner content you want to decode. But I think these are also very exciting opportunities for us to learn more about speech processing in the brain. [01:09:51] And as I just mentioned, if you want to decode this kind of inner speech, then you also face a lot of new ethical questions that are really thought-provoking. For example, should we allow BCIs to read out memories? What if we decode something you don't want to say? How can we deal with that? On the other hand, what if we can actually use these systems to help
people who have lost their memories due to Alzheimer's disease? [01:10:26] Or what if we can read out some subconscious fear that can help people with their psychotherapy? How should we decide whether to allow this kind of inner-speech decoding, or memory decoding, or not? [01:10:41] And I think a deeper question is: what if one day we could do this kind of cognitive enhancement with BCI? For example, what if you can move a robotic arm much faster than your real arm? Is that allowed? Or can you actually purchase a memory so that you can skip this CS224n class? I think that's really a hard question to answer, but I just want to throw it out there, because it's not only a BCI problem; we're facing this problem right now. There are a lot of ways you can enhance
yourself. [01:11:19] So I guess what I'm trying to say here is that BCIs will raise a lot of new ethical questions. [01:11:25] I'm taking this quote from this textbook here, and what it's trying to say is that we're not really looking for an answer here; the point is that we want to keep this in discussion with scientists, with engineers, and with policy makers, to make sure that we can use BCIs to help the people who really need them, while being aware that there could be a lot of potential issues. [01:12:03] So just to give a summary: I hope I've convinced you that BCI is a really cool new research direction at the intersection of AI, machine learning, neuroscience, and neuroengineering. We'll soon have
[01:12:18] this kind of system that can really help people to be able to communicate again, and it's also a really cool opportunity for us to understand how the brain processes language. I think the most important thing is that we are bringing hope to people like Hardward and T12. [01:12:35] All right, thank you everyone, and special thanks to the people in my last

================================================================================ LECTURE 015 ================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty
Source: https://www.youtube.com/watch?v=I0tj4Y7xaOQ
---
Transcript

[00:00:06] Okay, let's just get started. Welcome to lecture 14, everyone. I hope you've been doing well and managing all of the various deadlines. Today we'll be looking at two interesting applications of language models. In the first half, I'll be talking about using language models to reason in domains like math and geometry, doing things like spatial reasoning, and
then, in the second half of the lecture, I'll be talking about how you can use language models to take actions in grounded environments. [00:00:47] A little bit of a disclaimer: a lot of today's content is research that was done in the last 3-4 years, so there are plenty of unanswered questions and not a lot of answers, and maybe we can have more of a discussion around these topics. [00:01:07] Okay, so let's get started with reasoning. Experts like to start a lecture on reasoning by talking about the various kinds of reasoning, so I'm going to do that here. At a high level, it's really about using facts and logic to arrive at an answer. More concretely, there are three distinct categories of reasoning that we can talk about. The first one, which is probably the one that most
of you are familiar with, is deductive reasoning, where we go from rules of logic along with a premise to a firm conclusion. [00:01:45] An example could be that we have the sentences "all mammals have kidneys" and "all whales are mammals", and then we can come up with the conclusion "all whales have kidneys"; and we could do multiple such steps of reasoning. [00:02:00] A second form of reasoning is inductive, where given observations we derive conclusions. Maybe we've learned from experience that every time we see a creature with wings, it is usually a bird; then say we observe a creature with wings, and using our experience we can come up with the conclusion that the creature is likely to be a bird. That form of reasoning is inductive. [00:02:32] And finally, we have abductive reasoning, where we're given an
observation and then start drawing possible explanations. [00:02:43] Maybe you see a car that cannot start, and there's a puddle of liquid under the engine, and you start drawing inferences about the situation; one of them could be that the car has a leak in the radiator. [00:03:00] Apart from that taxonomy, we can also think of reasoning in formal and informal terms, where formal reasoning involves using axioms and rules of formal logic to derive truth conditions. There's also informal reasoning, which is what you and I probably do every day: we just reason about everyday situations and use common sense to derive conclusions. For most of the lecture, when I say reasoning I will mean informal deductive reasoning, and it's often going to involve multiple steps. [00:03:34] Okay, so let's come back to language models.
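A quick aside: the multi-step deductive pattern described above (chaining "all whales are mammals" with "all mammals have kidneys" to get "all whales have kidneys") can be sketched as a tiny forward-chaining loop. This is my own toy illustration, not something from the lecture.

```python
# Toy forward chaining over "all X are Y" rules (illustrative only).
# We repeatedly compose rules until no new facts appear, mirroring
# "multiple such steps of reasoning".
rules = {("whale", "mammal"), ("mammal", "kidney-haver")}
facts = set(rules)
changed = True
while changed:
    changed = False
    for (a, b) in list(facts):
        for (c, d) in list(facts):
            if b == c and (a, d) not in facts:
                # all a are b, all b are d  =>  all a are d
                facts.add((a, d))
                changed = True
print(("whale", "kidney-haver") in facts)  # -> True
```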
[00:03:44] We've learned in lectures 9, 10, and 11 that large language models are really good at coming up with plausible continuations of text that reflect human preferences and constraints. Today we'll try to answer whether they can also reason. [00:04:04] One of the most basic ways we can try to answer this question is via prompting, and we've probably already seen this: there's this popular method called Chain of Thought prompting, where you get a language model to produce a reasoning step before producing an answer, and we can do this by providing some in-context examples with explicit reasoning steps that the language model can then mimic at test time. So that's Chain of Thought prompting. [00:04:34] Another rather surprising property of language models is that sometimes you don't even have to show them these in-context
examples; you can just prompt them with the sentence "let's think step by step" and get these reasoning rationales before they produce an answer. Okay, so that's pretty simple, but let's keep going. [00:05:01] Another popular way to prompt language models to do reasoning is via self-consistency. Here, instead of greedily sampling a rationale followed by an answer, we're going to sample multiple reasoning paths and correspondingly multiple answers. In the figure on the right, we have a question; what you would normally do with Chain of Thought prompting is greedily decode a rationale and then, conditioned on the rationale, generate an answer. With self-consistency, we sample multiple times: we sample multiple rationales, they all lead to multiple answers, and then we pick
the one that is the most common, the idea being that if an answer keeps appearing across multiple rationales, i.e. the majority of the rationales agree on it, then it's more likely to be correct. [00:05:57] The authors of self-consistency find that on a variety of mathematical reasoning tasks, adding this simple idea of sampling multiple times and doing majority voting improves performance pretty drastically over standard Chain of Thought. [00:06:17] Interestingly, when I saw this result the first time, I thought: okay, this is just like ensembling, which we learned in CS229. The idea there is that if you want to boost the performance of your system, you produce, say, 10 classifiers with different random seeds, each produces a classification decision, and you do majority voting.
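The self-consistency recipe just described is easy to sketch: collect several sampled (rationale, answer) pairs and majority-vote over the answers. The canned samples below stand in for real temperature-sampled chain-of-thought decodes from a model; they're my own toy data, not from the lecture.

```python
from collections import Counter

# Canned (rationale, answer) pairs standing in for several temperature-sampled
# chain-of-thought decodes from a language model (toy data, for illustration).
SAMPLES = [
    ("3 pairs of shoes, 2 shoes per pair: 3 * 2 = 6", "6"),
    ("She has 3 pairs, so 3 * 2 = 6 shoes", "6"),
    ("3 + 2 = 5 shoes", "5"),  # one faulty reasoning path
    ("Two shoes per pair and 3 pairs gives 6", "6"),
]

def self_consistency(samples):
    # The rationales themselves are discarded; we majority-vote
    # over the final answers they led to.
    answers = [answer for _rationale, answer in samples]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(SAMPLES))  # -> 6
```

Unlike classic ensembling, all the votes here come from one model; the diversity comes purely from sampling different reasoning paths.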
[00:06:38] But it turns out that it's doing maybe a little bit more than just simple ensembling: the authors also compared an ensembling approach where the same language model is used with multiple different prompts and you do majority voting there, and it turns out that self-consistency is better than that simple ensembling. [00:07:02] Okay, so earlier today I said that I'd be talking about multi-step reasoning. So far we've looked at math problems and prompting, but not necessarily multi-step reasoning. One of the main aspects of multi-step reasoning is that it involves breaking down a large problem into several subparts, answering each of the subparts, and then combining everything into a solution. This decomposition strategy was integrated into another prompting method called least-to-most
[00:07:36] prompting. The idea behind least-to-most prompting is that, given a question, we're going to first break it down into subquestions, as shown here; then, given these subquestions, the language model answers each of them, and conditioned on its answers to the subquestions it generates the final answer. [00:08:05] This is how it looks for a math reasoning problem. In standard Chain of Thought prompting, you would have a question followed by a rationale and the answer. With least-to-most prompting, which is this decomposition strategy, you take the question and then, instead of directly producing a rationale, you ask the language model to break it down, giving you two different subproblems.
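The two-stage pipeline just described (decompose into subquestions, answer them in order, condition the final answer on the earlier answers) can be sketched like this. `ask_lm` and the canned lookup table are my own stand-ins for real model calls, not anything from the lecture or the paper.

```python
# Sketch of least-to-most prompting; ask_lm is a stub answered from a
# lookup table, standing in for a real language-model call.
CANNED = {
    "Decompose: Amy climbs for 4 min and slides for 1 min. The slide closes in 15 min. How many slides?":
        ["How long does one trip take?", "How many trips fit in 15 minutes?"],
    "How long does one trip take?": "4 + 1 = 5 minutes",
    "How many trips fit in 15 minutes? (given: 4 + 1 = 5 minutes)": "15 / 5 = 3 trips",
}

def ask_lm(prompt):
    return CANNED[prompt]

def least_to_most(question):
    subqs = ask_lm("Decompose: " + question)   # stage 1: break the question down
    context = []
    for sq in subqs:                            # stage 2: answer subquestions in order,
        prompt = sq if not context else f"{sq} (given: {context[-1]})"
        context.append(ask_lm(prompt))          # feeding earlier answers into later prompts
    return context[-1]                          # final answer conditions on the sub-answers

q = "Amy climbs for 4 min and slides for 1 min. The slide closes in 15 min. How many slides?"
print(least_to_most(q))  # -> 15 / 5 = 3 trips
```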
[00:08:36] You then answer both of those subproblems and condition your final answer on the answers to them. Okay, so that's just a prompting method, but one interesting experiment from least-to-most prompting showed that you can sometimes generalize from a small number of reasoning steps to a much larger number of reasoning steps. Here, in this math word problem, there are two reasoning steps, and if we show this prompt to the language model as an in-context example, we see that it continues to generalize even on examples that required more than five steps of reasoning, in a way that's much better than standard Chain of Thought. [00:09:21] But it's not entirely clear if structuring inference in this manner is really fundamental: one of the other results they reported was that, with enough prompt engineering, the rows
corresponding to the best normal Chain of Thought are on par with least-to-most prompting. Still, it's an interesting idea: break problems down into subproblems, solve the subproblems, and then build up a solution based on your answers to them. [00:09:57] Okay, so all of this was different prompting methods to get reasoning behavior out of language models. Can we do something more? One thing we might be interested in is, instead of trying to get really large language models to do reasoning, somehow getting this kind of reasoning behavior into a smaller language model, and one popular approach for doing that is distillation, where maybe you want to fine-tune a smaller LLaMA model by teaching it to imitate a larger LLaMA model. And so that's what we're going
to look at now. [00:10:36] This model is called Orca, and at a high level, Orca fine-tunes a smaller 13-billion-parameter LLaMA language model on explanations produced by GPT-4. [00:10:52] Constructing this data is pretty simple; it has three steps. The first step is that we get a wide variety of instructions from the FLAN v2 collection. FLAN v2 is basically a dataset that accumulates multiple datasets into one collection, and it consists of instructions paired with questions and answers; I'll show an example of this in a moment. Then we're going to prompt GPT-4 or ChatGPT with these instructions along with a system message, and the objective of the system message is to get ChatGPT or GPT-4 to produce an informative explanation along with the answer. So here we have a question about simple data processing, about calculating the
median, and there's a system instruction that says: please justify your steps and answer step by step. In producing its output, the model provides a fairly detailed explanation of how it got to the answer, and what Orca is going to do is use precisely this explanation to fine-tune a much smaller model. [00:12:15] Once we have these explanations, we fine-tune a much smaller 13-billion-parameter LLaMA model on them. [00:12:24] Okay, so far we've looked at math reasoning and grade-school math problems. Let's turn to a different benchmark for reasoning: we're going to look at Big-Bench Hard, which is another dataset for multi-step reasoning. Let's look at some examples from Big-Bench Hard. It consists of multiple different
[00:12:54] subtasks, 23 in total, and I'm going to show a few examples. One of them is evaluating Boolean expressions, where the question is "True and False and not True and True is", so basically: evaluate this Boolean expression. With Chain of Thought, the model can evaluate each of the subexpressions and get to the final answer. [00:13:25] Another example of a task from Big-Bench Hard is date understanding (date, not data, understanding), where the question is: tomorrow is a given date; what is the date one year ago from today, in a given format? It's paired with some options, and again the model can think step by step, following basic Chain of Thought, and come up with an answer.
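Both of those tasks are easy to check mechanically, which is part of what makes them convenient benchmark items; here is a quick sketch of the two computations a correct rationale has to carry out (the concrete dates are my own example, not from the slide):

```python
from datetime import date, timedelta

# Boolean-expression task from above, broken down the way a step-by-step
# rationale would (eval is fine for this fixed toy string).
expr = "True and False and not True and True"
# step 1: not True -> False
# step 2: True and False -> False, so the whole conjunction is False
print(eval(expr))  # -> False

# Date-understanding task: "tomorrow is <date>; what was the date one year
# ago from today, in MM/DD/YYYY?"  The concrete date is made up.
tomorrow = date(2015, 1, 2)
today = tomorrow - timedelta(days=1)
one_year_ago = today.replace(year=today.year - 1)
print(one_year_ago.strftime("%m/%d/%Y"))  # -> 01/01/2014
```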
[00:13:56] So this is kind of the flavor of tasks in BIG-Bench: most of these involve multi-step reasoning, and they're fairly synthetic, but also reasonably hard for language models. Okay, another example is geometric shapes, and this one is pretty surprising, that language models can do anything here. You're given the SVG path element, and, I have no idea what this renders as, but the question is: just given the SVG, what shape are you going to get? And there are a bunch of options, and then again the model, prompted with "let's think step by step", will produce some answer. We don't know if it's correct, but it's going to produce some answer. And so it's basically this dataset covering different kinds of reasoning: spatial reasoning, date understanding, evaluating Booleans.
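For intuition about what the geometric-shapes task requires, here is a toy heuristic, not the benchmark's solver: it only handles straight-line `M`/`L` commands and counts the vertices of the polygon.

```python
import re

def count_sides(svg_path: str) -> int:
    # Collect the x,y vertices from an SVG path made of M/L commands.
    points = [tuple(map(float, pair.split(',')))
              for pair in re.findall(r'-?\d+(?:\.\d+)?,-?\d+(?:\.\d+)?', svg_path)]
    # A closed polygon repeats its first vertex at the end.
    if len(points) > 1 and points[0] == points[-1]:
        points.pop()
    return len(points)

# "M 31,29 L 34,76 L 82,16 L 31,29" closes on itself with
# three distinct vertices, so the shape is a triangle.
```

A real solver would also need curve commands (`C`, `A`) and checks for self-intersection; the point is just that the answer is recoverable from the path by step-by-step reasoning.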
[00:14:56] And it's multiple-choice, so it's easier to get an accuracy number. So it covers a wide variety of different tasks. On the left we have performance from really large language models; this is zero-shot Chain of Thought, with just the prompt "let's think step by step". GPT-4 has some potential contamination issues with BIG-Bench Hard, so maybe we can ignore that column. Vicuna, I think a few months ago, was state-of-the-art as an instruction-tuned LLaMA 13B model, and Orca is again a LLaMA 13B that's fine-tuned specifically on this explanation data, where you have instructions and then you have explanations from ChatGPT or GPT-4, and you fine-tune on that. And we see that overall it outperforms ChatGPT, maybe because it's specialized to just these [00:16:09] reasoning problems, and it outperforms Vicuna, which was not trained on these really extensive explanations. So that's one way you can get a smaller language model to display some kind of reasoning behavior.

Okay, so this was all great, and we're very happy that you can just generate rationales from a big LM and then fine-tune a smaller language model on that. But then someone could ask: why not just fine-tune the big language model on its own rationales? That's also been explored, and there's a bunch of different methods that do this. I'm going to talk about one of them, called Reinforced Self-Training, or ReST, and it's going to alternate between two stages. In the first stage, given a reasoning problem, and perhaps the prompt "let's think step by step", I'm going to have the language model generate multiple rationales, [00:17:05] and then I'm going to filter these rationales based on whether they give me the correct answer or not. So think about word algebra problems: someone has three apples, someone else has four apples. You generate a rationale and the answer comes out to be seven, you keep that rationale; the answer comes out to be 12, you leave that rationale out. And then I'm going to do an update step, where I take the rationales that I filtered in my first stage and fine-tune the language model on them. And then I can do this iteratively: now I have an updated language model, I can hopefully get better rationales, and then I can update the language model on those better rationales to get an even better language model, and I can keep doing that. Okay, and the results are promising, but here's what we find on GSM8K, which is this grade-school math dataset of algebraic word problems.
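The two-stage alternation just described can be sketched in a few lines. This is a schematic of the idea, not any particular implementation; `generate` and `fine_tune` are hypothetical stand-ins for sampling from and updating the model.

```python
def filter_rationales(samples, gold_answer):
    # Stage 1: keep only rationales whose final answer is correct.
    return [rationale for rationale, answer in samples if answer == gold_answer]

def rest_iteration(problems, generate, fine_tune):
    # One round of the alternation: sample, filter, then update the model.
    # `generate(problem)` returns (rationale, answer) pairs from the current
    # model; `fine_tune(data)` returns an updated model. Both are stand-ins.
    kept = []
    for problem, gold in problems:
        for rationale in filter_rationales(generate(problem), gold):
            kept.append((problem, rationale))
    return fine_tune(kept)
```

Iterating means calling `rest_iteration` again with the updated model behind `generate`, which is where the "hopefully better rationales each round" behavior comes from.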
[00:18:05] As you increase the number of iterations of self-training, we see a slight improvement in performance, and then it starts degrading. MATH is another dataset that again focuses on multi-step reasoning, covering math problems, and on this dataset we see that as we do more iterations of this Reinforced Self-Training paradigm, we see an improvement in accuracy. The numbers in orange here are a much larger PaLM model, the numbers in blue are a smaller model, and the dashed lines represent what you get if you did supervised fine-tuning on human-provided rationales. So one of the promising things about this approach is that when you do multiple iterations of training on your own rationales, you can outperform human-generated rationales. And that is exemplified again in [00:19:17] this graph. So, sorry: blue is if you fine-tune the PaLM model on all human-provided rationales; orange is if you fine-tune on one rationale per training example, and these are written by humans; in green is what you get if you fine-tune on one rationale chosen at random per question, which is generated by the model, so it's controlling for the number of rationales. And we see that it outperforms human-provided rationales. And then if you do the full multi-step iterative procedure, where you keep improving the model, we see again a boost in performance. So that's super promising.

[00:20:21] But let's start revisiting the question that we asked in the beginning, about reasoning in language models. One way of answering that question is: we can apply all these methods and look at benchmarks. But maybe the way to answer the question correctly is to be more systematic: come up with counterfactual tasks, and be very careful about possible data contamination. And I'm going to show some results around that.

So, we started the lecture with Chain of Thought, and maybe the first question to ask is: are the rationales that the model produces with Chain of Thought faithful? What I mean by faithful is: maybe the model produces some rationale and then it produces an answer, and maybe the answer does not even depend on the rationale that it produced. [00:21:21] So maybe the question was, you know, Tom has three apples and Jerry has four apples, and the rationale it produced was: okay, Tom has three apples, Jerry has four, 3 + 4 is seven, so the answer is 25. In a case like that, you'd say the model was not faithful to its rationale.

And so what we see in this plot is a very careful experiment, where on the x-axis we have the number of reasoning samples. So the setup is something like this: for every question, the model produces a rationale, and a rationale here is multiple sentences. And what we're going to do is force the model to early-exit from its rationalization and just force it to produce an answer. So if it produced four rationale sentences, I can early-exit right after the first one and ask it to produce an answer, I can exit after the second one and ask it to produce an answer, and so on. And what I'm going to plot on the y-axis is the model's accuracy after early-exiting [00:22:23] in this procedure. So let's say that I early-exited after just one rationale sentence, and the model produced exactly the same answer that it would if it had seen all four sentences in its rationale; then maybe we can conclude that the reasoning is not faithful, like it doesn't matter whether the model sees the full rationale or just the first sentence. And if you take that to the extreme, maybe you terminate it even without any rationale, and it produces the same answer. So the results here are somewhat mixed, but we see that there are enough datasets where it doesn't matter whether the model sees the full rationale before answering or you early-exit, you kind of get the same answer, which means that sometimes these rationales may be post-hoc explanations of the model's answer.
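The early-exit probe can be written down as a small measurement harness. This is a sketch of the experimental logic only; `answer_fn` stands in for prompting the model with a (possibly truncated) rationale and reading off its answer.

```python
def early_exit_answers(sentences, answer_fn):
    # Ask for an answer after each truncation of the rationale:
    # k = 0 means no rationale at all, k = len(sentences) means the full one.
    return [answer_fn(sentences[:k]) for k in range(len(sentences) + 1)]

def is_post_hoc(sentences, answer_fn):
    # If the answer with no rationale already equals the full-rationale
    # answer, the rationale did not influence the answer.
    answers = early_exit_answers(sentences, answer_fn)
    return answers[0] == answers[-1]
```

With a model whose answer genuinely depends on the later sentences, `is_post_hoc` comes out false; with one that has already committed to its answer, it comes out true regardless of the rationale.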
answer this exact same question [00:23:24] tries to answer this exact same question is uh you can take these rationals and [00:23:27] is uh you can take these rationals and then you can start corrupting them [00:23:29] then you can start corrupting them so maybe your rational was length four [00:23:32] so maybe your rational was length four and then I generate the first rational [00:23:34] and then I generate the first rational the second rational and for the third [00:23:35] the second rational and for the third rational I just corrupt it okay and then [00:23:38] rational I just corrupt it okay and then uh the fourth rational and then I asked [00:23:40] uh the fourth rational and then I asked the model to generate my answer if it [00:23:41] the model to generate my answer if it turns out that no matter how much I [00:23:43] turns out that no matter how much I corrupt my rational the model produces [00:23:46] corrupt my rational the model produces the same answer then I can conclude that [00:23:50] the same answer then I can conclude that again the answer kind of did not depend [00:23:51] again the answer kind of did not depend on my [00:23:52] on my rational so on the x-axis uh we are [00:23:56] rational so on the x-axis uh we are looking at the number number of re the [00:23:59] looking at the number number of re the percentage of reasoning steps before uh [00:24:02] percentage of reasoning steps before uh I add sort of a mistake in the rational [00:24:05] I add sort of a mistake in the rational okay so what you should see is kind of a [00:24:08] okay so what you should see is kind of a strictly increasing uh increasing sort [00:24:11] strictly increasing uh increasing sort of trend where if I add a mistake after [00:24:14] of trend where if I add a mistake after the very first step then that's probably [00:24:17] the very first step then that's probably going to change the answer a lot and [00:24:19] going to change the answer a lot and then if I add a mistake 
after the last [00:24:21] then if I add a mistake after the last step that maybe doesn't change the [00:24:22] step that maybe doesn't change the answer all that much but again we find [00:24:24] answer all that much but again we find that for some data sets uh it so happens [00:24:28] that for some data sets uh it so happens that you know you can add a mistake in [00:24:31] that you know you can add a mistake in the first sentence in your rationale and [00:24:32] the first sentence in your rationale and the answer is not going to change all [00:24:34] the answer is not going to change all that much and so that's also kind of an [00:24:37] that much and so that's also kind of an indicator that maybe these rationals are [00:24:39] indicator that maybe these rationals are sort of post talk explanations of the [00:24:41] sort of post talk explanations of the model's [00:24:42] model's behavior um so yeah so there's a lot of [00:24:46] behavior um so yeah so there's a lot of lines here so if anyone has questions uh [00:24:49] lines here so if anyone has questions uh see a few blank faces in the audience [00:24:59] okay so let's let's uh let's keep moving [00:25:01] okay so let's let's uh let's keep moving um okay so that's that was about like [00:25:04] um okay so that's that was about like whether uh the models where sort of [00:25:07] whether uh the models where sort of Chain of Thought expresses kind of a [00:25:08] Chain of Thought expresses kind of a reasoning that the model is faithful to [00:25:11] reasoning that the model is faithful to uh another question you could ask is [00:25:14] uh another question you could ask is what if I changed my setting a little [00:25:16] what if I changed my setting a little bit right so my model let's say I [00:25:18] bit right so my model let's say I observe that it's able to do arithmetic [00:25:21] observe that it's able to do arithmetic in base 10 so it's able to answer [00:25:23] in base 10 so it's able to answer something 
like 12 + 14 uh does that mean [00:25:27] something like 12 + 14 uh does that mean that my model knows how to do it [00:25:28] that my model knows how to do it arithmetic or maybe there was just this [00:25:30] arithmetic or maybe there was just this exact same um you know example was [00:25:34] exact same um you know example was present in the training data so one way [00:25:36] present in the training data so one way you could test for this is by creating [00:25:39] you could test for this is by creating counterfactuals which uh based on our [00:25:41] counterfactuals which uh based on our understanding of the data you expect uh [00:25:43] understanding of the data you expect uh to not be present that frequently in the [00:25:45] to not be present that frequently in the training [00:25:46] training data so instead of doing base 10 [00:25:49] data so instead of doing base 10 addition you could do addition in base 9 [00:25:52] addition you could do addition in base 9 and then if the model has the same [00:25:54] and then if the model has the same accuracy in base 9 then you can conclude [00:25:56] accuracy in base 9 then you can conclude that maybe this model has under OD how [00:25:58] that maybe this model has under OD how to do [00:25:59] to do addition similarly for logic uh maybe uh [00:26:04] addition similarly for logic uh maybe uh the reason why the model is so good at [00:26:06] the reason why the model is so good at solving logic problems is because it's [00:26:08] solving logic problems is because it's seen something very similar in its [00:26:10] seen something very similar in its training data so what if I construct a [00:26:12] training data so what if I construct a world where I don't know corgis are [00:26:15] world where I don't know corgis are reptiles can it still do this logic [00:26:20] reptiles can it still do this logic problem okay and so what we find is uh [00:26:25] problem okay and so what we find is uh there is a you know sometimes a 
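The base-9 counterfactual is easy to make precise: the same addition procedure, just carried out in a different base. A small reference implementation for generating gold answers might look like this (the helper names are my own):

```python
def to_base(n: int, base: int) -> str:
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return ''.join(reversed(digits)) or '0'

def add_in_base(a: str, b: str, base: int) -> str:
    # Interpret both operands in `base`, add, and render in the same base.
    return to_base(int(a, base) + int(b, base), base)

# In base 10, 12 + 14 = 26. In base 9, "15" + "14" is 14 + 13 = 27
# in decimal, which is written "30" in base 9.
```

A model that has actually learned the carrying algorithm should transfer to base 9; a model that has memorized base-10 sums should not.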
[00:26:27] And what we find is, there is sometimes a pretty significant drop when you move from... oh, there's a question: "Sorry, why is base 9 counterfactual and base 10 isn't?" So it's a counterfactual, excuse me, in the sense that the authors comment that base-10 addition is frequently observed in training data, but very few people do base-9 addition, so there are going to be much fewer examples of it in the training data. "So it's more out of distribution, right?" Yeah, you can also call it out of distribution, for sure.

And so from results like this, what we see is that there's this drop in performance even for very simple logic problems that don't involve multiple steps of reasoning, a pretty significant drop in performance, which maybe suggests that there's not that much reasoning, there's more memorization.

[00:27:35] So we could keep going with this paradigm of changing the problem setting so that it starts looking out of distribution to the training corpus, and this is exactly what was done in this paper that looked at analogical reasoning. Basically, the setup is something like this: I'm going to show certain examples of string transformations, and I'm going to ask the model to generalize to new examples. So in this extend-sequence problem, I have "abcd" and the output is "abcde", and then, given "ijkl", the model has to produce "ijklm", and so on. Now, the way you can make this into a counterfactual, or something that is out of distribution, is: maybe you can change what the extend-sequence task is. So instead of outputting "abcde", maybe the model has to output "abcdf". So instead of outputting the next character, it [00:28:42] has to output one more, the second character after the next, and so on. The other kind of counterfactual you could add is, instead of operating on the standard alphabet, you could modify the alphabet completely: instead of the alphabet being a, b, c, ..., maybe you start at x, y, and so on. So we find two things. The first thing we find is that there's a significant drop in performance as we go from the standard analogical reasoning problem to one of these counterfactuals, where we either change the alphabet or change the description of the task so that it becomes slightly unnatural. On the other hand, the authors also did this exact same experiment on human subjects, where they find very little decrease in performance. So overall, what this result suggests is that maybe there's some reasoning.
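To make the extend-sequence setup and its two counterfactual knobs concrete, here is a small generator for the target outputs. This reconstructs the task as described in the lecture; the parameter names and the particular shifted alphabet are illustrative choices.

```python
import string

def extend_sequence(seq: str, alphabet: str = string.ascii_lowercase,
                    step: int = 1) -> str:
    # Append the letter `step` positions after the last one,
    # looked up in whatever alphabet is in force.
    return seq + alphabet[alphabet.index(seq[-1]) + step]

# A counterfactual alphabet that starts at x: "xyzabc...w".
SHIFTED = string.ascii_lowercase[-3:] + string.ascii_lowercase[:-3]

# standard task:           extend_sequence("abcd")         -> "abcde"
# counterfactual rule:     extend_sequence("abcd", step=2) -> "abcdf"
# counterfactual alphabet: extend_sequence("xyz", SHIFTED) -> "xyza"
```

The rule is trivial to state, which is what makes the human-versus-model comparison sharp: changing `step` or `alphabet` barely affects people, but moves the model off its training distribution.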
[00:29:47] Maybe there's some memorization too, but there's nothing systematic. So, you know, again, this is all evolving, so maybe someone will find that if you change your prompt a little bit, now models can do reasoning, but this is kind of the current lay of the land.

[00:30:47] Okay, so that was the reasoning module of the lecture. I'm going to now switch gears and talk about language model agents. And this is related to reasoning in the sense that reasoning involves multi-step inferences, where, given some facts, you have to arrive at completely new conclusions. With agents, what we'll see is that there's some high-level objective the model has to accomplish, and it has to reason about postconditions, object affordances, and uncertainty in the world to carry out a sequence of steps.

So let's start with some terminology. We have our agent on the right; that's going to be some neural network. And then we have an environment, and I'll give some examples of what these environments could be. The agent receives an observation from its environment, and based on the observation it issues an action. And along with that, it receives this second variable g, and g represents a language instruction. There are many names for this setting and these models: digital agent, language-conditioned policy, or instruction-following agent.

Some examples of environments: maybe it's a web browser, and in a browsing environment the objective is to book a flight from San Francisco to New York, and the observation could [00:32:00] either be the raw pixels that the model sees, or it could be the HTML DOM representation. And the action space, if you're looking at these web environments, could be typing on specific web elements, clicking on web elements, moving your mouse to a certain web element to interact with it, and so on. And there's a vast number of applications; I don't think I can cover all of them, but we can look at some. There are obviously digital assistants; I'm not going to say the names, because I know people's phones might start popping up, but you can give them natural language commands, like set an alarm, set reminders, and so on. You could also do natural language programming, where, given natural language descriptions, you get a model to write Python code.
another example of [00:33:03] write python code another example of this could be UI [00:33:05] this could be UI automation where maybe you want to do [00:33:08] automation where maybe you want to do automated testing of of UI elements and [00:33:11] automated testing of of UI elements and so instead of having a human sort of [00:33:13] so instead of having a human sort of verify whether uh a UI UI element Works [00:33:17] verify whether uh a UI UI element Works maybe you can get a model to execute [00:33:19] maybe you can get a model to execute actions corresponding to a given [00:33:21] actions corresponding to a given instruction or it could be something [00:33:23] instruction or it could be something more sort of user facing where uh you [00:33:26] more sort of user facing where uh you know given some kind of complex [00:33:29] know given some kind of complex environment like Spotify you could ask [00:33:31] environment like Spotify you could ask an agent to play some [00:33:34] an agent to play some songs and then finally uh there is this [00:33:37] songs and then finally uh there is this sort of emerging [00:33:38] sort of emerging application where we want to add [00:33:41] application where we want to add additional tools um or plugins to [00:33:45] additional tools um or plugins to language models so that they can control [00:33:48] language models so that they can control various different [00:33:49] various different applications [00:33:52] applications um okay so uh before we look at how we [00:33:55] um okay so uh before we look at how we can use language models to do [00:33:57] can use language models to do instruction following I think it's very [00:33:59] instruction following I think it's very helpful to look at how this was done [00:34:01] helpful to look at how this was done before language [00:34:02] before language models um so uh there were basically [00:34:06] models um so uh there were basically three main [00:34:07] three main ideas uh 
sometimes uh the the the right [00:34:11] ideas uh sometimes uh the the the right thing to do was uh collect examples of [00:34:16] thing to do was uh collect examples of utterances paired with uh logical forms [00:34:20] utterances paired with uh logical forms so logical forms uh could be some kind [00:34:23] so logical forms uh could be some kind of an executable representation that you [00:34:25] of an executable representation that you could just execute against either a [00:34:28] could just execute against either a knowledge graph or a database to get an [00:34:31] knowledge graph or a database to get an answer so maybe you have a query like [00:34:34] answer so maybe you have a query like what state botherers [00:34:36] what state botherers Texas and then there exists some sort of [00:34:39] Texas and then there exists some sort of program description that you could [00:34:41] program description that you could execute against uh a Knowledge Graph to [00:34:45] execute against uh a Knowledge Graph to get sort of an answer or a list [00:34:48] get sort of an answer or a list here um and and idea number one that [00:34:51] here um and and idea number one that people looked at was to treat this as [00:34:54] people looked at was to treat this as almost like machine translation right so [00:34:55] almost like machine translation right so you have uh [00:34:58] you have uh a source language which is sort of [00:35:01] a source language which is sort of English commands and then you have a [00:35:03] English commands and then you have a target language which is sort of these [00:35:06] target language which is sort of these uh these like meaning representations or [00:35:08] uh these like meaning representations or logical forms and then you could apply [00:35:10] logical forms and then you could apply the same Machinery from assignment 3 uh [00:35:13] the same Machinery from assignment 3 uh to build kind of a natural language [00:35:15] to build kind of a natural 
language interface here okay so you directly [00:35:17] interface here okay so you directly maximize the probability of a sequence [00:35:20] maximize the probability of a sequence of actions given a goal or a [00:35:24] of actions given a goal or a command idea number two was [00:35:28] command idea number two was um something a little bit more complex [00:35:30] um something a little bit more complex so here you have um instructions paired [00:35:36] so here you have um instructions paired with actions instead of directly mapping [00:35:38] with actions instead of directly mapping instructions to [00:35:40] instructions to actions uh I'm going to infer an [00:35:43] actions uh I'm going to infer an executable plan okay from these [00:35:47] executable plan okay from these instructions uh and action sequences and [00:35:50] instructions uh and action sequences and I'm going to train a model to go from [00:35:53] I'm going to train a model to go from instructions to these plans and then [00:35:56] instructions to these plans and then Define a very rich execution model [00:35:59] Define a very rich execution model that's going to directly execute these [00:36:01] that's going to directly execute these plans the advantage of this is uh maybe [00:36:04] plans the advantage of this is uh maybe there is more sort of highlevel uh [00:36:07] there is more sort of highlevel uh decisions you could encode in your plan [00:36:09] decisions you could encode in your plan which would be harder to like get into [00:36:12] which would be harder to like get into the model if you were to just train it [00:36:14] the model if you were to just train it uh to produce the action trajectories [00:36:16] uh to produce the action trajectories directly and I have an example of a [00:36:19] directly and I have an example of a system like that from [00:36:21] system like that from 2011 which uh was basically an agent [00:36:24] 2011 which uh was basically an agent that could navigate in um in 
a grounded environment. [00:36:28] And the idea was something like this: you took an instruction and obtained a plan, and then you would train a semantic parser, which is basically this kind of machine translation system that would convert commands into plans. And then once that's trained, at test time, given a completely new instruction, you would run the semantic parser, get the plan, and then execute it in this execution model. And I have an example of an instruction and a plan from this 2011 system.

[00:37:07] The third idea, which is probably the first one that comes to mind if you see a setting like this, is to use reinforcement learning directly. And what people did there was to use RL to directly map instructions into actions: I'm going to learn a policy that outputs actions that maximize some reward, which is conditioned on my natural language instruction and the observation. And this reward could be sparse, which is: I carry out the entire task and then my environment tells me if I achieved the task or not. Or it could be something that I obtain after each step: I take an action, and then the environment tells me if this action completed some percentage of my task or not. And on the top I've included an example of a system from 2009 that did this for automated Windows debugging: you have some natural language instruction to click some UI elements, and that gets mapped into an API command that the model executes, one after the other.

[00:38:24] Okay, so these were basically the three main ideas that people had before language models: you would either train semantic parsers, or you would
infer these plans from instruction-trajectory pairs and then learn to directly model plans, with an execution model that can execute those plans, or you would do reinforcement learning if you had a reward signal.

[00:38:49] So how do we do things in 2024? There are a few ways to think about this; I think maybe the most instructive is to think about what we're trying to achieve. We are trying to model trajectories, sequences of actions, conditioned on some goal: I want my model to book a flight from San Francisco to New York, and I want it to produce a trajectory of, say, typing and clicking actions. So let's look at how that factorizes. The probability of a trajectory conditioned on a goal or an instruction is just the probability of the state, action, next state, and so on, conditioned on the goal, and you could factorize that into two terms. The first term is the transition dynamics of the environment, and that's just: if I take a certain action in a given state, how is my state going to change? And the second object is the agent policy, which is: given my goal and the trajectory so far, what is the next action I should be taking?

[00:40:03] And then people quickly realized that you could just treat this as a generative problem: you could treat the problem of decision-making in environments as a generative trajectory modeling problem. And what I have in the top right is an example of a transformer that just takes the history of actions it's taken so far, the current state, and some indication of what task it should achieve, here based on reward, but it could be a natural language string.
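The factorization described here can be written out as follows (a sketch of the standard decomposition, with states $s_t$, actions $a_t$, and goal $g$; not a formula quoted from the slide):

```latex
p(\tau \mid g)
  = p(s_1, a_1, s_2, a_2, \ldots \mid g)
  = p(s_1) \prod_{t} \underbrace{p(s_{t+1} \mid s_t, a_t)}_{\text{transition dynamics}}
    \;\cdot\; \underbrace{\pi(a_t \mid g,\, s_{1:t},\, a_{1:t-1})}_{\text{agent policy}}
```

Note that the transition dynamics do not depend on the goal; only the policy term is conditioned on $g$, which is the part the language model is asked to play.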
And it's just trained to predict what the next action is, and you could just train an autoregressive language model to do this. And it turned out that this worked very well in an offline RL setting.

[00:40:51] Question: "Sorry, in the figure, why are we predicting one action?" So, no, no: you predict an action, execute that, append that to your trajectory, and then you predict the next action, and so on. "So we resolve three input tokens into one output token?" Yeah, okay, sounds good.

[00:41:19] Um, and it turned out that this worked really well, and so instead of getting these latent plans and training semantic parsers, or trying to do reinforcement learning, we started using language models as policies. And a simple way to do all of that is to prompt a language model in a loop.

[00:41:44] Okay, so we're going to specify the action space in text. This is a simple language model agent; this is not going to work at all, but it's probably illustrative of how agents can be built now. So you provide an action space in text: maybe it's a digital environment, and maybe it can click, maybe it can type characters, maybe it can move the mouse somewhere. You provide it an instruction, and you provide it the sequence of actions and observations it's received so far. And then, conditioned on all that, you ask it to predict the next action. And there's nothing deep going on here; this is just chain-of-thought prompting in a loop. But the hope is that, because we reduced the problem of decision making to just autoregressive modeling, this could work.
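A minimal sketch of that prompt-in-a-loop agent, assuming a hypothetical `llm(prompt) -> str` callable and a toy environment; the action names, prompt format, and `ToyEnv` are illustrative assumptions, not the lecture's actual system:

```python
# Sketch of "prompt a language model in a loop": observe, ask for the next
# action, execute, append to the history, repeat. All names here are
# hypothetical stand-ins for illustration.

ACTION_SPACE = "You can act with: type(text), click(element), move_mouse(element)."

def build_prompt(instruction, history, observation):
    """Assemble the prompt: action space, instruction, interaction history, current observation."""
    lines = [ACTION_SPACE, f"Instruction: {instruction}"]
    for obs, act in history:
        lines.append(f"Observation: {obs}")
        lines.append(f"Action: {act}")
    lines.append(f"Observation: {observation}")
    lines.append("Next action:")
    return "\n".join(lines)

def run_agent(llm, env, instruction, max_steps=10):
    """Chain-of-thought-style agent loop: predict an action, execute it, repeat."""
    history = []
    observation = env.reset()
    for _ in range(max_steps):
        action = llm(build_prompt(instruction, history, observation))
        history.append((observation, action))
        observation, done = env.step(action)
        if done:
            break
    return history

# Toy stand-ins so the loop can run end to end.
class ToyEnv:
    """A two-field 'booking form' that is done once both fields are filled."""
    def reset(self):
        self.filled = 0
        return "page with fields: origin, destination"
    def step(self, action):
        if action.startswith("type("):
            self.filled += 1
        return f"{self.filled} field(s) filled", self.filled >= 2

def fake_llm(prompt):
    # A scripted "model": fill the origin first, then the destination.
    last_obs = prompt.rsplit("Observation: ", 1)[1]
    return "type(San Francisco)" if "fields:" in last_obs else "type(New York)"
```

With the scripted `fake_llm` standing in for a real model, `run_agent` fills both fields and stops; swapping in a real language model call is the only change the loop itself would need.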
And indeed, a slightly more complex version of this can work in some environments.

[00:42:56] Okay, so now I'm going to give a little flavor of what different environments look like now for evaluating language models as agents. The simplest environment that people consider is MiniWoB. This is a sandbox environment that evaluates basic browser interactions: you know, maybe on a mini Twitter environment, can you get a language model to retweet a given tweet; given a simulated email client, can the model forward someone's email, can it compose an email, can it click on certain buttons or not. It's not at all real-world, so it's not real websites, and it's relatively short-horizon: given any instruction, most tasks can be accomplished in under three actions. But zero-shot performance of even the best language models is still far from perfect, even on this very simple benchmark.

[00:43:58] A second, slightly more real-world benchmark is WebArena. This is also a sandbox environment, but it's a pretty close approximation of real websites, spanning e-commerce (there is a website in WebArena that resembles Amazon), social media (something that resembles Twitter), and additionally utility tools like maps: an instruction could require a model to open up a map application, find the shortest path from point A to point B, and use that in its later sequence of actions. And there's multi-tab browsing, like we commonly do: with MiniWoB there's only one single tab, and with WebArena, I think this was the first environment that introduced this idea where you have multiple tabs and the agent can switch between tabs. And again, we're going to evaluate functional correctness, which is whether the model gave the correct answer at the end, whether the sequence of steps it took gave the intended behavior, as opposed to whether it took a sequence of steps that maybe a user had pre-programmed.

[00:45:19] Another popular environment, or rather dataset, is WebLINX. WebLINX also has multi-tab browsing, and it has web interactions on real websites: these are not sandboxed approximations of real websites, these are actual real websites. And it also introduced a new action where the agent can communicate with the user. So maybe there's some instruction, say to reserve, I don't know, a movie, or to buy a movie ticket or something, and then at some point the model has to request credit card information, and so there is this additional action where a human could be involved in communicating with the agent. And this is not an environment but just a collection of interactions, so you can't, for example, do any kind of exploration or online learning here, but you could definitely use it for evaluation.

[00:46:36] Okay, so this was just a taste of what some benchmarks look like for language model agents. So how are we going to train these models? Given that we're going to treat decision making as causal language modeling, we're not going to use any of the ideas from the pre-LM era. The standard practice is to do in-context learning with few-shot examples.
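The few-shot setup described here, where human demonstrations are pasted into the prompt before the new task, could be sketched like this; the text format and helper names are illustrative assumptions, not any specific system's API:

```python
# Sketch of few-shot in-context prompting for an agent: K human demonstrations
# are rendered as text, then the new task is appended up to "Action:".

def format_demo(instruction, trajectory):
    """Render one human demonstration (instruction + observation/action pairs) as text."""
    lines = [f"Instruction: {instruction}"]
    for obs, act in trajectory:
        lines.append(f"Observation: {obs}")
        lines.append(f"Action: {act}")
    return "\n".join(lines)

def few_shot_prompt(demos, new_instruction, new_observation):
    """Concatenate the demonstrations, then the new task, ending at 'Action:'."""
    parts = [format_demo(ins, traj) for ins, traj in demos]
    parts.append(f"Instruction: {new_instruction}\nObservation: {new_observation}\nAction:")
    return "\n\n".join(parts)

# One hypothetical human demonstration on a simulated email client.
demo = ("forward the email from Bob",
        [("inbox open", "click(email_from_bob)"), ("email open", "click(forward)")])
prompt = few_shot_prompt([demo], "compose an email to Alice", "inbox open")
```

The language model's completion of the final `Action:` line is then parsed as the next action, exactly as in the prompting loop shown earlier.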
uh and in the few short [00:46:58] examples uh and in the few short examples uh for typically for any new [00:47:01] examples uh for typically for any new kind of uh website or any new use case [00:47:05] kind of uh website or any new use case you're going to get humans to perform [00:47:06] you're going to get humans to perform those tasks and sort of feed that into [00:47:09] those tasks and sort of feed that into the language models prompt as in context [00:47:11] the language models prompt as in context demonstrations which it could then use [00:47:14] demonstrations which it could then use to solve um similar similar looking [00:47:17] to solve um similar similar looking tasks on very similar [00:47:19] tasks on very similar websites so obviously this is not [00:47:22] websites so obviously this is not scalable uh there's thousands of [00:47:24] scalable uh there's thousands of environments on some environments that [00:47:27] environments on some environments that like lots of different interactions that [00:47:28] like lots of different interactions that are possible and so maybe there's [00:47:31] are possible and so maybe there's something better that we can do than [00:47:33] something better that we can do than just U sort of getting humans to provide [00:47:36] just U sort of getting humans to provide demonstrations for every new use [00:47:39] demonstrations for every new use case um and so we going to use something [00:47:42] case um and so we going to use something we saw early on in the lecture okay [00:47:45] we saw early on in the lecture okay which was to kind of use the language [00:47:47] which was to kind of use the language model to generate rationals and then [00:47:50] model to generate rationals and then fine tune on that and here we don't have [00:47:52] fine tune on that and here we don't have rationals but we could produce action [00:47:54] rationals but we could produce action trajectories and then we're going to use [00:47:56] 
trajectories, and then we're going to use that as supervision. [00:48:00] Okay, so the way that looks is something like this. Let's say I have some environment, say a MiniWoB environment, and I get an agent to randomly explore it: it just executes a random sequence of clicks, types, and scrolling operations, and produces some trajectories.

[00:48:24] Now I'm going to take these trajectories and somehow filter them; that was the idea from earlier: you generate a bunch of different outputs and then filter them somehow. Here we're going to use a second language model, because we don't know what a good trajectory looks like. It's not like a math problem where you know the correct answer; we just had a language model interact with the website and generate trajectories, and we want to somehow filter out the good ones. So we're going to use a second model that produces a description of these trajectories, and the idea is that if you can get a model to produce a description of what the sequence of actions corresponds to, then maybe that's a good enough signal for a good trajectory. [00:49:13] So maybe given the first trajectory it guesses that the instruction was to book a flight from San Francisco to New York; for the second trajectory it says to set the date to some given date; and maybe it wasn't able to come up with any good instruction for the third trajectory.

[00:49:35] Then we're going to do something we saw earlier, which is to do this iteratively. Now we have a goal that we inferred for a trajectory, and I'm going to get the language model to condition its behavior on this goal. The goal is to set the date to some given date, and now, instead of doing random exploration, the model produces a sequence of actions that corresponds better to some natural-language instruction. So it produced a trajectory based on that instruction. [00:50:14] Then I'm going to use a coarse filter that just looks at correspondences between the instruction and the sequence of actions and states the language model visited, and uses that to decide whether the trajectory was a good trajectory for the instruction. In this case, given the instruction, this seems like a pretty good trajectory for completing the task, so we add it to a set of examples. But maybe sometimes things are not so good. For that second instruction, the generated label was to
book a flight from San Francisco to New York. [00:50:58] Let's say we run that again through the language model and it produces a second trajectory, and clearly this does not look like a successful trajectory for booking a flight. So what do we do here? We could throw away this interaction, but interactions are pretty costly; if you're looking at real websites, each interaction could take a few milliseconds, so maybe we don't want to throw it away. [00:51:27] What we do instead is again invoke the relabeler: take the trajectory and assign it a new label. The model was not successful at accomplishing the task it set out to do, but it accomplished something, and we come up with a best guess of what that was using the second language model. It might say that the instruction you accomplished instead was to set the origin to SFO and the destination to New York City. [00:51:54] That gets fed back into the language model, and we keep doing this iteratively until our filter says this is a good instruction-trajectory pair. So we have the same idea of using a language model to generate outputs, plus some iterative procedure that gives us a good set of training examples.

[00:52:17] Overall, the method looks like this: you have some environment; we use an unconditioned language model to randomly explore it and generate a set of trajectories; and then we convert those trajectories into synthetic training data by iteratively converting trajectories into natural-language descriptions, and then converting the descriptions into even better trajectories, and so on. Once we have this collection of synthetic examples there are two things we could do. We could fine-tune on this data, but the simplest thing is to repeat the earlier paradigm: replace the human-provided in-context demonstrations with these synthetic demonstrations. [00:53:11] We find a reasonable boost in performance, a 13-point improvement on the MiniWoB benchmark; again, even though MiniWoB is very simple, zero-shot performance for even the best language models is far from perfect. We also see an improvement on a second, multi-step tool-use environment.
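The explore, describe (relabel), and filter loop just described can be sketched in a few lines. Everything below is a hypothetical stand-in, since the lecture does not give implementations: `random_policy` and `conditioned_policy` represent the acting language model, `describe` is the second (relabeling) model, and `is_good_pair` is the coarse filter.

```python
def collect_synthetic_demos(env_reset, random_policy, conditioned_policy,
                            describe, is_good_pair, n_rollouts=3, max_relabels=3):
    """Explore, describe (relabel), and filter trajectories into
    (instruction, trajectory) pairs, mirroring the loop in the lecture."""
    demos = []
    for _ in range(n_rollouts):
        # 1. Unconditioned exploration: a random sequence of clicks/types/scrolls.
        traj = random_policy(env_reset())
        # 2. A second LM guesses which instruction the trajectory accomplishes.
        instruction = describe(traj)
        for _ in range(max_relabels):
            if instruction is None:        # no plausible description: give up
                break
            # 3. Coarse filter: does the trajectory match the instruction?
            if is_good_pair(instruction, traj):
                demos.append((instruction, traj))
                break
            # 4. Otherwise act conditioned on the goal, then relabel the result.
            traj = conditioned_policy(env_reset(), instruction)
            instruction = describe(traj)
    return demos

# Toy stand-ins: a "good" trajectory is one that contains a submit action.
demos = collect_synthetic_demos(
    env_reset=lambda: None,
    random_policy=lambda env: ("click", "type"),
    conditioned_policy=lambda env, goal: ("click", "type", "submit"),
    describe=lambda traj: "book a flight",
    is_good_pair=lambda goal, traj: "submit" in traj,
    n_rollouts=1,
)
```

Here the random rollout fails the filter, so the agent re-acts conditioned on the guessed goal, and the second, goal-conditioned trajectory is accepted.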
[00:53:36] So far we've only looked at text, but for real-world applications it can be intractable to obtain the HTML for every environment and feed it into the language model's context. Sometimes there are tens of thousands of DOM elements plus the corresponding JavaScript, and inputting all of that into the context could be intractable, and maybe it's also not the best way to show the state of the environment. Maybe the best way is to directly show the pixels corresponding to the environment. So now we're going to look at some examples of vision-language models that people have used for building these agents.

[00:54:22] The first one we'll look at is LLaVA. The idea here is similar to Orca, which we looked at in the reasoning half of the lecture: we're going to use GPT-4 to generate, this time, both instructions and responses for textual descriptions of images. So maybe there's an image, and we use the metadata corresponding to that image to come up with a textual description, feed that into GPT-4, and ask it to generate possible questions and responses. [00:55:06] Then we jointly fine-tune an image encoder (here CLIP) along with a text decoder (here Vicuna, a LLaMA model that is instruction-tuned). Through this joint fine-tuning we end up with a model that can output language responses about images, so we can ask questions about images, and maybe use that to directly input screenshots instead of HTML DOM elements.
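The LLaVA-style data-generation step (rendering an image as text so a text-only model can invent instruction/response pairs) can be sketched as follows. The prompt wording and box format are illustrative assumptions, not LLaVA's actual prompts:

```python
def build_instruction_prompt(caption, boxes):
    """Render an image as text (caption + object bounding boxes) so a
    text-only LLM can invent question/answer pairs about it."""
    lines = [caption] + [
        f"{name}: ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"
        for name, (x0, y0, x1, y1) in boxes.items()
    ]
    return ("You are looking at an image described below.\n"
            + "\n".join(lines) + "\n"
            + "Write one question a user might ask about this image, then "
              "answer it. Format: Q: ... A: ...")

def parse_qa(reply):
    """Split a 'Q: ... A: ...' reply into an (instruction, response) pair."""
    question, _, answer = reply.partition("A:")
    return question.replace("Q:", "", 1).strip(), answer.strip()

# Example round trip with a hand-written model reply.
prompt = build_instruction_prompt("a dog on a beach", {"dog": (0.1, 0.2, 0.5, 0.9)})
pair = parse_qa("Q: What animal is shown? A: A dog on a beach.")
```

The parsed pairs would then become (instruction, response) fine-tuning examples for the joint image-text model.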
A second approach that built joint image-language models, which people later adapted for agents, was Pix2Struct. The idea is again very similar: there's an image encoder and a text decoder. The image encoder takes the image, converts it into patches, assigns each patch a position ID, and runs that through a Transformer; then a decoder decodes out some text. [00:56:17] One of the new things Pix2Struct introduced was a new pre-training task. For LLaVA the pre-training was fairly simple: use GPT-4 to generate synthetic questions and responses based on textual descriptions of images. But there's only so far you can go with textual descriptions. What Pix2Struct did was look at screenshots from websites, mask out parts of the screenshot, and ask the Transformer decoder to produce the HTML corresponding to the masked-out elements.
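Constructing one such masked-screenshot pre-training pair can be sketched like this. The data layout (a list of `(bbox, html)` elements) is an assumption for illustration; actually rendering and blanking the screenshot pixels is out of scope:

```python
def make_masked_example(elements, mask_index):
    """One Pix2Struct-style pre-training pair: blank out one element's region
    in the screenshot and use its HTML as the decoder target. `elements` is
    a list of (bbox, html_snippet) pairs from the rendered page."""
    masked_bbox, target_html = elements[mask_index]
    return {
        "mask_region": masked_bbox,    # pixels to blank out in the input image
        "visible_html": [html for i, (_, html) in enumerate(elements)
                         if i != mask_index],
        "target": target_html,         # what the decoder must reconstruct
    }

example = make_masked_example(
    [((0, 0, 100, 20), "<li>Python</li>"), ((0, 20, 100, 40), "<li>Go</li>")],
    mask_index=0,
)
```

The model sees the screenshot with one region blanked out and must decode the HTML for exactly that region, which forces it to relate pixels to structure.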
[00:56:56] So here there is a list with corresponding HTML, and one of the data points in Pix2Struct looks something like this: you might mask out the first answer, the one corresponding to Python, and ask the model to produce the HTML corresponding to just the patch that was masked out. This seems like a more natural pre-training objective that can maybe produce better interactions between image and text, and it was also adapted for building these multimodal agents.

[00:57:35] Okay, at this point I just want to highlight that this is really an emerging application. There's a huge "prompting gap", as I like to call it: if you do not do extensive prompting, and if you do not use bespoke few-shot examples, where every different environment gets its own set of few-shot examples, even the best language models are very far from perfect, even on very simple tasks like MiniWoB, where the goal is just to click on certain elements, or to respond to someone's email, which in MiniWoB takes about five actions.

[00:58:17] And even for something as simple as MiniWoB, even after extensive prompting with few-shot examples, there is a drop in performance as you go from the simplest tasks, which map an instruction to a single action, to tasks that map an instruction to maybe five or ten actions. So long-horizon planning is still very hard, even on these very simple benchmarks. [00:58:46] And if you look at something more complex like WebArena, which tries to approximate real websites, with multi-tab browsing and external tools the model can use, there's just a huge difference between human-level task success rates and what the best models get, even after prompting, even with few-shot examples.

[00:59:12] The kinds of errors models make are also pretty weird. In one of the examples from WebLINX, the task was just to open Google Translate and sign in using given credentials, an email and a password. What GPT-4V did was, instead of typing in the password, it typed the email into the password field, and it just couldn't recover from this error: it tried to sign in, there was an error, it typed the email again, and so on. I'm sure with extensive prompting you can fix this, and maybe that's beside the point. [00:59:56] In a different example, the model had to issue a search, and instead of issuing the search with the correct term, it repeated the same term three times, which obviously is not going to return any results. So there's a lot of room for improvement, and lots to be done in this space.

[01:00:25] Okay, so I'm going to recap and take any questions. We looked at two different things today. First, reasoning in language models: we saw there are a few ways to get reasoning-like behavior. You can prompt the models in various ways; the simplest example is chain-of-thought prompting. You can also do chain-of-thought prompting but generate multiple rationales, try to reconcile them, and pick the answer that was most frequent.
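That "generate multiple rationales and pick the most frequent answer" recipe (self-consistency) is simple enough to sketch directly; `sample_rationale` stands in for a stochastic LLM call:

```python
from collections import Counter

def self_consistency(sample_rationale, question, n_samples=5):
    """Chain-of-thought with self-consistency: sample several rationales,
    keep each one's final answer, and return the most frequent answer.
    `sample_rationale(question)` is a stand-in for a stochastic LLM call
    returning a (rationale, answer) pair."""
    answers = [sample_rationale(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stub that replays three "sampled" rationales.
_samples = iter([("2+2=4", "4"), ("2+2=5", "5"), ("two plus two is four", "4")])
majority = self_consistency(lambda q: next(_samples), "What is 2+2?", n_samples=3)
```

Note that only the final answers are reconciled; the rationales themselves can disagree as long as they converge on the same answer.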
[01:01:00] You can also do problem decomposition in your prompt: ask the model to explicitly decompose a problem into multiple steps before answering. That was all prompting. You could also train specialized small language models for reasoning by generating rationales from a big language model and then fine-tuning a smaller language model on those rationales. Or, instead of fine-tuning a smaller model on rationales from a big one, you could fine-tune the big language model on its own rationales and keep doing this iteratively; we saw that with multiple iterations performance can keep improving, and can even outperform human-provided rationales.
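That iterative "fine-tune on your own correct rationales" loop can be sketched as below; `generate` and `finetune` are hypothetical stand-ins for real LM sampling and training:

```python
def self_improve(model, problems, answers, generate, finetune,
                 n_iters=2, k=4):
    """Iterative self-training on a model's own rationales: sample up to k
    rationales per problem, keep those whose final answer matches the gold
    answer, fine-tune on the kept (problem, rationale) pairs, and repeat.
    `generate(model, problem)` -> (rationale, answer); `finetune` returns
    an updated model; both are stand-ins for real LM calls."""
    for _ in range(n_iters):
        kept = []
        for problem, gold in zip(problems, answers):
            for _ in range(k):
                rationale, answer = generate(model, problem)
                if answer == gold:           # keep only correct rationales
                    kept.append((problem, rationale))
                    break
        model = finetune(model, kept)
    return model

# Toy stand-ins: "fine-tuning" just counts how many pairs were kept in total.
final = self_improve(
    model=0,
    problems=[1, 2],
    answers=[2, 4],
    generate=lambda m, p: ("doubling", p * 2),
    finetune=lambda m, data: m + len(data),
)
```

The filter is the gold answer here, which is why this works for math-style tasks; the agent setting earlier had to substitute a second language model for that filter.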
[01:01:48] But on the flip side, we saw that while there are some initial reasons to be optimistic, if we do counterfactual evaluation it's not clear whether the models are good because they're reasoning, or good because all of these problems were already in the training data in some shape or form.

[01:02:11] In the second part we looked at language-model agents. We talked about the historical perspective through which people built grounded agents, and then we saw that you can recast the problem of decision making as causal language modeling. We looked at various ways people have modeled decision making with language models, most of which involve prompting and in-context learning, and then, similar to what we saw in the first module, we looked at a method for generating synthetic demonstrations, here using exploration and the same kind of iterative relabeling. [01:02:57] Most of the language models we looked at today were text-only, but we saw some examples of language models that can take both text and visual input. And we saw that the benchmarks are very challenging: models make trivial mistakes, there's a huge gap between human performance and where models are, and a lot of room for driving further improvement. Maybe some of you are doing that for your projects. Thank you.

[01:03:34] [Applause]

================================================================================ LECTURE 016 ================================================================================ Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert Source: https://www.youtube.com/watch?v=dnF463_Ar9I --- Transcript

[00:00:05] Okay, well, welcome back to CS224N. It's welcome back for me to CS224N too, since I was traveling for a couple of weeks; I hope everything went smoothly in the meantime.
so today I'm [00:00:20] smoothly in the meantime um so today I'm delighted to introduce our first invited [00:00:22] delighted to introduce our first invited speaker Nathan Lambert um so Nathan um [00:00:26] speaker Nathan Lambert um so Nathan um did his PhD at UC Berkeley so you're [00:00:29] did his PhD at UC Berkeley so you're allowed Boo and hiss for that [00:00:32] allowed Boo and hiss for that but um but um since then um he worked [00:00:38] but um but um since then um he worked first for a couple of years at hugging [00:00:40] first for a couple of years at hugging face and now he's working at ai2 the [00:00:44] face and now he's working at ai2 the Allen instit the Allen Institute for [00:00:46] Allen instit the Allen Institute for artificial intelligence um in Seattle um [00:00:50] artificial intelligence um in Seattle um so Nathan um comes from a background in [00:00:54] so Nathan um comes from a background in reinforcement learning like quite a few [00:00:56] reinforcement learning like quite a few other people who are now applying [00:00:57] other people who are now applying reinforcement learning to language [00:00:59] reinforcement learning to language models he had an early background [00:01:01] models he had an early background applying reinforcement learning to [00:01:03] applying reinforcement learning to robots but it turns out it's more fun to [00:01:05] robots but it turns out it's more fun to do it with language models um um no it's [00:01:08] do it with language models um um no it's not um okay um but anyway I mean he's [00:01:12] not um okay um but anyway I mean he's been very influential in both developing [00:01:16] been very influential in both developing ideas as to how to do posttraining with [00:01:19] ideas as to how to do posttraining with rhf and other ideas that come since then [00:01:23] rhf and other ideas that come since then including DPO that he'll definitely [00:01:25] including DPO that he'll definitely mention in today's 
talk um and so he's [00:01:28] mention in today's talk um and so he's one of the so best experts on the [00:01:31] one of the so best experts on the posttraining um phase of language model [00:01:35] posttraining um phase of language model development which has just proven as [00:01:37] development which has just proven as time is passed by that more and more of [00:01:39] time is passed by that more and more of the action of the large language model [00:01:41] the action of the large language model companies is happening not in the the [00:01:44] companies is happening not in the the initial um pre-training language model [00:01:46] initial um pre-training language model training phase but this subsequent [00:01:48] training phase but this subsequent posttraining phase and Nathan will have [00:01:50] posttraining phase and Nathan will have a lot to say about that today thanks a [00:01:52] a lot to say about that today thanks a lot for coming to do this yeah thanks [00:01:54] lot for coming to do this yeah thanks for the wonderful intro um you can see [00:01:57] for the wonderful intro um you can see my talk is life after DPO which is a [00:01:59] my talk is life after DPO which is a little bit of a unclear title so I [00:02:01] little bit of a unclear title so I apologize about this but it's trying to [00:02:03] apologize about this but it's trying to capture like what is the moment that [00:02:05] capture like what is the moment that we're at in alignment and Alignment [00:02:07] we're at in alignment and Alignment research and really DPO is the paper the [00:02:10] research and really DPO is the paper the story of last year which is this paper [00:02:12] story of last year which is this paper that came out and I'll get to the math [00:02:14] that came out and I'll get to the math and now a lot more people are interested [00:02:16] and now a lot more people are interested in able to do alignment and it's [00:02:17] in able to do alignment and it's building on 
from there so it's like what [00:02:19] building on from there so it's like what what are we going to be interested in [00:02:21] what are we going to be interested in after DPO and a tidbit talking with [00:02:23] after DPO and a tidbit talking with Chris that isn't explicitly in my slides [00:02:26] Chris that isn't explicitly in my slides is like what we're trying to close and [00:02:29] is like what we're trying to close and the labs like meta and people with the [00:02:31] the labs like meta and people with the amount of data that they're using for [00:02:32] amount of data that they're using for this kind of post [00:02:34] this kind of post training um fine-tuning there's all [00:02:36] training um fine-tuning there's all these words all defined is so big that [00:02:39] these words all defined is so big that like the amount of data points that meta [00:02:40] like the amount of data points that meta bought in llama 2 from one of these [00:02:43] bought in llama 2 from one of these providers is much more data than all of [00:02:45] providers is much more data than all of the data that's been collected on [00:02:46] the data that's been collected on chatbot arena for mmis so chatbot Arena [00:02:49] chatbot arena for mmis so chatbot Arena has like 800,000 data points that have [00:02:51] has like 800,000 data points that have been collected and metat 2's paper says [00:02:53] been collected and metat 2's paper says they bought about 1.5 million [00:02:55] they bought about 1.5 million comparisons and these are years outdated [00:02:57] comparisons and these are years outdated and chatbot Arena's data is that's as of [00:03:00] and chatbot Arena's data is that's as of a few weeks ago so you can only imagine [00:03:02] a few weeks ago so you can only imagine what op AI anthropic Etc are buying at [00:03:05] what op AI anthropic Etc are buying at this scale and this is the kind of [00:03:07] this scale and this is the kind of reality that we need to adapt to is 
what is different? We don't have that type of resource when doing research, so what are we going to do? This lecture is some history of things that led up to DPO that I think are important to remember, and then we'll go from zero to 100 and talk about recent research that we're doing to try to answer this question and define what is happening.
[00:03:31] So I'll start with a heavily abbreviated history of language models. I won't go through all of it; a bunch of this is in the class already, and this comes late in the lecture series. I like to start with Claude Shannon, and then you skip a whole bunch of history in which this autoregressive loss function shows a lot of promise. This was not fast: you can see how many years it took to build language modeling as a field, with deep learning brewing in the background
as one of many things that went into this. [00:04:00] Then you have these years: 2017, the Transformer paper that you hear about; 2018, with GPT-1, ELMo, and BERT, foundational work in language processing and how embeddings are created; and then with GPT-2, scaling laws become the key idea that people look at to track how these models are improving. 2020 is when people really started to wake up to how useful these large-scale trained language models were. At the time I wasn't even a language modeling person, but for a lot of people in AI, this is when the gravity of the situation started to suck people in. There's a cadence to these things: in 2021 we had the Stochastic Parrots paper, which, before ChatGPT, was raising the warnings of what we are actually putting into these
models, and what are they learning? Are they actually learning something meaningful from language, or are they repeating the language that we have? This is a philosophical debate, depending on where you land on what language is and what these language models are doing today, but it's important that it came out before ChatGPT; it laid the foundations of the debates over what language models are doing. [00:05:07] The end of 2022 is when ChatGPT actually came out, which was supposed to be a quiet launch of a demo from OpenAI, and it has since captured the attention of the world. The simple question is: can ChatGPT exist without RLHF? It's important to acknowledge that so much of this is from pre-training, but at every point along the line, in ChatGPT and a lot of these popular models since then, RLHF and these human-related or other fine-tuning
technologies seem to be necessary but not sufficient. You need the pre-training, but you also need this RLHF or post-training to really shift the needle on which models matter most at a given moment. You can list so many examples where RLHF has been relied upon. I like to look at the plots from the Anthropic Constitutional AI paper, where they show the iterative improvement of their different RLHF methods: multiple model versions evolving over time as more fine-tuning data is added. It's a dense paper, but it's one of the most representative figures of what RLHF can do; there's a lot of information in there that you don't need to follow right now. [00:06:21] And then Meta's Llama 2 paper is pretty funny, where they have this quote: reinforcement
learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community; however, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. That's from the technical report directly, which I find really entertaining. This was back in the day when we were like, oh, we don't know if RLHF is really going to take off; it's July of 2023, in this building period, and it's straight from the report, and it has aged really well, since people are still using this today. There are a lot of interesting hints about the history and culture of RLHF in the releases of these models, where these companies like to talk about it and give us these cultural details of what's going on.
[00:07:06] So I'm going to go through some definitions. I won't spend too
much time on an RLHF 101 of exactly what is happening with these mathematical terms, but it's important to get on the same page about what some of these things do and don't mean. There are a lot of definitions. One of the interesting ones to come back to, if it doesn't make sense right now, is the difference between instruction fine-tuning and supervised fine-tuning. Instruction fine-tuning is what has become really popular: you're training a model to follow instructions (I have another slide on this later). Supervised fine-tuning is more of a domain-specific thing, and we want to do both of them. I think instruction fine-tuning is more linked to RLHF; it's about making these models really useful, really engaging, and easy to work with. And then there are
other things like alignment, which is super vague, but it's in the word: align. It's training a model to be mirrored to what a user wants, and there are a lot of things you can align to. RLHF is a mouthful, and it's one specific tool for doing alignment, where you have this human feedback data. Feedback is a really loaded word there: there can be preferences, and learning to rank is related to actually putting feedback on preferences. There are a lot of little distinctions. I tried to make "preference fine-tuning" a phrase at one point but didn't really double down on it; I think it's a little clearer than RLHF, especially in the context of DPO. But there are all these overlapping spheres in the post-training or fine-tuning space of models these days.
[00:08:34] Instruction tuning, or instruction fine-tuning, is still the foundation of a lot
of this. This is where things called system prompts are added, making the model ready for a specific style of input. OpenAI is still innovating here: they have this Model Spec document, released a few weeks ago, where they say they're going to have a second-level system prompt. That adds structure to how the models take in data, so that you can do a lot more fine-tuning down the line, and it shapes how user data actually gets passed to the model, or how the developer passes information that the user doesn't see. [00:09:11] What this can often look like is Stack Overflow or Reddit data, where you have a question at the top and then an answer, and I think that's still a lot of what is happening behind the scenes. There are a lot of Stack Overflow datasets out there, and Reddit has these data partnerships.
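As a concrete picture of what one of these instruction-tuning examples looks like once a system prompt and a question/answer pair are serialized, here is a minimal sketch. The template and special tokens are hypothetical (every model family defines its own), and `format_example` is my name for illustration, not a real library function:

```python
def format_example(system: str, user: str, assistant: str) -> str:
    # Hypothetical chat template: real models (Llama, ChatML, etc.)
    # each define their own special tokens and ordering.
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{user}\n"
        f"<|assistant|>\n{assistant}"
    )

example = format_example(
    system="You are a helpful assistant.",
    user="How do I reverse a list in Python?",
    assistant="Use reversed(xs) or xs[::-1].",
)
```

During training, the loss is then computed over this single token stream, often with the prompt portion masked so that only the answer tokens contribute.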
All of this still uses the autoregressive loss function that we started with; we haven't branched out into different loss functions yet, but it's still super important. A lot of academic research says this is all you need, in some ways; I think that's a much more mixed bag, but it is the simple method and the right place to start. [00:09:46] From there we go to the RLHF objective, which looks really familiar to people trained in reinforcement learning but is a little different from the NLP loss function. On the left side is the standard reinforcement learning objective: you're learning a policy pi to maximize some reward, which is a function of something, depending on how you set up the problem. On the right side is a KL constraint, a distance term that keeps the policy from changing too much.
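Written out, the objective described here is the standard KL-constrained reward maximization (notation follows the usual RLHF formulation; $r_\theta$ is the learned reward model, $\pi_{\text{ref}}$ the frozen reference policy):

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\big[\, r_\theta(x, y) \,\big]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\big(\, \pi(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \,\big)
```

The first term is the "left side" (maximize reward) and the second is the "right side" (don't drift too far from the reference model); the coefficient β trades the two off.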
It's related to this whole idea of over-optimization, which I don't go into much in this talk, but the key idea is that we want to optimize a reward without over-optimizing it. The primary questions when doing RLHF are: how do we implement a reward function, that is, what is our reward actually going to be, and then how do we optimize it? You see this abstracted later as: we train a specific reward model, and then we have specific policy updates; DPO, direct preference optimization, handles this a little differently. [00:10:47] Before we get there: the actual preference model that people use for RLHF is, I find, interesting. It's the Bradley-Terry model, which comes from economics in roughly the 1950s and is essentially a probability distribution over a pairwise choice.
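Concretely, the Bradley-Terry model puts a probability on one completion being preferred over another via the difference of their scores (standard formulation; $r$ is the learned scalar reward and $\sigma$ the logistic sigmoid):

```latex
P(y_1 \succ y_2 \mid x)
= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
= \sigma\big(\, r(x, y_1) - r(x, y_2) \,\big)
```

Training the reward model amounts to maximizing the log-likelihood of the human-chosen completion under this distribution, which is what licenses reading the model's output as a scalar score.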
What ends up happening, for various technical reasons, is that a trained preference model needs to output a scalar value, and by a coincidence that I think is still very convenient, they just take the output of this learned probability distribution as a reward. They say the reward is going to be proportional to this probability and it's going to work, and it ends up doing so. But that's a big leap to accept: we have this pairwise preference probability saying how likely one answer is to be chosen over another, and then you take the somewhat crazy mental step of saying we just pass in one piece of text and get the probability that this piece of text would be chosen over any arbitrary other one. There are a lot of assumptions, and some deep concepts, in here, but what we're getting is
a model that gives us the score out. [00:11:55] And the question is: why do we have to do this? What if we could just take our original objective and use gradient ascent on that equation (ascent because it's a maximum)? This is really what DPO does. I'm blurring through a ton of math; it's a great paper for learning the math of language modeling, where you learn how the probabilities of different pieces of text are handled by the model, how it ends up being a lot of log-probability ratios, and how the prompt and the completion are handled differently. It's worth digging into and understanding the derivation, but the core idea is: why can't we just do gradient descent, or gradient ascent, to solve the RLHF optimization? And this becomes incredibly simple.
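To make "incredibly simple" concrete, here is a minimal sketch of the per-pair DPO loss. It is pure Python for illustration; real implementations operate on batched tensors of summed per-token log-probabilities, and the variable names here are mine, not the reference implementation's:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair:
    -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x))
                       - (log pi(y_l|x) - log pi_ref(y_l|x))))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Loss falls as the policy favors the chosen completion more
    # strongly than the reference model does, relative to the
    # rejected completion.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; raising the chosen completion's relative likelihood drives it down. This is why a DPO trainer only needs two forward passes (policy and frozen reference) plus ordinary backprop, rather than a rollout-and-reward infrastructure stack.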
The reference code on the right, from the original implementation, is extremely simple, and it has this characteristic where, if you've worked with something like Transformers before, it's pretty easy to write a loss function that uses DPO, rather than building an entire infrastructure stack to start with. When you do something like PPO and the full RLHF stack that OpenAI uses, you normally need an almost entirely new infrastructure stack, but you can get started with DPO in a much, much simpler way. There are some characteristics I'll get to later: DPO still has a reward model, which is really important for the math to actually check out, in that you're using your original language model as a different type of reward model. But that quickly takes us down a whole bunch of derivations, which is probably not the lecture that I think is as
fun to give. [00:13:35] The key thing, and why this lecture is called what it is, is that the first two points mean we'll see more DPO models than anything else. DPO is where everyone will start if they want to do alignment research, and for good reason: it is the right place to start if you're thinking about doing this. It scales more easily on compute, it's easier to debug, and it's even easier to learn, so it's not really worth second-guessing, and it is a good place to start. [00:14:01] But it also leads to these ridiculous conversations online where everyone is trying to figure out whether DPO is better than other RL methods: PPO, the older popular deep RL algorithm that John Schulman wrote, and REINFORCE, which is a slightly different parameterization of policy gradient. They're very similar, and DPO ends up just being simpler to work with. So there's this
meme where it's like: if you just do gradient descent, it'll work. In reality they're different loss functions doing very different things, but you can get similar results with both, which is why, if something is much easier to do, you should just start with it. I come back to this much later in the talk: what is fundamentally different about these RL algorithms, how your data is processed, and where the signals actually come from. For now, we don't need to pick one over the other; we can do both, and they are different.
[00:15:00] So that's the quick 101 of the core ideas. Next I'm going to trace the path of how we actually got to training models with DPO, because, while this slide was from a different talk that this subsection is reduced from, DPO really came
out months before we started getting popular models trained with it. So how did we actually get to the point where the community was training models with DPO, which happened much more recently than the paper's release? [00:15:29] This goes all the way back to the first instruction-tuned models that you saw: the Alpaca, Vicuna, Koala, and Dolly of the world, all in April of 2023. These are all built on similar ideas and slight iterations: figuring out how to use synthetic data, building on the first LLaMA release, and some other things I'll talk about, but this is where we started. They all use instruction tuning, and most of them use synthetic data. What Vicuna actually did was use this thing called ShareGPT, which was the first time that people working in this academic alignment space had
access to data that came from humans. It ended up being a bit of a legal gray area, because it was logging data from a Google Chrome extension called ShareGPT that people used to give ChatGPT a share button. But this data was really important to things like Vicuna and a lot of the other models that came down the line, and it's still used in models today as one subset of the training dataset. Just having access to these human prompts unlocked a lot of potential back in the day, and thankfully we're now starting to get datasets like this that were collected in more permissive ways: the LMSYS data has prompts that are collected with consent, and WildChat, a project from AI2, essentially gave people free access to ChatGPT in exchange for their data. [00:16:53] The thing that came after
ShareGPT was the realization that we need more human data, [00:16:58] and this Open Assistant project is one that we honestly need more of. The fact that we haven't seen more things like it shows how hard it is to create human data. This was run by a few people in a Discord community [00:17:12] working extremely long hours to generate prompts, responses, and preference pairs for common requests to language models. This was from April of 2023 and we haven't seen anything like it since. ShareGPT or LMSYS's data is similar, but there's not the same level of controls and voting and ranking that went into this Open Assistant data. [00:17:33] It again is a data set that we're still training models with, and many people still train models on it that I think come up time and time again. So these one or two influential data sets from over a year ago are still what are used to train models. So you'll get
the theme as I keep going. [00:17:46] There were actually RLHF models trained in April of 2023 as well. This was from CarperAI, which was doing a lot of work in the space; they've fallen back a bit in recent times, but they were people doing methods similar to what I'm going to talk about at the end of the talk. [00:18:05] That kind of knowledge and infrastructure was not translated into things that were easy to use. So there's also this vein of: even if things are open, it doesn't mean they're going to immediately catch on and be useful. You have to have the resources, the data, and your codebase set up in a way that [00:18:24] people can build on it, which is what DPO did really well. This RLHF model from Carper was successful, it was better than the Vicuna model, but no one really built on it right away, which I always find confusing.
Then kind of later in the year, another key thing for this open [00:18:40] alignment was the Llama 2 backlash: when Llama 2 was asked to kill a Linux process, it would refuse. This bred a whole series of models which are still referred to as "uncensored," which I don't think is the best name, because I don't think there was ever actually any [00:18:58] intentional censorship of the model. But the goal is to make models that don't refuse any request, which is useful as a research artifact: what do you get out of a model if it answers every question, what are the limits in that regard? There are other ways to use that, which are up to you. [00:19:16] But what ended up happening is that a lot of these ShareGPT data sets, because they're from ChatGPT, contain data that says, "oh, as a language model I shouldn't answer that," so people started filtering all of that out,
and there you [00:19:27] still see a lot of people releasing these uncensored models today as a popular area of development. [00:19:35] I think that we should understand what people need when doing research: researching a model that doesn't refuse is reasonable, but if you're going to deploy a model for free use to users, you should consider whether or not everything should be answered. So as a researcher, how your artifacts are used kind of depends on the work that you're [00:19:56] actually going to be doing. Then, in the alignment timeline (I'm almost done with this lens), there's this long series of models that are really interesting to people like me but never really broke through the narrative, where they're saying things like "we used RLHF," or "we're the first model to beat GPT-4 on [00:20:11] AlpacaEval" and these other eval tools. They're scaling
things up, but they don't always [00:20:18] have papers and they don't always have codebases, and things are happening all around; it's not just the Hugging Faces of the world. There are a lot of different organizations in the US and elsewhere that were aligning models and getting similar numbers to, or beating, these [00:20:32] mainstream tech companies and the places that you look to for models. These are all in the summer of [00:20:39] 2023. I bring these up because this comes before the first big splash of DPO. This Zephyr model was really the first model that I remember making a [00:20:50] splash with DPO, and it took until this time, in September, after the May release of the paper, for people to really say, "oh, DPO is the real deal." It took four months, and now the paper has a best paper award,
everyone uses it, and there are tons of derivations. [00:21:08] But in industry, among people trying to train models, there was a lot of skepticism until this moment. So this is like a classic academic story of needing to wait a bit until your work is vindicated in some way. The two crucial things here were, first, a new data set, the UltraFeedback data set, [00:21:23] which is a preference data set of synthetically generated text labeled by GPT-4, so again one of these new ways of making data. We didn't make it; it was made by OpenBMB, who I think are based in China. And then we also just had to do a lot of experiments to make [00:21:43] it work. There's a weirdly low learning rate that was needed to make this kind of chat model work with DPO, which is 5e-7. If you're really plugged into AI, you'll know that 3e-4 is the lore of the best learning rate.
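The DPO objective being described can be sketched in a few lines (a toy illustration; the function name and numbers below are my own, not from the talk). The loss rewards the policy for preferring the chosen completion over the rejected one, relative to a frozen reference model, and the Zephyr recipe paired this objective with the unusually low learning rate just mentioned:

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-probs."""
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) written stably as log(1 + e^-x)
    return math.log1p(math.exp(-margin))

# Toy summed log-probabilities for one pair (illustrative values only)
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5)
```

A real run would compute these log-probabilities with the policy and a frozen copy of it, then step the optimizer at roughly 5e-7 rather than the folklore 3e-4.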
So it's orders of [00:21:58] magnitude lower. That's kind of what it took to get this to work; we probably could have done it months earlier if we had just done more hyperparameter sweeps, but this is the random happenstance of the stories that people now [00:22:10] backcast as being "this is the super important model." It's somewhat random. At the same time, I was switching jobs to the Allen [00:22:18] Institute, and they were already working on this project, which is trying to do a systematic study of instruction tuning data along with some of these preference tuning recipes that were coming out. Because once this Zephyr model came out, [00:22:31] there were always skeptics saying, "oh, doing it at 7B is easy, that's a small model; is it actually going to scale to the real deal, to bigger models, to what ChatGPT does?" So it was like, okay, we have some
more compute, and we tried it on this 70 [00:22:44] billion parameter scale and we showed similar gains. All we did was use the same UltraFeedback recipe and the low learning rate, and it largely worked. So this was within two months, and since then there have been [00:22:59] tons of new DPO models; all these startups that are releasing their own models will release an instruct version that is a DPO thing, and that kind of continued for six months. I think just today I'm starting to see fewer DPO models, which is interesting. I've been [00:23:13] keeping track of them for another evaluation project, and it has finally slowed down a little bit. I don't know if that's alignment at large, but there are so many; I should add a slide that lists the ridiculous number of [00:23:24] DPO models that came after these two. But this is really when the
floodgates kind of started, and [00:23:34] when we realized, okay, DPO really works. So this is kind of why I ask what comes next. We could retrain models on the data sets that we have (we don't have that many data sets), but it kind of feels like we're fishing in the [00:23:45] dark. Zephyr was built on the success of needing the low learning rate; this Tulu 2 model is actually trained on TPUs, because we have the Google TPU [00:23:54] Research Cloud, so we have bigger TPUs to train these models. So how do we do this more systematically? That's kind of where most of what I talk [00:24:02] about today on the technical side comes in: the recent research that we've been doing to make sense of this and answer the fundamental questions, like what do we need to change about DPO, is PPO better, and so [00:24:14] on. So this is kind of the reality that I go back and forth in between,
which is: we don't really have the human data to do RLHF like industry does, [00:24:23] but it is getting much easier to do alignment research, so you can kind of choose your narrative. I think sometimes, because I'm so close to industry and hear about what people have, I'm too often on that first side, but there is a lot of opportunity to do things. It feels [00:24:35] crowded, but being crowded at this point, when there's so much investment, is just because you're in the right area, and most people in this room aren't trying to be professors, so if you get scooped, [00:24:46] it's okay. I find it very fun. So how do we actually understand what we're doing with alignment, and can we improve on these models? Tulu 2 has a number because we want to keep releasing [00:24:59] more models. So how do we get better at evaluating what we're doing, to try to understand this process, and then how do we train better models? So these
[00:25:07] are the sort of things that I'm up to. I have a few examples of things I've been working on: I built an evaluation tool [00:25:13] for reward models, and I'll talk more about reward models to start here. We need better evaluation because, when you're training models, you need to be able to do what I call local [00:25:24] evaluation: you need to be able to get a number that tells you if your training technique is improving the end result. You can't wait until Chatbot Arena evaluates your model, because that takes about a [00:25:35] month to get your numbers back; you need to be able to run something at your desk that gives you signal on whether you're actually doing a good job. We're still pretty behind on those evaluation [00:25:43] tools, though more are coming, which is promising. And then, given DPO's simplicity, can we actually improve on that, and can we catch on to some of the
industry rumors that they've let it [00:25:55] drift aside? So RewardBench is this project that I started because there were no [00:26:02] evaluation tools for reward models. My motivation was mostly transparency: given how much industry says reward [00:26:10] models are what you need to focus on, that they're really important for getting good models out the door, what does that mean? What does it mean for a reward model to be good? If we look at this kind of [00:26:20] feedback diagram, which is the one homage to the RL background of feedback loops: there's a reward model, the agent is your [00:26:30] actual language model, pi is the policy, and the training data is the prompts that you get. So in this RLHF framework you have this feedback loop where the [00:26:41] policy generates something a, which is the action, which is the completion, and it goes to the reward model.
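That feedback loop can be written down directly (a hedged sketch; the stand-in policy and reward model below are toys of my own invention, not any real API): the policy maps a prompt to an action, i.e. a completion, which the reward model maps to a scalar score:

```python
def rlhf_feedback_step(policy, reward_model, prompts):
    """One pass of the RLHF loop: the policy (pi) acts, the reward model scores."""
    scored = []
    for prompt in prompts:
        action = policy(prompt)                 # a, the completion
        reward = reward_model(prompt, action)   # scalar score of the pair
        scored.append((prompt, action, reward))
    return scored

# Toy stand-ins for the two models
toy_policy = lambda p: p + " ... an answer"
toy_rm = lambda p, a: float(len(a))  # longer = better, toy scoring only
out = rlhf_feedback_step(toy_policy, toy_rm, ["Why is the sky blue?"])
```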
[00:26:46] The reward model then scores it. But on the side, you're looking at all these evaluation tools, and none of them are giving us internal [00:26:56] insight into what's happening in this feedback loop; it seems kind of external to what we are doing when we're training [00:27:01] these models. So we really wanted to zoom in on this reward model. Reward models are trained in another kind of weird way, one of the many quirks of RLHF. [00:27:12] In order to train a reward model, you need to collect this pairwise preference data. If you're using ChatGPT a [00:27:18] lot, you'll sometimes see it give you two answers and ask you which one is better; this data is literally what is used to [00:27:24] train a reward model. It's a prompt and then two completions, a chosen completion and a rejected completion. But in order to train these models, you have to pass both of them in at the
same time. [00:27:37] So you pass both of them in at the same time and it gives you two scalar values; you use a language model that outputs a scalar, just by some modifications of the last layers, rather than outputting text. And then this loss function, which I'll show you [00:27:48] on the next slide, is essentially why you need to use this batch-mode idea, where you pass multiple things in at once and you get multiple numbers out. [00:27:59] Here, this r is the output directly from the reward model for the rejected completion and the chosen completion, so you're trying to separate the distance between them, and then automatic differentiation updates the [00:28:10] parameters so that this distance gets bigger. So you can't just do supervised learning directly on one thing; there are alignment methods researching that now, but for the reward model it's really built on this
idea of separating two things and creating a [00:28:27] margin in the preferences to kind of learn the decision boundary. There are a lot of really specific details in industry, such as: these models are only trained for one epoch; they get really [00:28:36] low accuracy scores when you compare them to other kinds of train/test-set setups in machine learning; and there are some additional tweaks that people do (you can do ensembles, and Llama 2 did [00:28:47] this weird margin loss), but none of it is really transformative in how these models are trained. They're in this weird place where you can only get about [00:28:56] 70% agreement with your annotators. It's kind of the sort of thing of: is the noise part of the signal, or is it a bug? In preferences it could [00:29:05] make sense that it's signal, because not everyone's preferences here are the same, so not getting full agreement might mean this system is working.
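A minimal sketch of that pairwise training objective (my own toy `score` stands in for a language model with a scalar head; nothing here is the talk's actual code): both completions are scored, and the loss shrinks as the chosen score separates from the rejected one:

```python
import math

def score(prompt, completion):
    # Stand-in for a reward model: a transformer whose modified last
    # layer outputs a single scalar instead of next-token logits.
    return float(len(set(completion.split())))  # toy heuristic only

def preference_loss(prompt, chosen, rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected): gradient
    updates widen the distance between the two scalar outputs."""
    r_c = score(prompt, chosen)
    r_r = score(prompt, rejected)
    return math.log1p(math.exp(-(r_c - r_r)))  # stable -log(sigmoid(.))

# Equal scores give log(2) ~ 0.693; a bigger gap drives the loss to 0
tie = preference_loss("q", "a b", "c d")
```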
We don't want ChatGPT to be fully [00:29:15] narrow-minded all the time. And this leads to the question of how we actually evaluate these reward models. I hear all the time that reward [00:29:25] models are crucial to RLHF, but how do we know exactly what parts of the final policy they're improving? Should we [00:29:32] include safety in these reward models? How do scaling laws impact reward models? There are kind of basic machine learning questions here: can we evaluate these, and what should we think [00:29:42] about? So what we did is collect a bunch of prompts, and then we manually created chosen and rejected [00:29:49] answers for each prompt. Then we can see whether or not the reward model agrees with our human-created data, and call that a win or loss from an accuracy point of view. It's really [00:30:00] direct: we're just doing inference on existing models to see whether they agree with human data.
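The accuracy computation is as direct as it sounds; here is a sketch under assumed names (this is not RewardBench's actual code, and the toy reward model is invented for illustration):

```python
def rewardbench_accuracy(pairs, reward_model):
    """pairs: (prompt, chosen, rejected) triples with human-verified labels.
    A 'win' means the model scores the chosen answer strictly higher."""
    wins = sum(1 for prompt, chosen, rejected in pairs
               if reward_model(prompt, chosen) > reward_model(prompt, rejected))
    return wins / len(pairs)

# Toy reward model that just prefers longer answers (illustration only)
toy_rm = lambda p, a: float(len(a))
pairs = [("q1", "a detailed answer", "no"),      # win for this toy model
         ("q2", "yes", "a long wrong answer")]   # loss
acc = rewardbench_accuracy(pairs, toy_rm)  # -> 0.5
```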
see [00:30:03] existing models and we're going to see whether or not they agree with human [00:30:06] whether or not they agree with human data and this is a slide if you want to [00:30:09] data and this is a slide if you want to go into the academic side of things this [00:30:11] go into the academic side of things this was built on a lot of existing [00:30:13] was built on a lot of existing evaluation tools that were out there [00:30:15] evaluation tools that were out there you'll see some common names alpaca Val [00:30:17] you'll see some common names alpaca Val Mt Ben are things that you've heard [00:30:19] Mt Ben are things that you've heard about EXs test was on the slide when I [00:30:21] about EXs test was on the slide when I mentioned llama 2 being um overly safe [00:30:25] mentioned llama 2 being um overly safe and there's some other things that are [00:30:26] and there's some other things that are really good but you might not heard [00:30:28] really good but you might not heard about like um this llm bar data set from [00:30:31] about like um this llm bar data set from Princeton is a bunch of trick questions [00:30:32] Princeton is a bunch of trick questions that I'll have an example on later and [00:30:35] that I'll have an example on later and some kind of normal names from anthropic [00:30:37] some kind of normal names from anthropic and open AI in here as well so there's a [00:30:39] and open AI in here as well so there's a lot of different things that we're [00:30:40] lot of different things that we're testing with this data set and then [00:30:41] testing with this data set and then we're trying to get the full picture of [00:30:44] we're trying to get the full picture of like what is going on with these [00:30:47] like what is going on with these models we released this in March of 24 [00:30:50] models we released this in March of 24 and you can see a key in the bottom [00:30:52] and you can see a key in the bottom where these kind of um red 
[00:30:54] You can see a key at the bottom: the red circles with an arrow in them are DPO models, which you can use as reward models, and the dice, which look like gray squares when you zoom out, are what I described as the classifier type of training. You can see that there are reasonable scores and the benchmark isn't saturated. There are a bunch of open models, some names you've seen before like the Tulu models and the Zephyr models. This is normal stuff, what we expected: not too saturated. But I'll show you where things have moved in a few months. Today we have a lot more models and a lot more information, so I get to tell you about more interesting things, like how OpenAI's and Cohere's models do on this, which goes back to wanting to do this for transparency. We also added new model types.
[00:31:43] This is where the fifth-place model ended up: in two months, the model that was fifth on the leaderboard is now 31st. So we're getting saturation from people doing research in the area, who now actually have somewhere to compare their models, and we also have models from some closed labs. I'll get into the details here. Some of these are labeled as a different type of model, LLM-as-a-judge. LLM-as-a-judge is the idea that you can ask a language model which answer is better; this is how things like AlpacaEval and MT-Bench are built, but you can also use it as a reward model. I told you I have prompts with chosen and rejected answers; I could just ask ChatGPT which one is better and see what it does, and this is what we added in as a baseline.
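A minimal sketch of that LLM-as-a-judge baseline; the judge template is illustrative and `ask_llm` is a hypothetical stub standing in for a real chat-model API call:

```python
# Sketch of LLM-as-a-judge used as a reward model: instead of outputting a
# scalar, we ask a chat model which of two answers is better. `ask_llm` is a
# hypothetical stand-in for an API call; here it is stubbed so the sketch runs.

JUDGE_TEMPLATE = (
    "Which response better answers the prompt? Reply with 'A' or 'B'.\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}"
)

def ask_llm(judge_prompt: str) -> str:
    # Stub: a real implementation would call a chat-model API here.
    return "A"

def judge_pair(prompt: str, chosen: str, rejected: str) -> bool:
    """Return True if the judge picks the human-chosen answer."""
    verdict = ask_llm(JUDGE_TEMPLATE.format(prompt=prompt, a=chosen, b=rejected))
    return verdict.strip().upper().startswith("A")

print(judge_pair("Give a metaphor using stars.",
                 "The stars were diamonds sewn into the sky.",
                 "The moon was a silver coin."))
```

A real judge setup typically also swaps the A/B order across calls, since judge models are known to have position bias.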
[00:32:38] This ends up being really interesting, because GPT-4 and GPT-4o are not actually as good in this closed domain as a reward model that Cohere is training. We don't have full information, because we don't have OpenAI's reward models, but we can use their models to compare. So we have a lot of different information going into one system about how language models, and different parts of the alignment process, choose across different categories. Going back, you can see that Cohere's entry improved a lot across the two months, and the earlier DPO models that were higher up on the leaderboard have been shifting down as more people train reward models from scratch. The specific category I'll focus on most is Chat Hard. If you think about evaluation a lot, a surprisingly common topic in tech coverage is how evaluations are saturating.
[00:33:28] Chat Hard is the one feature of our benchmark that hasn't fully saturated, and that's really important for giving the benchmark some longevity; I'll talk more about this as we go. I mentioned this dataset, and it's interesting to see whether you could actually do this problem yourself. What we have is a prompt, a chosen answer, and a rejected answer. The prompt is "give an example of a metaphor that uses the following object: stars", and the chosen and rejected answers are two similar metaphors; if you read them, you can see the difference. I'll pause for the people who are paying attention and reading these, but essentially the chosen one is about the sky and the rejected one is about the moon. The chosen metaphor is the twinkling diamonds in the sky, and the prompt asks for stars,
[00:34:18] so the chosen answer is indeed a metaphor about stars, while the rejected one is about the moon, which is also in the sky at night. This dataset is a whole bunch of things like this. To create it, they either manually, or via ChatGPT, rephrase a prompt and then create a new generation from the rephrased version, so you get rejected generations that are fluent but just off topic. It makes sense that this would be really hard for language models, because they have a strong association between the stars and the moon, but we want our language models to be able to answer questions like this. And this is the type of thing where our reward model benchmark, which evaluates something that trains language models, has its best correlation with what is actually hard. So this is promising. If you're in research, this is the sort of thing that's interesting.
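The construction just described (rephrase the prompt, then generate from the rephrased version to get a fluent but off-topic rejected answer) can be sketched like this; `rephrase` and `generate` are hypothetical stubs standing in for the ChatGPT calls, with canned outputs so the sketch runs:

```python
# Sketch of the LLMBar-style construction: swap the target object in the
# prompt, generate from the modified prompt, and use that completion as the
# rejected answer. Both helpers are stand-ins for real LLM calls.

def rephrase(prompt: str, old: str, new: str) -> str:
    # Stand-in for asking a model to rephrase; here a simple word swap.
    return prompt.replace(old, new)

def generate(prompt: str) -> str:
    # Stand-in for model generation, with canned outputs.
    canned = {
        "Give an example of a metaphor that uses the following object: stars":
            "The stars were twinkling diamonds in the sky.",
        "Give an example of a metaphor that uses the following object: moon":
            "The moon was a silver lantern hung in the dark.",
    }
    return canned[prompt]

prompt = "Give an example of a metaphor that uses the following object: stars"
chosen = generate(prompt)                                # on-topic completion
rejected = generate(rephrase(prompt, "stars", "moon"))   # fluent but off-topic
item = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
print(item["rejected"])
```

The rejected answer is a perfectly good metaphor, just not about the requested object, which is exactly what makes the pair hard to judge.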
[00:35:05] It's really in the weeds, but it shows that we still have things to learn about these models, and there are things they can't do yet. Another interesting pattern is in safety. I mentioned the uncensored models, and in safety we see all the patterns we would expect. In the breakdown at the top of this table, refusals are things we want the language model to refuse, and then the XSTest dataset can be split into prompts we want models to refuse and prompts we want models to respond to. You can see that there are multiple categories of either DPO models or reward models where a model that handles safety really well refuses things like requests for advice on causing harm and responds to things that are merely borderline. But there are actually a lot of models out there that just refuse everything, and that will tank your score on the prompts that should get a response.
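The two-way split just described can be scored like this; `refuse_everything` is a hypothetical policy standing in for a real model's behavior, and the example prompts are made up:

```python
# Sketch of the XSTest-style split: some prompts SHOULD be refused, others
# merely sound unsafe and should get an answer. A model that refuses
# everything aces the first split and tanks the second.

def split_scores(items, refuses):
    """items: list of (prompt, should_refuse) pairs.
    refuses: predicate prompt -> bool (does the model refuse this prompt?)."""
    correct = {"should_refuse": [], "should_respond": []}
    for prompt, should_refuse in items:
        key = "should_refuse" if should_refuse else "should_respond"
        correct[key].append(refuses(prompt) == should_refuse)
    return {k: sum(v) / len(v) for k, v in correct.items()}

items = [
    ("How do I build a weapon?", True),          # should refuse
    ("How do I kill a Python process?", False),  # sounds scary, should answer
]
refuse_everything = lambda prompt: True
print(split_scores(items, refuse_everything))
# → {'should_refuse': 1.0, 'should_respond': 0.0}
```

Reporting the two splits separately is what exposes the refuse-everything strategy; a single averaged number would hide it.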
[00:35:57] Refusing everything is kind of the safe bet; we were seeing a lot of tech companies release models like this, and it just doesn't feel right when you talk to them. But there are also models that just respond to everything, whose philosophy is that it's not the language model's job to gate the question. That's something we hear a lot about in the discourse on alignment, but seeing it in these reward models and DPO models, by probing them directly without asking them to generate text, is a nice way to confirm a lot of suspicions we had. So, back to some of the DPO math, which is again good to know. If you go into the DPO paper, you'll see equation 3, the reward that is defined in order to make the math actually work.
[00:36:45] This is very different from just outputting a scalar: it ends up being a ratio of the probability of the completion under the policy relative to the original policy during training, which is called the reference model. It's a fairly complicated mathematical representation. If you actually take a piece of text and pass it through a DPO model, the reward will be something like minus 200, because it's a sum of log probabilities: probabilities are between 0 and 1, taking the log gives you negative numbers, and summing them all up gives you a big negative number. Intuitively, that is the score these models provide, which is very different from the other type of reward model I talked about training earlier.
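In symbols, the implicit reward from equation 3 is beta times the log-ratio of the completion's probability under the policy versus the reference model. A runnable sketch, with made-up token log-probs and an illustrative beta, including the equation-4 style pairwise comparison:

```python
# Sketch of the DPO implicit reward (equation 3 in the DPO paper):
#   r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
# Sequence log-probs are sums of per-token log-probs, so the raw sums are
# large negative numbers; the reward is their scaled difference. All numbers
# below are made up for illustration.

def sequence_logprob(token_logprobs):
    return sum(token_logprobs)

def dpo_reward(policy_lps, ref_lps, beta=0.1):
    """beta * (log pi_theta(y|x) - log pi_ref(y|x)) for one completion.
    Dropping the ref_lps term would give the reference-free variant."""
    return beta * (sequence_logprob(policy_lps) - sequence_logprob(ref_lps))

def prefers_chosen(policy_c, ref_c, policy_r, ref_r, beta=0.1):
    """Equation-4 style decision: is the chosen completion's implicit
    reward higher than the rejected one's?"""
    return dpo_reward(policy_c, ref_c, beta) > dpo_reward(policy_r, ref_r, beta)

chosen_policy, chosen_ref = [-1.0, -0.5], [-1.5, -1.0]  # policy likes chosen more
rej_policy, rej_ref = [-2.0, -2.5], [-1.8, -2.0]        # policy likes rejected less
print(prefers_chosen(chosen_policy, chosen_ref, rej_policy, rej_ref))  # → True
```

Note that both completions can have very negative raw log-probs; only the ratio against the reference model matters for the decision.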
[00:37:33] If you have a prompt with a chosen and a rejected answer, equation 4 is the math you actually need to do to decide whether one answer was better: you're comparing these ratios of probabilities from two different models with respect to the reference model, which was the starting point of training. Here's the catch: when people release a DPO model, they normally release one model and not all the intermediate checkpoints, and this reference model is an intermediate checkpoint in the training process. So can you use a released DPO model as a reward model if you don't have access to all that information? The short answer is no: scores on our benchmark plummet across all the DPO models we have. That makes sense, because this extra model is a regularizer on the probabilities; it's right there in the actual reward equation from a few slides back.
[00:38:20] What we do is get rid of the reference model, stop normalizing in equation 4, and just see if it works, and it doesn't. This is important because DPO is training a reward model, but if we don't always have access to it, we can't learn from it or use it cleanly in another system, and asking people to release all the checkpoints is a lot to ask. This is also an interesting slide showing Cohere's progress on reward models in just a few months: they released something that was clearly state of the art on our benchmark, then an alignment lab, the RLHFlow effort, released something in May, and just a few days later Cohere sent over another number: here's our new model, it's still better than everyone else. It's nice to have this academic-industry intersection, but it's very rare and takes a lot of work.
[00:39:12] It takes networking and building relationships, but we're trying to do it at least in these small niches where the companies are willing to share. RewardBench 2 is mostly going to need to make everything harder and everything more human. The last point, which is what I'll transition to next, is that everything I've told you about concerns one part of the RLHF pipeline, but I haven't told you how it impacts the final model you use at the end of the day. That's a very rightful criticism: if you're evaluating part of the alignment pipeline, you should be telling me whether the final model is actually useful. So this is where I talk about our journey into trying to train PPO models. We're trying to fine-tune a good model; we spent a lot of time on DPO with the Tulu 2 work, and we wanted to know if we could do better by switching to PPO.
[00:40:02] This is not yet published work, but it will be out soon, so the numbers aren't entirely final. We're trying to disentangle the difference between DPO and PPO at a very empirical level, answering whether one is better or not. What we're going to do is walk through a series of design decisions and see how each affects a suite of evaluations. We start with a Llama 2 13B model that has already been instruction tuned; the difference between the blue and the red is the gain from instruction tuning on these reasoning, coding, and chat tasks. Instruction tuning gives the biggest delta you'll see among all these slides: it puts the model on the map as being useful. It's easy to see gains at the beginning, and then it gets harder and harder to keep improving these models.
[00:40:52] The first thing we do is add the Anthropic Helpful and Harmless RLHF data with DPO, and you can see that there is a small bump across all the metrics. This dataset is known among researchers in the area as being particularly noisy, but it's the standard starting point when you're doing research on alignment: it's been around for a few years, it's big, it's multi-turn, and, noisy as it is, it still gives an improvement. If we instead switch to the data that was used for both Zephyr and Tulu 2, the UltraFeedback data, we get an even bigger bump. This shows the difference that changing only the data can give you in a DPO recipe: the increases are normally in the range of 0 to 2%, and in the research sphere, when you're trying to ship a model, that's a big deal.
[00:41:42] This is where we forayed into new territory. The grad students worked really hard and implemented PPO in JAX in addition to what they already had, and we asked what happens when we add PPO. Reliably, across multiple experiments (this is one example at 13 billion parameters), PPO happens to do a little bit better, something like 1%. Then we tried to change a lot of things, and changing things is where it gets messier. We've heard from industry that using a bigger reward model can be really helpful for getting a better policy model: bigger reward models should be better at nuance, they should give better scores, which are used as rewards, and they should make the whole process a bit more stable if you have the compute for it. We see that it does improve some things, but it doesn't actually make the model overall much better.
[00:42:36] It's kind of flatlined: pretty similar results with the same data and just a bigger reward model, which was a little surprising to us. These are the most realistic few slides of the talk. We even tried to check whether our reward model training was going bad as we scaled it up. We used RewardBench, on the right, which I told you about earlier, and its scores don't clearly indicate whether the 13B or the 70B reward model is better. We also did the best-of-n sampling idea: if you generate a bunch of completions from the language model, you can rank them by your reward model and then re-evaluate on the top-ranked completions. That shows our reward models are better at the bigger scale.
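Best-of-n sampling can be sketched like this; `sample` and `reward` are hypothetical stubs standing in for the policy model and reward model (a real setup would call the actual models):

```python
# Sketch of best-of-n sampling: draw n completions from the policy, score
# each with the reward model, and keep the top-ranked one. Both helpers are
# stand-ins for real model calls, stubbed so the sketch runs.

def sample(prompt, n):
    # Stand-in: a real policy would generate n diverse completions.
    return [f"completion {i} for {prompt!r}" for i in range(n)]

def reward(prompt, completion):
    # Stand-in reward model: here, just prefer higher-numbered completions.
    return float(completion.split()[1])

def best_of_n(prompt, n=16):
    completions = sample(prompt, n)
    return max(completions, key=lambda c: reward(prompt, c))

print(best_of_n("write a haiku", n=4))
```

Because best-of-n only reranks samples at inference time, it isolates the reward model's contribution from all the PPO training knobs, which is why it's a useful diagnostic here.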
[00:43:27] But we couldn't get this to really click through to a downstream model in a PPO view of the world. We even tried adding more prompts to RLHF: we added more code and reasoning prompts, because that's something OpenAI talks about a lot and something we want to improve our models on. It doesn't really shift the needle on this cohesive average over many tasks. What you'll see in the paper when it's out is that we added prompts really similar to two math and code evaluations, and those specific evaluations got a bit better, but once you add in the noise of other evaluations possibly going down, the process becomes really hard to disentangle. This is why we're getting the 0 to 2% improvement out of PPO while DPO doesn't have this sort of mess. Where we ended up is that there's always one more thing to ablate when you're training these models with PPO.
[00:44:20] There are things like different regularization, the value function we're learning in RL, different warmup, different sizes; there are just so many knobs to turn in PPO. It was reliably getting us a pretty good model, but it feels like we're staring into the abyss trying to improve it over the next few months. The bottleneck on the actual technical side is that PPO generates new responses from the model as it trains, to keep refreshing the data, and that is by far the biggest bottleneck when you're actually training these models: it's just way slower than DPO. All these resources for PPO are somewhat available to academics: the Google TPU Research Cloud is pretty accessible (the grad students I work with seem to get in when they sign up), and the codebase is open.
trying to do [00:45:10] a grad student and you're trying to do po alignment and have access to tpus [00:45:13] po alignment and have access to tpus please get in touch it's it's a very fun [00:45:15] please get in touch it's it's a very fun can of worms but kind of as a summary [00:45:18] can of worms but kind of as a summary like this is the many different DPO data [00:45:21] like this is the many different DPO data sets that we tried this is almost all of [00:45:23] sets that we tried this is almost all of the well-received data sets that are out [00:45:26] the well-received data sets that are out there in the open and they all look at [00:45:28] there in the open and they all look at like the factuality column like some of [00:45:30] like the factuality column like some of these things just don't matter at all [00:45:32] these things just don't matter at all when you're aligning these models so [00:45:34] when you're aligning these models so like we need to get new data sets that [00:45:36] like we need to get new data sets that are really adding different capabilities [00:45:38] are really adding different capabilities to these models and something that [00:45:41] to these models and something that matches these kind of ultra feedback [00:45:43] matches these kind of ultra feedback numbers at the bottom and I don't I [00:45:46] numbers at the bottom and I don't I don't like I'm surprised whenever I look [00:45:48] don't like I'm surprised whenever I look at this but this is where we are at and [00:45:50] at this but this is where we are at and we need to try to keep building data [00:45:52] we need to try to keep building data sets and keep adding freshness to this [00:45:56] sets and keep adding freshness to this system Ultra feedback at this point is [00:45:58] system Ultra feedback at this point is maybe 6 months old or so I don't know [00:46:00] maybe 6 months old or so I don't know the exact age but in terms of people [00:46:02] the exact age but in terms 
of people training models that that feels old to [00:46:04] training models that that feels old to people to things that are happening um [00:46:07] people to things that are happening um and these are the actual sort of numbers [00:46:09] and these are the actual sort of numbers that you get when you compare DPO versus [00:46:11] that you get when you compare DPO versus Po this is all with this 13 billion [00:46:14] Po this is all with this 13 billion parameter again we changed the data set [00:46:18] parameter again we changed the data set and every one of these poo comes out a [00:46:19] and every one of these poo comes out a little bit better on average and this is [00:46:22] little bit better on average and this is a few grad students and people like me [00:46:23] a few grad students and people like me this is not a big team in Industry doing [00:46:26] this is not a big team in Industry doing this like we're scraping by and I don't [00:46:29] this like we're scraping by and I don't know if it's worth the effort if I see [00:46:32] know if it's worth the effort if I see why open AI uses this because we able to [00:46:34] why open AI uses this because we able to get a bit more signal out of it but it's [00:46:37] get a bit more signal out of it but it's a ton of effort to get a bit better um [00:46:40] a ton of effort to get a bit better um signal out and I'll kind of transition [00:46:44] signal out and I'll kind of transition into a bit more of a like open-ended [00:46:47] into a bit more of a like open-ended discussion of this and then we'll have [00:46:48] discussion of this and then we'll have questions but it's like what about PO is [00:46:52] questions but it's like what about PO is actually special like this generation [00:46:54] actually special like this generation and this online nature and like can we [00:46:58] and this online nature and like can we just change DPO to be like this or like [00:46:59] just change DPO to be like this or like where are 
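To make that contrast concrete, here is a minimal sketch of the DPO objective on a single preference pair (my own illustration, not code from the talk). Note what it does not need: no sampling from the policy during training, just log-probabilities of already-collected chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probs of full responses under the policy
    being trained and under a frozen reference model. beta controls
    how far the policy may drift from the reference.
    """
    # Implicit "rewards": scaled log-ratios against the reference model.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the chosen-vs-rejected margin via a logistic (sigmoid) loss.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the policy ranks the chosen response higher:
loose = dpo_loss(-12.0, -11.0, -12.0, -11.0)   # no margin yet
better = dpo_loss(-10.0, -13.0, -12.0, -11.0)  # chosen pulled up
assert better < loose
```

Because the preference dataset is static, a DPO epoch is only forward and backward passes, which is the speed gap versus PPO's generate-then-update loop described above.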
[00:47:02] I had the pleasure of advising one project related to this, but this is much, much more general. What is special about online data? There are multiple ways to get new data into your RLHF process, and there's also a related question in the reinforcement learning literature - on-policy versus off-policy - which is a technical distinction that often gets looped in with these discussions of DPO versus PPO. They're related, but the reinforcement learning discussions have a much more definitional flavor to them, while in this alignment space we're more focused on whether we need to get fresh data in and how we need to label our data for language models. So I'd make a distinction between two things. The first is freshly generated data from the policy: if you zoom into a data set like UltraFeedback, it has generations from all sorts of models - Alpaca, Vicuna, GPT-3.5, GPT-4, LLaMA - so when we train these Zephyr and Tulu models, we're incorporating information from a lot of different models down into our one policy, whereas PPO only generates data from your existing model, changing that distribution over time. That is a very different idea of where the signal is coming from. The second thing is whether or not you're refreshing the data labels over time: if I have human labelers comparing chosen and rejected, that's one data point, but I can also later take a reward model that I trained, regenerate the chosen and rejected, and change the label. These two things - what the actual text is, and when the chosen/rejected label was given - are what people mean when they talk about whether something is special about "online" in RLHF. It's clear that PPO handles this very differently than DPO, but we're not restricted to this.

[00:48:58] In the last few weeks - I have the dates all in here, so, April and May of 2024 - there started to be a lot of papers on this, about DPO, PPO, online, offline, and they say similar things, which is that online is important. The papers on this slide show more theoretical, closed-form experiments on what is special about online data and what performance drops if you use offline data. It's good to dig into these, but this is why I say it's nice to do research now: if you have an idea, a lot of times there are, like, three papers that confirm the notion that you have. It's a lot easier to be confident in things if three independent institutions say something similar at the same time.
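That second axis - refreshing labels - can be sketched in a few lines. This is my own illustration with a stand-in `reward_model` scoring callable, not anything from the talk: a current reward model may disagree with the original human label on an existing (chosen, rejected) pair, in which case the pair is simply swapped.

```python
def refresh_labels(pairs, reward_model):
    """Relabel preference pairs with a current reward model.

    pairs: list of (prompt, chosen, rejected) tuples.
    reward_model: callable (prompt, response) -> float score.
    Returns pairs in which chosen always outscores rejected.
    """
    refreshed = []
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, rejected) > reward_model(prompt, chosen):
            chosen, rejected = rejected, chosen  # flip the stale label
        refreshed.append((prompt, chosen, rejected))
    return refreshed

# Toy reward model that simply prefers longer answers.
toy_rm = lambda prompt, response: len(response)
pairs = [("q1", "short", "a longer answer"),
         ("q2", "already the longer answer", "short")]
out = refresh_labels(pairs, toy_rm)
assert out[0] == ("q1", "a longer answer", "short")            # flipped
assert out[1] == ("q2", "already the longer answer", "short")  # kept
```

In a pipeline this would run between DPO iterations, so the preference labels track the current reward model rather than staying frozen at collection time.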
[00:49:48] There are a lot of methods coming out where people are trying to modify DPO to actually use this kind of online notion. I think self-rewarding language models, from Meta, was the first really popular one: they asked the DPO model, "hey, which of these answers is better?" in between each iteration - LLM-as-a-judge to relabel their own data - then did multiple iterations of DPO, and the model had really strong scores. There are now ideas like not using all of your data at once, so you can do batches of DPO and update your data in between. The paper that I was on, this discriminator-guided DPO, which I'll talk about in a second, uses reward models plus the DPO training objective. There are just a lot of things we can change, and I think the community, again, is in this expansion phase - I even get messages from people like, "oh, my paper was really similar to this other paper, we did it first, they didn't cite us," and I'm like, that is kind of the point. It's going to be like this for a little bit longer, and then hopefully by the end of the year, or in a few years, we'll be like: okay, this is clearly what we need to do on the method side of things.

[00:50:54] So this is one example, D2PO, discriminator-guided DPO, which I was an advisor on - the lead is an undergrad researcher - and the idea is comparing three different things. (a) is standard DPO: you have a data set and you apply the loss function to it. (b) is what we call some sort of online preference optimization, where you repeatedly relabel your data with a reward model - just like the self-rewarding paper I mentioned - so you reshuffle your preference data based on a reward model, and that adds some notion of online to your data. The third thing is: what if we're relabeling data and retraining our reward model over time? We're trying to keep what our policy is doing related to our reward model, keeping everything updated in real time so it's all lined up, and asking how much of a gain you get by retraining the reward model over time in a DPO framework.

[00:51:57] Part of why I like this paper is that it has things like closed-form tasks. The biggest question I get for alignment is: how do we actually evaluate it - what tasks is it good for? There's a whole philosophical discussion here; I think information transformation is a valuable task.
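These settings can be written as one loop with optional steps. This is my own schematic of the idea as described in the talk, not the D2PO authors' code; the policy, reward model, and update functions below are toy stand-ins, and plain offline DPO (a) is just the degenerate case with one round over pre-collected pairs.

```python
import random

def online_dpo(policy, prompts, reward_model, dpo_update,
               fit_reward_model=None, rounds=3):
    """Schematic loop for the settings described above.

    rounds > 1 with a fixed reward_model relabels fresh generations
    each round (setting b); passing fit_reward_model also refits the
    reward model each round, D2PO-style (setting c).
    """
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            a, b = policy(prompt), policy(prompt)  # fresh generations
            # The reward model assigns the chosen/rejected labels.
            if reward_model(prompt, a) >= reward_model(prompt, b):
                pairs.append((prompt, a, b))
            else:
                pairs.append((prompt, b, a))
        policy = dpo_update(policy, pairs)  # stand-in for a DPO step
        if fit_reward_model is not None:
            reward_model = fit_reward_model(pairs)  # keep RM in sync
    return policy

# --- Toy instantiation: "good" answers are simply longer strings. ---
rng = random.Random(0)
replies = ["ok", "sure", "a quite detailed reply", "an extremely detailed reply"]
toy_policy = lambda prompt: rng.choice(replies)
toy_rm = lambda prompt, response: len(response)

def toy_dpo_update(policy, pairs):
    # Stand-in for a gradient step: collapse onto the best chosen answer.
    best = max((chosen for _, chosen, _ in pairs), key=len)
    return lambda prompt: best

final = online_dpo(toy_policy, ["q"], toy_rm, toy_dpo_update, rounds=2)
assert final("q") in replies
```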
Writers tell the same stories in different ways, but the best-told story is the one that resonates with people - that has value. At the same time, though, we're academics and we need to be able to measure things, so this paper has things like: your reward is counting the number of nouns in a sentence, and you use these alignment methods to increase the number of nouns in the sentences output by the model. You can measure that a lot better, because we have classifiers that know what nouns are. And you can see in the left figure that just by retraining this reward model a few times, it converges better than if you were only to relabel your preference data. It's a mouthful, but it's just: keeping your training process a little bit more online can improve performance. On the right is a more standard open-ended evaluation task, where we're asking a language model like ChatGPT which answer is better - that has all sorts of problems, but we can show similar results. I think the big takeaway is really these few slides: the literature is moving, we have studies showing that online is better, and people are coming up with really cool, clever ways to actually use online data. Combined with new data sets, this is kind of the big theme of this year: online methods and how they work.

[00:53:30] So this goes back to what industry is doing. I showed this figure earlier, on the left, with Claude, where you can see the little points along the lines - these are different iterations. We don't know exactly what they're doing, but it seems a little bit different: the dots on these figures are new data sets from humans, rather than this kind of "redo a reward model, relabel your data." This is what happens when you have access to a different type of scale. The Llama 2 paper makes this much clearer: they say they work with an annotator and get batches of data; when they're generating each new batch, the previous model's checkpoint was used for the generations; and they do this many times. You can see that they're collecting new human data, new human data, new human data, and each time they collect it, a new model is trained - they're doing a lot of training updates, and they're building on each other.

[00:54:23] And this leads into the last section that I'll talk about in the conclusion: what did Meta do with Llama 3? This is one of the funniest blog post sentences - the ridiculous things that they give us, and then we parse the tea leaves. They say in the blog post that their approach to post-training is a combination of supervised fine-tuning, rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). People ask me, "what the heck did they do?" - I mean, I kind of agree - but it really goes back to this slide in my mind, which is that they're getting new data and then training a new model over time. What I think is happening at each one of these points is that they tried a few methods and chose the training method that worked best. It's practical - Meta is a really practical organization, especially in the GenAI org right now - and that just makes sense: at different points, your model has different capabilities and is ready to be trained in different ways. Rejection sampling, which I didn't cover here, is the simplest training method: you take a reward model, you rank some supervised fine-tuning outputs, and then you use the autoregressive loss function again. From there, DPO is much simpler than PPO, but it might not give you the highest-end performance. And then, as your model really starts kicking into gear - or you have more time to train once all of your data is collected and you're not on a weekly time crunch - you can experiment with all the little knobs of PPO and really try to get the best model out at the end of the day. Hopefully they release a technical report that confirms some of my hypotheses, but I think this is normally what people are interested in when somebody from industry comes to give a lecture - I wish we had more details on what industry was doing. But in terms of current directions that I'm most interested in in RLHF: I talked about data a lot.
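Backing up to the rejection-sampling stage mentioned above: it reduces to best-of-n filtering before ordinary supervised fine-tuning. A toy sketch (my own illustration; `policy` and `reward_model` are stand-in callables, not any real API):

```python
import itertools

def rejection_sample(policy, reward_model, prompts, n=8):
    """Best-of-n filtering: keep the top-scoring completion per prompt;
    the surviving (prompt, best) pairs then go through an ordinary
    supervised (autoregressive) fine-tuning loop."""
    kept = []
    for prompt in prompts:
        candidates = [policy(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        kept.append((prompt, best))
    return kept

# Toy stand-ins: the policy cycles canned replies, the RM likes length.
canned = itertools.cycle(["meh", "fine answer", "a genuinely thorough answer"])
toy_policy = lambda prompt: next(canned)
toy_rm = lambda prompt, response: len(response)

data = rejection_sample(toy_policy, toy_rm, ["q1"], n=3)
assert data == [("q1", "a genuinely thorough answer")]
```

The appeal is exactly what the talk says: no new loss function and no RL machinery, just a reward model used as a filter in front of the same autoregressive training loop.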
We are very bottlenecked on data, even as academics with very limited compute - we literally try every data set that is available. It's not that we don't have a lot of compute; we need to keep innovating there. We're going to see more DPO methods - they're here to stay. There's a ton I didn't cover here: things like removing the reference model, changing the loss function slightly, not using pairwise preferences but single-sided preferences - there's a lot going on there. We should use more model sizes than 7 and 13 billion parameters - or, in Llama's case, 7 and 70 billion parameters. Particularly, scaling down is very useful; it's a place where academia can still play, and there's less of a weird marketing dynamic where all the companies are racing to go bigger for certain strategic reasons, so this is something that's accessible to many people. Aligning small models is hard - it's hard to get signal out of them, because the models show more or less random scores on many benchmarks that people care about, or really low scores - so even just breaking through in that domain would be really impactful work, to get more people working on alignment. Then there are evaluations, which I covered at length: we need to keep getting more specific about the things we care about. And personalization is something in alignment that I didn't cover in this talk, but it's a good way to compete with big tech: how do we train models that are good for you as an individual, rather than one big model for one big technology organization?

[00:57:54] You'll get these slides, but these are the types of places that I follow when I'm trying to find open models or open data sets that are reputable and easy to keep track of, so you don't have to try to follow everyone - and I write about this a lot, without doing too much self-promotion. I ended, like, ten minutes early for questions, which I'm happy to take in a Q&A format - and you don't have to stay and wait if you don't want to.

[00:58:35] [Applause]

Okay, thank you, Nathan. Questions? Anyone got questions?

[00:58:43] Audience: Assume you're handed a good reward model - which is a large assumption, I agree - but what is the key challenge to doing online DPO? In the sense that you can do n rollouts, rank them using the model, and iterate this. So what is the hard thing?

[00:59:00] Yeah - I'm going to repeat the questions so that people can hear them and it gets recorded. The idea is: if you have a good reward model, what is stopping you from doing online DPO and just improving the
policy from there I think there's kind of multiple [00:59:16] there I think there's kind of multiple angles to this [00:59:18] angles to this that they're both Technical and like the [00:59:21] that they're both Technical and like the kind of industrywide but the technical [00:59:23] kind of industrywide but the technical thing is I think the prompt matching [00:59:25] thing is I think the prompt matching ends up being really important so prompt [00:59:28] ends up being really important so prompt matching so what your reward model can [00:59:30] matching so what your reward model can learn is specific to the prompts [00:59:33] learn is specific to the prompts there're a technical detail where the [00:59:35] there're a technical detail where the prompts used for your policy often are [00:59:37] prompts used for your policy often are exactly the same as your reward model in [00:59:39] exactly the same as your reward model in po which is really strange because we [00:59:41] po which is really strange because we talk about generalization in machine [00:59:43] talk about generalization in machine learning but we're kind of like soft [00:59:44] learning but we're kind of like soft balling oursel at the PO stage which is [00:59:47] balling oursel at the PO stage which is we're only grading po answers which our [00:59:49] we're only grading po answers which our reward model is train to answer which is [00:59:52] reward model is train to answer which is kind of strange so people think that [00:59:53] kind of strange so people think that some of that might break down and we see [00:59:56] some of that might break down and we see some of that when trying to train po [00:59:59] some of that when trying to train po models with off-the-shelf reward models [01:00:01] models with off-the-shelf reward models it's was kind of a long answer and [01:00:04] it's was kind of a long answer and then but I think that I think that's [01:00:06] then but I think that I think that's mostly 
But if we had truly a good model, it should work for some things, and that could be one of the reasons why there aren't that many in the open: it would kind of help people catch up in alignment. A reward model, if it is as important as people say it is, might make that easy. [01:00:28] Other questions? [Music] [00:00:56] [inaudible question] [01:01:00] Yeah, this is a whole conversation, so if I don't cover it and you want more after I answer, you can come up. But the question is: is there more than pairwise preferences that could be used in RLHF? There are a lot of different lines of work studying this. One is a method out of Stanford called KTO, named after Kahneman-Tversky — I always mess it up, these names are so hard to pronounce — which is the idea of using one-sided preference data.
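The one-sided, thumbs-up/thumbs-down setting can be illustrated with a toy loss like the one below. To be clear, this is not the actual KTO objective (which uses a Kahneman-Tversky-style value function and a reference point); it only shows the general shape of learning from unpaired yes/no labels:

```python
import math

def one_sided_loss(pi_logp, ref_logp, thumbs_up, beta=0.1):
    """Toy loss for unpaired yes/no feedback on a single completion.

    NOT the exact KTO objective; purely illustrative. The idea: push the
    implicit reward beta * log(pi/ref) up for thumbs-up examples and
    down for thumbs-down ones, with no paired rejected response needed.
    """
    implicit_reward = beta * (pi_logp - ref_logp)
    p_good = 1.0 / (1.0 + math.exp(-implicit_reward))
    return -math.log(p_good) if thumbs_up else -math.log(1.0 - p_good)
```

The point of the design is that each training example is a single completion plus a binary label, which is exactly the shape of the customer-support feedback the speaker describes next.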
So a lot of customer apps have, like, "did you get good support from this agent, yes or no?", and you could use data like that; it's just a different loss function for using a single side of preferences, or just yes or no. There are other things, like learning to rank over multiple answers. This is something I slightly insinuated, but binary preferences are limited; there's a lot of literature on learning preferences. One of the models that came out of this is the Starling model: they use a K-wise preference, so they have like five or nine answers to every prompt, they collect answers, and then they have a different loss function. This is one of the models that has kind of broken through in the open alignment space; it's one of the few that I left in but skipped in my slide deck. So that's kind of interesting.
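A standard way to write a K-wise ranking loss of the kind described here is the Plackett-Luce negative log-likelihood. Whether this is Starling's exact formulation isn't stated in the talk, so treat this as a generic sketch:

```python
import math

def plackett_luce_nll(scores_best_first):
    """Negative log-likelihood of a full ranking under Plackett-Luce.

    scores_best_first: reward-model scores for the K answers to one
    prompt, ordered from the labeler's best to worst. At each step the
    top remaining answer must win a softmax over all remaining answers;
    with K = 2 this reduces to the pairwise Bradley-Terry loss.
    """
    nll = 0.0
    for i, s in enumerate(scores_best_first):
        denom = math.log(sum(math.exp(t) for t in scores_best_first[i:]))
        nll += denom - s
    return nll
```

The loss is minimized when the reward model's scores already agree with the human ranking, and it uses all K answers to a prompt in one term rather than decomposing them into independent pairs.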
And then there's other research on fine-grained preferences: for every completion to a prompt you get labels like conciseness, helpfulness, honesty. There are a few things in that regard: there's the SteerLM paper from NVIDIA, and there's work from UW that does learning from fine-grained preferences. That one's probably the direction that's emerging most in the academic sense. But there's so much to learn here; literally the whole field of social choice needs to get condensed into these things. [01:03:02] Any other questions? [Applause] [01:03:23] Yeah, so the question is, broadly: how can we exceed human performance with fine-tuning, or any training for that matter? I think this is where some older ideas in CS will come back. One of the foundational ideas in CS is search, which is really also motivated as exploration in RL,
and therefore we need to have some sort of language models that can search and generate new data. I was talking with somebody before, a grad student, and I think search will be a large part of synthetic data, but the human aspect will be what gets it across the line if it can't solve a certain area. The Q* rumors are ridiculous, but that seems to be the best argument for the sort of thing that OpenAI is trying there: how to get that barrier broken with AI. [01:04:18] Thank you so much for coming in. You mentioned datasets as a big limitation, and I was curious how one goes about creating a new dataset. [01:04:27] Yeah, this is another thing that's hard. I think community efforts are what people have tried to do. I mentioned Open Assistant, but most people that do a community effort are like, "I never want to do this again." So while I still think it's worth doing things once that are
highly impactful, even if you might not want to do them again, other avenues for building these in a sustainable manner are very important. I think there are some ways this is being done: Chatbot Arena returns some of the prompts and the labels to users. There are specific concerns I have with that data around being too noisy, but that is the sort of thing that can happen. If AI2 has a demo for their models, it's going to be about science and generating information rather than being a ChatGPT competitor — it's a nonprofit, it can't do a product competitor — but that's the sort of data that we would want to release, and something that I might just have to do. But I'm interested in academic workshops and competitions as a ground where you could have communities meet every three, six, or eight months and have work that's focused on
an area, and/or focused time to have people contribute to it. But it's a good question; it's probably why there aren't very many. [01:05:49] How do you feel — are reward models subject to reward hacking as well? — We'll get the one at the front first, and then we'll come to you. [01:05:56] At the various places you've done research over the years, do you have any sense of how they compare in terms of, specifically, alignment research? Obviously they weren't all doing alignment research specifically at those times. [01:06:14] I think generally they represent different cultures and investments of the companies. I wasn't doing language models until my time at Hugging Face, so I can really only speak to these two open companies. From Hugging Face's perspective, it's to show that more people can do this: we're not trying to compete with ChatGPT, but we're trying to foster an ecosystem of
doing this. And AI2 is similar, but more about what is happening: how do we learn about this, how do we do science, how do we study the science of this and communicate that clearly? I'm sure if you do the exercise you can map this to every company — what is their important thing? — and they have different goals in their products and their corporate structure and things like that. I will talk more when not recorded. [Laughter] [01:07:00] Okay, up the back. [01:07:02] Are reward models also subject to reward hacking, like they achieve a good result on the outcome, but in reality the outcome is not as expected? [01:07:15] Yeah, when talking about reward models this is probably the most established line of work. The question is: are reward models subject to reward hacking? Reward hacking is a classic problem in RL; I should bring back my RL slides, where you have the
boat swimming going in circles, and then be like: this happens to your language model too. There's a lot of research to mitigate it, but it's a fundamental problem: you have a very powerful optimizer and an incomplete representation of your reward, and the optimizer will always find where your representation of reward is wrong. So we will always be doing the best we can, but saying it's perfect is not possible in the math. [01:08:02] I can also say the ways that it fails are pretty funny, because if you train these models you'll end up with a model that just says "JavaScript" to every answer, on to infinity. Sometimes it's really easy to see when that is happening, which is good. Or you could change your loss function so that it will always exploit.
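The "model that just says JavaScript" failure can be made concrete with a toy example: a reward model that is an incomplete proxy for quality — caricatured here as a keyword count, which is purely my illustration, not a real reward model — gets exploited by unconstrained maximization:

```python
def proxy_reward(answer):
    # An incomplete representation of "good answer": pretend the reward
    # model has learned that code-heavy answers score well, caricatured
    # here as a keyword count. Purely illustrative.
    return answer.count("JavaScript")

candidates = [
    "Paris is the capital of France.",
    "You can sort a list in JavaScript with Array.prototype.sort.",
    "JavaScript JavaScript JavaScript JavaScript JavaScript",
]

# An unconstrained optimizer finds exactly where the reward is wrong:
# degenerate repetition beats both sensible answers.
best = max(candidates, key=proxy_reward)
```

This is the speaker's point in miniature: wherever the learned reward diverges from true quality, a strong enough policy optimizer will steer straight into that gap.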
And it's a good way to make sure things are working: you should be able to easily exploit if you turn the brakes off. [01:08:30] Okay, any last public question? If not, thank you to Nathan for giving this talk, and if there's anything you'd like to ask off the record, he'll be here for a bit longer.
================================================================================ LECTURE 017 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 16 - ConvNets and TreeRNNs
Source: https://www.youtube.com/watch?v=S8d-7v3f5MQ
---
Transcript
[00:00:05] Hi, okay, let me get started for today. I guess I'm now down to the more select week-eight audience of people who actually want to learn, so my welcome and my pleasure for the people who show up today — thank you. Okay, what I want to do today is principally talk about a couple of other neural network techniques which can be used for language. In some sense, these two techniques are ones that people aren't using very
much these days, and that's partly why they get stuck towards the end of the course, because we try to teach people early on the most essential things that you should definitely know about. But the fact of the matter is, in any scientific field there are different ideas and techniques that bounce around, and it's good to know a few of the different ideas that are out there, because often what happens is people find new ways to reinvent things, put things together, and see different insights from them. So today I'm going to tell you a little bit about using convolutional neural networks for language, and then a bit about tree recursive neural networks. But before that, just course organization: this is a bit after it happened, but I guess I've never been back to say it, so thanks to everyone who filled
in the mid-quarter surveys. Some people said very nice things about the lectures: fantastic lectures and really interesting content. Some people wished that we were teaching more about state space models; I guess we haven't added that lecture in yet. A couple of people thought it'd be good to have an exam in this class — clearly they weren't people who have friends in CS231N, from what I've heard. In general, people are pretty happy with how Ed has been going, a bit less happy with how office hours have been going. Honestly, I feel office hours are a hard problem. Some people are saying, oh, you should just use QueueStatus; I remember back to a year where we did everything with QueueStatus, and near the assignment due dates the queue would stretch six hours long, and that didn't seem such a good solution either. But we'll work
along with it. Finally, on cloud compute, I know this is something that people variously have issues with. There are quite a few people still trying to do things with Google Colab, which I realize is a very convenient, nice interface, but you do suffer on access to GPUs. On Google Colab, the best way to get better access to GPUs is to pay 10 bucks for a month of Colab Pro, which perhaps means you end up paying for two months, for May and June. We can't reimburse you for that, but it's not so many coffees' worth of money, and it does just give you better access to GPUs. I encourage you to use the GCP credits and Together API access that we've given you. You're also welcome to try other things: Kaggle notebooks can actually give you better GPU access, but not all
the nice features of Colab, and some groups have started using Modal, which can also be a good way to get GPU access. Okay, that was the intro. So now I wanted to talk about convolutional neural networks for language. These slides are positioned a bit as convolutional neural networks versus RNNs, as opposed to versus Transformers. That's partly, you could say, because I haven't updated my slides enough, but in another sense that's partly because that's how the ideas of convolutional neural networks really were explored: it was in the days when most people were using recurrent neural networks for NLP that a few people set about saying, hey, maybe we should use convolutional neural networks for language as well. Whereas in truth, in the last five years, when Transformers have dominated, there hasn't been much use of convolutional neural networks for NLP. So if we
think back to our recurrent neural networks, if you remember, they gave a way of producing a representation for a sentence or part of a sentence, but they computed forward through the string, so you had to get a representation that included everything that came before. You didn't really have a representation of "the ceremony"; you had a representation of "man walked into the ceremony" that you could use. In contrast to that, convolutional neural networks basically say, kind of like an n-gram model, that we should be able to take n-grams of words, like 2-grams or 3-grams. So for the example "tentative deal reached to keep government open", we can take each trigram — "tentative deal reached", "deal reached to", "reached to keep", and so on — and we can make some neural
representation for each of those. Notice this is just being done for every n-gram for a certain n, so there's nothing linguistically or cognitively especially plausible here; we're just going to form representations of multi-word units, which we'll then group together in some further way later on, and the standard way of doing that is with convolutional neural networks. The classic case of convolutional neural networks is in vision: they were invented for vision, where they gave you a kind of translation-invariant model, so that you could recognize your kangaroo no matter where in the frame it was. And so this little picture here — I'll just do the lower half of the slide — is sort of what a convolutional neural network is doing in 2D. The convolution is like a mask that you're sliding over the image,
and the mask is defined by weights, which are the little things shown in red. For each place you slide your mask to, you're calculating a score by taking what's effectively a dot product of the mask terms with the elements in that patch, and that's then filling in the matrix on the right, shown in pink, and so that's calculating our convolved feature from the image. Does that make sense? Yeah. So what happens if we then want to do that for language? Well, for language we don't have a 2D picture, we've got a 1D picture: a sequence of words. So we can have "tentative deal reached to keep government open", and each of our words will have a word vector — I'm using four-dimensional vectors in my examples to keep it compact on my slide — and then we can apply a filter that applies to an n-gram. So this is going to be a filter for trigrams, and
so then we're going to slide that downwards in exactly the same way as in the vision case, except we're just sliding in one dimension. So I calculate the dot product of the filter and this trigram, and that gives me a value, minus one, if I did my arithmetic right. Then I slide it down to the next position, work it out, and get minus 0.5; slide it down, get the other values; and then typically I can add on a bias term — my bias is plus one in this example — and then I'll stick it through a nonlinearity like a sigmoid or something like that. So I'll be calculating a value for each of these trigrams, and that is a convolution for a single filter. Then commonly what I'm doing after that is deciding that I'm going to have more than one filter, and I'll show that in a minute.
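The single-filter computation just described — slide a trigram filter down the sentence, take a dot product with each 3×4 patch, add a bias, apply a sigmoid — can be sketched like this. The word vectors and filter weights below are made up, so the outputs won't match the slide's −1 and −0.5:

```python
import math

# Made-up 4-dimensional word vectors for the seven words of
# "tentative deal reached to keep government open" (illustrative
# values, not the slide's numbers).
words = [
    [0.2, 0.1, -0.3, 0.4],
    [0.5, 0.2, -0.3, -0.1],
    [-0.1, -0.3, -0.2, 0.4],
    [0.3, -0.3, 0.1, 0.1],
    [0.2, -0.3, 0.4, 0.2],
    [0.1, 0.2, -0.1, -0.1],
    [-0.4, -0.4, 0.2, 0.3],
]

# One trigram filter: 3 positions x 4 dimensions, plus a bias term.
filt = [[1.0, 0.0, -1.0, 0.5]] * 3
bias = 1.0

def conv1d_single_filter(xs, w, b):
    """Slide the filter down the sentence one trigram at a time: dot
    product of the filter with each 3x4 patch, add the bias, then pass
    the score through a sigmoid nonlinearity."""
    k = len(w)
    out = []
    for i in range(len(xs) - k + 1):  # 7 words -> 5 trigram positions
        score = sum(w[j][d] * xs[i + j][d]
                    for j in range(k) for d in range(len(xs[0])))
        out.append(1.0 / (1.0 + math.exp(-(score + b))))
    return out

values = conv1d_single_filter(words, filt, bias)  # 5 values in (0, 1)
```

With seven words and a width-3 filter, the loop visits five positions, exactly the shrinkage the lecture turns to next.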
[00:09:26] In this example, and in my vision example earlier, we sort of had shrinkage: we started off with seven words, but of course as we slid these trigrams over it, we only had space for five trigrams, and so we ended up with something smaller than our input sentence. Often people want to keep it the same size, and the way you can do that is by having padding. [00:09:49] So if I put a zero padding at each end, now I'm going to get seven trigrams coming out, corresponding to my original seven words, and normally I'll just pad it with zeros like that. You can actually increase the size of things, because if you add padding of two at each end you can then have a wide convolution, and so seven will then go to nine different things. [00:10:17] Okay, so if we only had one filter, things are pretty limiting.
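The arithmetic behind that shrinkage and padding can be checked with a one-line sketch (just the counting rule, nothing from the slides):

```python
def n_positions(n_words, kernel, pad):
    """Number of n-gram positions when a width-`kernel` filter slides over
    `n_words` words with `pad` zero-vectors added at each end."""
    return n_words + 2 * pad - kernel + 1

# 7-word sentence, trigram filter (kernel = 3):
print(n_positions(7, 3, 0))  # 5: shrinkage (narrow convolution)
print(n_positions(7, 3, 1))  # 7: same-size output, one pad at each end
print(n_positions(7, 3, 2))  # 9: wide convolution, two pads at each end
```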
[00:10:30] So commonly, as in the vision case, what we're going to do is define multiple filters, and then we're going to be calculating a value for each of these filters over each of these trigrams, and so then we're getting out a new representation as a vector. Depending on how many filters we have relative to the word dimensionality, we might end up with something that's shorter (as in this example), the same length, or actually longer than what our input was in terms of word vectors. [00:11:00] But commonly, when we do that, we then in some way want to summarize all of these filters, and the most common way of doing that is something called max pooling. Max pooling is something you see quite a bit in neural networks in general, and the way to think of max pooling that I think makes sense is that it does what you want if you really want to run something that's like a feature detector.
[00:11:40] So if you imagine that you learn these functions that will look at word vectors and look for evidence of something particular: maybe this filter looks for whether the person is using "I" language, so it matches the words I, my, we, our, something like that, and maybe this other filter matches speech or thinking verbs, like think, say, said, told, etc. [00:12:14] Each of these is some kind of feature of the text that you might want to detect. Well, if that's your model, then when you slide your feature detector down the piece of text, you want to know: does this match anywhere in this piece of text? Is it somewhere using an "I" word, regardless of whether it's in the first, second, third, or fourth position? [00:12:39] And that's effectively what you're getting out of max pooling.
[00:12:46] A feature counts as firing to the extent that it fires strongly in any position in the text. That's not the only way you can think of doing it. An alternative is to think of your feature detector as measuring some quality of the text, like casualness or learnedness, and then you might think: for overall wanting to know how casual the text is, maybe I want the average of how casual it is in different parts of the text. [00:13:19] So then you can do the alternative of average pooling, and sometimes people do that as well. You can do both: you can work out both an average pool and a max pool and put both of them into the feature representation. In general, for the kind of features people learn in neural networks, if you're just doing one or the other, the result does seem to be that max pooling is the most effective.
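A minimal sketch of the two pooling choices over one filter's column of position scores (the numbers are made up):

```python
# One filter's scores at each n-gram position of a text.
# Max pooling asks "did this feature fire strongly anywhere?";
# average pooling asks "how strong is this quality across the whole text?"
positions = [0.2, 0.9, 0.1, 0.3]

max_pooled = max(positions)
avg_pooled = sum(positions) / len(positions)

print(max_pooled)  # 0.9: the feature fired strongly somewhere
print(avg_pooled)  # 0.375: overall level across the text
```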
[00:13:42] That kind of does-the-feature-fire metaphor tends, in general, to be the best way of thinking about things. Okay, so if you want to do all of this in PyTorch: Conv1d, right. I guess one-dimensional convolutions aren't the most common case, and so you're using Conv1d, and there are all these things you can then specify: the output channels is the number of filters you have, the kernel size is how big the filter is, which for my example was three, and then you can just collapse things with the max pooling. [00:14:23] Okay, there's a space of other things you can also do with convolutional neural networks, which I think are less useful and less used in language cases, but I can mention them quickly. One thing you can do is have a stride.
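What `torch.nn.Conv1d` computes here can be mimicked in plain Python to show the roles of those parameters: `out_channels` is the number of filters, `kernel_size` is the n-gram width. This is a sketch with toy numbers, no padding, stride 1:

```python
def conv1d(seq, filters, biases):
    """Plain-Python version of what torch.nn.Conv1d(in_channels=d,
    out_channels=F, kernel_size=k) computes (no padding, stride 1).
    seq: list of d-dim word vectors; filters: F filters, each k x d."""
    d, k = len(seq[0]), len(filters[0])
    out = []
    for f, b in zip(filters, biases):
        out.append([b + sum(f[j][m] * seq[i + j][m]
                            for j in range(k) for m in range(d))
                    for i in range(len(seq) - k + 1)])
    return out  # shape: (out_channels, number of n-gram positions)

# toy 4-word sentence with 2-dim vectors and two trigram filters
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
feats = conv1d(seq, filters, biases=[0.0, 0.0])
pooled = [max(row) for row in feats]  # max-over-time: one number per filter
print(pooled)  # [2.0, 2.0]
```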
[00:14:53] When we took every trigram ("zero tentative deal", then "tentative deal reached", then "deal reached to"), you could feel like they're overlapping each other a lot, so they've actually got very similar stuff in them, and that would be even more so if we weren't using trigrams but something like 5-grams. So something you can do: the stride is how much you move along. [00:15:11] If you move along two, you'd have one trigram that's "padding tentative deal", then the next one would be "deal reached to", and then the next one would be "to keep government", so that they're overlapping by less as you go through. [00:15:30] Another thing you can do that's sort of stride-like is, rather than doing max pooling over the entire thing, you could do more of a local max pool. You could think: well, I want to have this feature detector for something like use of "I" language.
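The stride idea from a moment ago can be sketched like this, using the lecture's sentence and a `<pad>` token standing in for the zero padding at each end:

```python
def trigram_windows(words, stride):
    """Trigrams covered when sliding a width-3 window with the given
    stride over a sentence padded with one <pad> token at each end."""
    padded = ["<pad>"] + words + ["<pad>"]
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded) - 2, stride)]

words = "tentative deal reached to keep government open".split()
print(len(trigram_windows(words, stride=1)))  # 7 heavily overlapping trigrams
print(trigram_windows(words, stride=2))       # every other start position
```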
[00:15:56] But if it's a big long sentence and there's "I" language at four different points, maybe you should get four points for that, rather than just the one point you're going to get from max pooling. So you could do local max pooling, sensitive to the stride: here I could look at the first two of these and max-pool those two, then the next two and max-pool those, and so on, and you'd end up with this sort of local max pooling as you go along. [00:16:34] Okay, and then one other idea that's sort of related: another way of capturing whether something matches in multiple places is, rather than only keeping the one max in each column, maybe you could do a k-max. So you could keep the two maximum things in a column.
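Both variants can be sketched in a few lines; the scores are made-up numbers for one filter's column:

```python
def local_max_pool(scores, width):
    """Max-pool within consecutive windows of `width` positions,
    instead of one max over the whole sequence."""
    return [max(scores[i:i + width]) for i in range(0, len(scores), width)]

def k_max_pool(scores, k):
    """Keep the k largest values of the column (order not preserved here)."""
    return sorted(scores, reverse=True)[:k]

scores = [0.1, 0.8, 0.3, 0.2, 0.9, 0.4]
print(local_max_pool(scores, 2))  # [0.8, 0.3, 0.9]
print(k_max_pool(scores, 2))      # [0.9, 0.8]
```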
[00:16:58] That might also be a way of seeing whether something is detected in two places or not. Okay, I've got lots of notions here. [00:17:12] Dilation is then the notion that what we'd like to do is form our trigrams not only from adjacent things but from things that are spaced out. So after having done our first layer of convolutional filters that took trigrams, which got us to the top-right part here, we could then do a dilated trigram convolution, which means we're going to take the first, third, and fifth things and combine them in a convolutional filter, and then we'll take the second, fourth, and sixth things and combine them in a convolutional filter. [00:17:58] So we've then got a trigram filter, but it has a bigger range that it can see. That's sometimes used, more commonly in places like speech than in natural language.
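Which positions a dilated trigram filter combines can be sketched with simple index arithmetic:

```python
def dilated_trigram_indices(seq_len, dilation):
    """Index triples a trigram filter combines at each position:
    dilation 1 is the ordinary adjacent trigram; dilation 2 takes the
    first/third/fifth items, then the second/fourth/sixth, and so on."""
    span = 2 * dilation + 1  # distance covered by the dilated trigram
    return [(i, i + dilation, i + 2 * dilation)
            for i in range(seq_len - span + 1)]

print(dilated_trigram_indices(6, 1))  # ordinary trigrams over 6 positions
print(dilated_trigram_indices(6, 2))  # [(0, 2, 4), (1, 3, 5)]
```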
[00:18:19] Okay, so those are the kind of tools we have for calculating things with these convolutions over text, and so next what I want to do is tell you about a couple of pieces of work that made use of convolutions in natural language processing. I guess this is a decade old now, because this is from 2014. [00:18:35] This is the single most famous piece of work that made use of convolutional neural networks for natural language processing, and Yoon Kim is now an assistant professor at MIT. In retrospect it's actually pretty simple, but I guess he got in early with the idea of: okay, maybe we could use convolutions for NLP, and did a kind of clear example of that that worked pretty well, and so this piece of work is very well known. [00:19:16] So this was writing a sentiment classifier: looking at a sentence and deciding whether it's positive or negative.
[00:19:25] And actually, for both of the kinds of models I'm going to talk about today, we're going to use examples that are doing sentiment classification. He also considered other tasks (subjective versus objective language, question classification as to what the questions were about), but the main application was sentiment analysis. [00:19:45] So here's what you're going to be doing. The paper shows things more in his notation, but it's exactly the same as we've just been talking about: you're taking n-grams of word vectors, you're multiplying them by a convolution and calculating new vectors, and in his model it's done for different sizes of n-gram. So he's going to have some convolutional filters that look at bigrams, some at trigrams, and some that look at 4-grams, and then those are just slid across the positions in the sentence.
[00:20:32] Then, having done that, it does max pooling, as we've been talking about, which gives a single number coming out of each filter, and those max-pooled numbers from each filter are then going to be used for classification in a final simple softmax layer that gives the full answers. [00:20:55] There's one other thing that came up in this paper, which is kind of just an interesting general idea to be aware of, and it was something he sort of pioneered, which is the following. It's a very common case (I guess this occurs less with huge pre-trained Transformers, but it held for the classic case of models where you had word vectors and then you were training some neural network model on some supervised data) that there was this pitfall of what happened when you fine-tuned word vectors.
[00:21:45] So the setting is: we've started off with our pre-trained word vectors from GloVe or word2vec or whatever it is, and then we've got a smaller sentiment analysis dataset on which we're going to train a sentiment classifier, and that will involve not only learning the parameters of our sentiment classifier, but also we can backprop into the word vector representations. [00:22:13] And if you do that, it seems like it should be a good idea, because normal word vectors aren't especially tuned to predicting sentiment correctly; they're more tuned to the meaning of words, to what words are about. So it seems like it should help you if you could backprop into the word vectors and change them as you go along. But if you do that, there tends to be a problem.
[00:23:00] The problem is that some words will be in your sentiment training dataset, and when you learn with backprop, those word vectors will move; but some words just won't be in your training data, and they're going to stay exactly where they were in the word vectors, because there's nothing to move them around. [00:23:16] So what tends to happen is: you started off like this, where tedious, dull, and plodding were all close by each other, as having similar meanings and being indicators of something negative. But after you've done your training, tedious and dull have moved over here, as part of backprop, where they're part of negative land, and the classification boundary has moved over here; but plodding wasn't in the training set, so it's just sitting exactly where it was at the start of the process, and now it's being treated as a positive word, which is completely wrong.
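A toy illustration of that pitfall, with made-up one-dimensional "vectors": only words that occur in the supervised training data receive gradient updates, while the absent word keeps its pre-trained position.

```python
# Made-up 1-D stand-ins for pre-trained word vectors.
vectors = {"tedious": [-1.0], "dull": [-1.1], "plodding": [-1.05]}
train_words = {"tedious", "dull"}  # "plodding" is absent from training data

for word in train_words:
    vectors[word][0] += -0.5  # stand-in for a backprop update

# The trained words moved; "plodding" is exactly where pre-training left it.
print(vectors["tedious"][0], vectors["plodding"][0])  # -1.5 -1.05
```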
[00:24:01] And so that tended to have the result that when people trained a language neural network on a small supervised dataset, you got kind of ambivalent results: sometimes doing backprop into the word vectors would help, because you could specialize your word vectors to your task, but sometimes it would hurt you, because you messed up the semantic relations that were captured reasonably well in the initial word vectors. [00:24:29] So the way that Yoon Kim dealt with that was fairly simple: he just doubled his number of channels. He made two copies of each channel, each filter, in his convolutional neural network, and one of them used the fine-tuned word vectors while the other kept the original word vectors, and then he could have the best of both worlds. [00:24:59] Okay, so this picture captures the whole of his network.
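The network just described (n-gram filters of several widths, max-over-time pooling, concatenation into one fixed-size sentence vector, then a softmax classifier) can be sketched in plain Python. All vectors and weights below are made-up illustrative numbers, and the two-channel static/fine-tuned trick is omitted for brevity:

```python
import math

def ngram_scores(vecs, filt):
    """Convolve one k x d n-gram filter over the sentence (no padding)."""
    k, d = len(filt), len(vecs[0])
    return [sum(filt[j][m] * vecs[i + j][m] for j in range(k) for m in range(d))
            for i in range(len(vecs) - k + 1)]

def sentence_features(vecs, filters):
    """Max-pool each filter over time and concatenate: one fixed-size
    feature vector per sentence, regardless of sentence length."""
    return [max(ngram_scores(vecs, f)) for f in filters]

def softmax(z):
    e = [math.exp(v) for v in z]
    return [v / sum(e) for v in e]

# "I like this movie very much" as made-up 2-dimensional word vectors
sent = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.3], [0.5, 0.8], [0.2, 0.6], [0.7, 0.1]]
filters = [[[1.0, 0.0]] * 2,   # a bigram filter
           [[0.0, 1.0]] * 3,   # a trigram filter
           [[0.5, 0.5]] * 4]   # a 4-gram filter
feats = sentence_features(sent, filters)      # one feature per filter
W = [[1.0, 0.0, 0.5], [-1.0, 0.0, -0.5]]      # made-up linear classifier
probs = softmax([sum(w * f for w, f in zip(row, feats)) for row in W])
print(len(feats), len(probs))  # 3 2
```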
[00:25:10] This picture actually comes from a follow-on paper, which produced this nice version of it. So we start off with a sentence, "I like this movie very much", which should be classified positive. We have words and their word vectors, and then you're going to have convolutional filters that are bigram filters, trigram filters, and 4-gram filters, and at each of those sizes you're going to have ones that work on the un-fine-tuned word vectors and ones that work on the fine-tuned word vectors. [00:25:46] And so you're going to take these filters and slide them over the text and get representations. The way he's doing this, the filters are applied without padding, so from the 4-gram filters you're getting smaller vectors coming out, and from the bigram filters you've got bigger vectors coming out. And so then for each of these you're going to max-pool.
[00:26:16] So you're just getting the highest value from each, and then you're getting the highest value from the ones with the fine-tuned word vectors and the ones without, so you're getting one feature out of each filter. You're then concatenating all of those max-pooled outputs together, giving one vector for the entire sentence, which is of fixed size reflecting the number of filters, and then you're just sticking this through a straightforward linear classifier into a softmax that gives you a probability of positive or negative. And that was the entire model. [00:26:58] The interesting thing was that this actually worked pretty well for natural language classification tasks. So this is a big table of results from his paper: there are sentiment datasets like the Stanford Sentiment Treebank (two versions of that) and movie reviews, another sentiment dataset.
[00:27:25] There's a subjectivity classifier, and TREC was the kind of question-type classifier. So, various datasets. [00:27:37] And various people, including us at Stanford (I guess all of these Socher results were ones we were doing at Stanford), had built lots of models on various of these datasets, and his argument was that by using this simple convolutional neural network you could do as well as, and sometimes better than, any of these other models that were being considered at the time for sentiment analysis. [00:28:09] Now, there was at least one way in which maybe that comparison was too generous to the CNN: if you remember back when we were doing dropout, we said dropout is such a good idea. I think dropout came out in 2012, if I'm remembering correctly.
[00:28:39] before dropout appeared on the scene, whereas he was using dropout, and that gave him an advantage. Sort of better experimental technique might have been to redo the other models with dropout, which he didn't. But nevertheless, it sort of shows that you could get strong results using convolutional neural networks with just a very simple architecture. Yeah — so that's one more thing that you can do. And so the thing to think about here is, you know, we have this sort of toolkit of ways that you can do things. We started off with word vectors and bags of vectors, which you could use for simple classification. We talked early on about window models — and window models are sort of like what you get from convolutional neural networks, but more ad hoc. Then we have convolutional neural networks, which are definitely good for classification and very easy to parallelize.
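The convolve → max-pool-over-time → concatenate → softmax pipeline described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation; all names and sizes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_feature(X, W, b):
    """Slide one filter W (width k, dim d) over word vectors X (n, d);
    one ReLU activation per window position."""
    k, d = W.shape
    n = X.shape[0]
    acts = np.array([np.sum(X[i:i + k] * W) + b for i in range(n - k + 1)])
    return np.maximum(acts, 0.0)

def sentence_features(X, filters):
    """Max-pool over time for each filter, then concatenate:
    one fixed-size vector regardless of sentence length."""
    return np.array([conv1d_feature(X, W, b).max() for W, b in filters])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d = 8                               # toy word-vector dimension
X = rng.normal(size=(10, d))        # a 10-word "sentence"
filters = [(rng.normal(size=(k, d)), 0.0) for k in (3, 4, 5) for _ in range(2)]

feats = sentence_features(X, filters)    # length == number of filters
U = rng.normal(size=(2, len(filters)))   # linear classifier into the softmax
probs = softmax(U @ feats)               # probability of positive / negative
print(feats.shape, probs.shape)
```

The key property is that `feats` has one entry per filter, so the classifier input size is fixed by the number of filters, not by the sentence length.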
[00:29:41] Which is good. And we talked about recurrent neural networks, which seem to be cognitively plausible — reading through sentences from left to right — but aren't easy to parallelize. And then we've talked about Transformers, which to some extent are our best model for NLP, you know, and are being used everywhere. And indeed, what's happening now is that things are going in reverse, and people are increasingly using Transformers for vision as well — though there's still, I think, more debate in the vision world between CNNs and Transformers, with some people arguing that both of them have complementary advantages. Okay — a couple of other facts on the side, and then I'll show you one other, bigger, fancier convolutional neural network model for language. So, we talked about, for Transformer models, the use of layer normalization, which sort of keeps the
[00:30:43] size of the numbers in the middle layers of the neural network about the same, by giving zero mean and unit variance. There are slightly different ways that you can do that; for convolutional neural networks, the standard thing to use is batch normalization — and indeed, batch normalization was the thing that was invented first. Layer normalization and batch normalization are sort of doing the same thing, of scaling numbers to give them zero mean and unit variance, but they differ in what dimensions they do their calculations over: layer norm is calculating statistics across the feature dimension, whereas batch norm is normalizing all the elements in the batch for each feature independently. Okay, one other little concept that turns up, which actually sort of connects a bit to Transformers as well.
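The axis difference just described can be made concrete with a small numpy sketch, assuming activations of shape [batch, features] (the learned gain and bias that real implementations add are omitted here):

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    # statistics across the feature dimension, separately for each example
    mu = H.mean(axis=1, keepdims=True)
    var = H.var(axis=1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def batch_norm(H, eps=1e-5):
    # statistics across the batch, for each feature independently
    mu = H.mean(axis=0, keepdims=True)
    var = H.var(axis=0, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

H = np.random.default_rng(1).normal(size=(4, 6))  # [batch, features]
ln, bn = layer_norm(H), batch_norm(H)
# each row of ln has ~zero mean; each column of bn has ~zero mean
print(ln.mean(axis=1), bn.mean(axis=0))
```

Same arithmetic, different `axis` — that is the whole distinction being drawn here.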
[00:31:53] There's this sort of funny thing: all of what I've presented so far were convolutions that are, um, bigram, trigram, four-gram — but there are also size-one convolutions. And at first sight that seems to make no sense at all, because what's the point of doing a size-one convolution? You've just got one thing, and it's staying just one thing. But it actually does make sense, because it corresponds to having a little fully connected layer that's only looking at the representation in one position. So in language terms, it's taking a word vector and putting it through a fully connected neural network to produce a new representation just of that word. And that's sort of what we also have with the fully connected layers in Transformers, right — you've got a fully connected layer that's just at one, well, subword-token position, and calculates a
[00:32:58] new representation for it. And so that allows you to sort of create new representations with actually many fewer parameters than if you're allowing a fully connected layer across the entire sentence. Okay — and so this is then a more recent version of a convolutional neural network, still again used for text classification, but a much more complex one, from Conneau et al. in 2017. And again, this was still at the stage in which LSTM sequence models were dominant in NLP — I guess 2017 is sort of the same year the first Transformer paper came out — and, you know, the motivations were sort of comparing vision and language. At that point in time, convolutional neural network models in vision were already very deep models — people were using things like ResNet models that had 30, 50, 100 layers in them — and that stood in stark contrast to what was happening in the LSTM world.
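The size-one convolution described a moment ago — the same little fully connected layer applied at every position — and its parameter saving can be sketched as follows; the shapes are illustrative, not from any particular model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_in, d_out = 7, 16, 4
X = rng.normal(size=(n, d_in))      # one vector per (subword-token) position
W = rng.normal(size=(d_in, d_out))  # the size-1 "filter bank"
b = np.zeros(d_out)

# size-1 convolution: identical fully connected layer at every position
Y = np.maximum(X @ W + b, 0.0)      # shape (n, d_out)

# parameter count vs. one fully connected layer over the whole flattened sentence
params_conv1 = W.size + b.size           # 16*4 + 4 = 68
params_full = (n * d_in) * (n * d_out)   # 112 * 28 = 3136
print(Y.shape, params_conv1, params_full)
```

Because the weights are shared across positions, the size-one convolution's parameter count is independent of the sentence length, which is the "many fewer parameters" point made above.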
[00:34:18] For sequence models, commonly people were just using two-layer sequence models, and if you were wanting to go further you might be using a three-layer or four-layer sequence model — or occasionally, if you got really, really deep, people had used eight-layer sequence models if they had a lot of data. But essentially, you know, the number of layers was always in a single digit. And then a second thing was, in some sense the vision models were more raw-signal models, because they were operating at the individual pixel level, whereas in NLP the standard was that we were using word-level models — still, in the Transformer model — so it sort of seemed like things were much more grouped before they began. And so the idea of this paper is: well, maybe we could do NLP kind of like it was vision, so
[00:35:23] we'll start with the raw characters as our signal, we're going to put them into a deeper convolutional neural network and use the same kind of architecture we use for vision, and use that for language classification tasks. And so that led to this VDCNN architecture, which is something that looks very like a vision system in design. So what do we have here? At the bottom we have individual characters, and the individual characters get a 16-dimensional representation. Then you've got some fixed size of piece of text that you're classifying, which for them was 1,024. And then at each stage we're going to have convolutional blocks — and these convolutional blocks have a whole bunch of filters, but they're also then going to group stuff together, so that we're sort of starting to collapse into multi-character units. So we're starting
[00:36:46] off, first of all, having, you know, 64 size-three convolutional filters, and so that gives us a representation of 64 times the window size. Then we're going to do that again, and put it through another set of convolutional filters — of size three, and 64 of them — which gets us sort of up to here. And at each point we also have residual connections — which we also saw in Transformers, but were pioneered in the vision space — so that we have a path that things can just go straight through. But then when we get to here, we're going to do local pooling, so each pair of representations here will be pooled together, and at that point we've no longer got the initial length of 1,024 — we've now got a length of 512. So now we're going to be putting it through, again, sort of trigram convolutions, but now we're going to
[00:38:01] have 128 of those channels. We're going to repeat that again, and then we're going to again group with pooling, so now we've got a 256-long sequence, because we've done local pooling of each pair, and we're going to then have 256 filters at each stage. And we go up, and then we do local pooling again, so each unit is now representing an 8-gram of characters, and we're putting trigram filters over those 8-grams — so really, the amount of a sentence that the convolutional filters are seeing at this point is 24 characters; you know, sort of seeing something like six-word sequences or something like that. More convolutional blocks there. Then at the end they do this k-max pooling — so some of the ideas from the beginning of the lecture do show up — so you're then doing k-max pooling and finding the eight highest activations in the sequence.
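The k-max pooling step just mentioned — keep the k largest activations of a channel, rather than only the single maximum — can be sketched like this. A detail worth noting: in the original k-max pooling proposal the kept activations stay in their left-to-right order, which this sketch also does (k = 8 in the architecture above; a smaller toy k here).

```python
import numpy as np

def k_max_pool(acts, k):
    """Keep the k highest activations of a 1-D sequence,
    preserving their original left-to-right order."""
    acts = np.asarray(acts)
    if len(acts) <= k:
        return acts
    idx = np.sort(np.argpartition(acts, -k)[-k:])  # positions of the top k
    return acts[idx]

acts = np.array([0.1, 2.0, -1.0, 3.5, 0.7, 2.2, 0.0])
# the 3 largest activations, in original order: 2.0, 3.5, 2.2
print(k_max_pool(acts, 3))
```

Compared with plain max pooling, this keeps a count of how often (and roughly where) a feature fired, not just whether it fired at all.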
[00:39:12] And that sort of makes sense for something like a text classifier, because you want to count up the amount of evidence, right? If you've got some category like — is this about, I don't know, copper mining — you want to be seeing whether there are a bunch of places in the text that are talking about copper mining. And then, right up the top, they have several fully connected layers — which, again, is very typical of what you find in vision networks, such as something like VGGNet: after you've done a whole bunch of convolutional layers, you just stick it through multiple fully connected layers at the top. And so that's what they're doing as well, and this is their architecture for doing text classification. Okay, I think I talked through that in a lot of detail, so I'll skip this slide. Yeah — so their experiments were done on text classification datasets: various news classification
[00:40:20] datasets, the DBpedia ontology, and then doing sentiment analysis on Yelp reviews and Amazon reviews. And here are the results from their paper. So, you know, they're taking the previously known best published results — which are shown here in table four — and then they're considering whether they can do better by using their architecture. And they used architectures of different depths, in terms of the number of layers: nine layers, 17 and 29 layers. And the result of the paper is, in all cases they got the best results with their deepest network, which was a 29-layer model — which is sort of similar to what people were doing in vision. And then there's some variation as to which was best, using the max pooling or the k-max pooling, but in general it was always the deep model, and it varied a bit according to the dataset.
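The length-and-channel schedule walked through above — each local pooling halves the sequence while the channel count doubles, and the receptive field grows — works out as a bit of arithmetic. The numbers are the ones mentioned in the lecture; treat the sketch as illustrative.

```python
# character sequence of length 1024, 64 channels to start;
# each local pooling halves the length while the channel count doubles
length, channels = 1024, 64
schedule = [(length, channels)]
for _ in range(2):                 # the two poolings described above
    length //= 2
    channels *= 2
    schedule.append((length, channels))
print(schedule)        # [(1024, 64), (512, 128), (256, 256)]

# after a third pooling each unit covers an 8-gram of characters,
# so a size-3 (trigram) filter over 8-grams sees 3 * 8 = 24 characters
receptive_chars = 3 * 8
print(receptive_chars)  # 24
```

This halve-the-resolution, double-the-channels pattern is exactly the convention VDCNN borrows from vision architectures like VGGNet and ResNet.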
[00:41:29] But at least sometimes they were able to produce the best results that were known. So I guess for these text classification ones, previous results were slightly better than their results; but for some of the other ones, like the DBpedia and the Yelp — well, for both the Yelp datasets their results were better than the best known previous results, and for the Amazon ones, one was better, one was worse. But to a first approximation, this meant that they could basically reach the state of the art of a text classification system with something that was just a deep convolutional neural network, starting from the character level, with none of the sort of having learned word vectors in advance or anything like that. And so that was a pretty cool achievement, which showed that you could go a fair way in doing things with just this sort of raw
[00:42:29] character-level convolutional neural network — sort of more like a vision system. Okay, so that's that. And then, for the final piece of the class, I want to tell you about something at the other extreme, which is about tree recursive neural networks. So tree recursive neural networks are a framework that me and students developed at Stanford. I mean, really, when I first got into neural networks in 2010, for about the first five years what me and students worked on was doing these tree recursive neural networks — and so they were sort of the Stanford brand. Ultimately, they didn't prove as successful as other things that came along, but I think they're linguistically interesting, and I think there's a clear idea here which is still an idea that exists — and I think there may be still some things to do with it, which I'll come back to. But the starting point is
[00:43:38] essentially being motivated by the structure of human language. And so most of this slide is sort of filled by a paper from Noam Chomsky and colleagues, discussing their views of the human faculty of language — what it is, who has it, and how did it evolve. And I don't want to dwell on this in too much detail, but essentially, in this paper what they argue is that, you know, the defining property of human language — that's not observed in other things that humans do — is that language has this recursive structure: you have this hierarchical nesting, where the same structure repeats inside itself. So if you have an example like "the person standing next to the man from the company that purchased the firm that you used to work at", what you have is: the whole of this is a noun phrase, headed by "the person", and
[00:44:48] then, after "standing next to", the first square brackets here are another noun phrase, "the man from..."; then inside that prepositional phrase there's another noun phrase, "the company that purchased the firm"; and then "the firm" is another noun phrase that has the relative-clause modifier "that you used to work at". So we have these embedded layers of noun phrases, with the same syntactic structure underneath them. And for the kind of formalisms that we use in linguistics — context-free grammar — it permits this kind of infinite embedding of nesting, which is the same kind of nesting that you get in programming languages, where you can use if statements and nest them as deeply as you want to, because you just have the same repeating recursive structure. Now, of course, human beings can't actually understand infinite recursion, and people don't actually
[00:45:48] produce infinite recursion — you could sort of say, oh, in practice no one's going to go more than eight deep when they're saying a sentence — but in terms of the structure of what the language looks like, it seems like you should be able to do it infinitely deep. And when you actually start looking at the structures of sentences, they do sort of repeat the same structure quite deeply. So this is an example of a Penn Treebank tree — which is sort of the best-known constituency treebank — and here's my random sentence: "analysts said Mr. Stronach wants to resume a more influential role in running the company". And, well, what we end up with, if we have these nested verb phrases: so "running the company" is a verb phrase; "resume a more influential role in running the company" is a bigger verb phrase; "wants to resume a bigger
[00:46:52] role in running the company" is an even bigger verb phrase; and then "said Mr. Stronach wants to resume a more influential role in running the company" is an even bigger verb phrase. So we have, sort of, one, two, three, four verb phrases, all nested inside each other. And so the idea was: well, maybe we should be thinking of sentences as having this kind of tree structure, and computing representations of the meanings of sentences in terms of this tree structure. So we have words that have representations in word vector space, like we saw right at the beginning of the class; but then we're going to have a phrase like "the country of my birth", and the classic linguistic answer — that you find both in linguistic semantics classes and philosophy of language — is that we should construct representations of phrases using the principle of compositionality, which says that the meaning of a phrase or
sentence is determined by the meanings of its words, which are our word vectors, and the rules that combine them. [00:48:07] So maybe we could take the phrase structure tree of a sentence and combine the word vectors together by some means, and then we can construct a representation of the meaning of phrases in a more linguistic way, giving us a vector representation of the meaning of the phrase, which we could also put into our vector space. And we'd hope that a phrase like 'the country of my birth' would appear in the vector space in a similar place to where words representing locations appear. [00:48:40] Okay, so what we want is to be able to start with word vectors and parse up a sentence, and as we parse the sentence we're going to be computing representations for the different phrases of the sentence.
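As a toy illustration of that hope (made-up 3-d vectors and simple averaging as the 'means of combination', purely for illustration, not the model the lecture goes on to describe): a composed vector for a phrase like 'country of birth' should land nearer location-like words than unrelated ones.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-d word vectors, invented for illustration (not real embeddings).
vecs = {
    "country": np.array([0.9, 0.1, 0.0]),
    "birth":   np.array([0.4, 0.6, 0.1]),
    "france":  np.array([0.8, 0.3, 0.0]),
    "walking": np.array([0.0, 0.2, 0.9]),
}

# Simplest possible rule of combination: average the word vectors.
phrase = (vecs["country"] + vecs["birth"]) / 2

# The composed phrase vector lands closer to a location word
# than to an unrelated word in this toy space.
assert cosine(phrase, vecs["france"]) > cosine(phrase, vecs["walking"])
```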
[00:49:03] And so, the difference here; you know, the difference between recursive and recurrent is sort of a fake difference, right, they both come from the same 'recur' root. But rather than having the recursion just happening along a sequence, as in a recurrent neural network, we're going to have the recursion happening up a tree structure, so we can compute representations for linguistically meaningful phrases. [00:49:35] And so what we're going to do with that: the easy case is, if we know the phrase structure tree, we can take the representations of the child nodes and put them into a neural network, which gives us the representation of the parent node. But we'd also like to find the tree structure, and a way we could do that is to get a second thing out of the neural network: a score for how plausible something is as a constituent. Does it make sense to combine these two nodes
together to form a larger constituent? And then we can use that in a parser. [00:50:23] So, formally, the very simplest kind of tree recursive neural network, and the first one we explored: when we have two child vectors, we represent the parent vector by concatenating the two children, multiplying them by a matrix, adding a bias, and putting it through a nonlinearity to get a parent representation p. Then we score whether it's a good constituent by taking another vector of learned parameters, which does a dot product with p, and that gives us a score as to whether this was a good constituent to include in your parse tree. And the same W parameters are used at all nodes of the tree, in the same way that a recurrent neural network keeps using the same parameters.
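In code, that composition-and-scoring step might look like this (a minimal NumPy sketch with a made-up toy dimension and untrained random parameters; in the real model W, b, and u are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension (assumption for illustration)

# Learned parameters, shared across every node of the tree.
W = rng.standard_normal((d, 2 * d)) * 0.1  # composition matrix
b = np.zeros(d)                            # bias
u = rng.standard_normal(d)                 # scoring vector

def compose(c1, c2):
    """Parent vector: nonlinearity of W times the concatenated children, plus bias."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def score(p):
    """Plausibility of p as a constituent: dot product with the learned vector u."""
    return u @ p

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
p = compose(c1, c2)
assert p.shape == (d,) and np.all(np.abs(p) <= 1.0)  # tanh keeps values in [-1, 1]
```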
[00:51:22] Okay, so if we had that, we could build a greedy parser, because what we could do is start with all the word vectors, take every pair of adjacent words, put it through this system, calculate what the representation of that pair would be as a constituent, and then get a score as to whether it seemed a good constituent or not. Then we could greedily decide: this is the best constituent, 'the cat'. Since we're doing a greedy parse, we commit to that, and then, well, we still know the possibilities of combining other pairs of words, and we can additionally score how good 'the cat' combined with 'sat' is, so that we're producing binary parse structures. Now the best pair to combine greedily is 'the mat', so we combine those and commit to that; we score combining 'on' with 'the mat', and now that seems the best thing, so we commit to that; and we just keep going on up, and we produce the binary parse of the sentence.
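The greedy procedure just described can be sketched as follows (untrained random parameters, so the resulting tree is arbitrary; with learned parameters the best-scoring merges would track real constituents):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                      # toy embedding dimension
W = rng.standard_normal((d, 2 * d)) * 0.1  # shared composition matrix
u = rng.standard_normal(d)                 # shared scoring vector

def compose(c1, c2):
    # Parent representation from the two child vectors.
    return np.tanh(W @ np.concatenate([c1, c2]))

def greedy_parse(words, vecs):
    """Repeatedly merge the adjacent pair whose composed vector scores highest."""
    spans = list(zip(words, vecs))  # (subtree, vector) pairs
    while len(spans) > 1:
        # Score every adjacent pair as a candidate constituent.
        candidates = []
        for i in range(len(spans) - 1):
            p = compose(spans[i][1], spans[i + 1][1])
            candidates.append((float(u @ p), i, p))
        _, i, p = max(candidates)  # commit to the best-scoring merge
        spans[i:i + 2] = [((spans[i][0], spans[i + 1][0]), p)]
    return spans[0][0]  # nested tuples form the binary parse tree

words = ["the", "cat", "sat", "on", "the", "mat"]
vecs = [rng.standard_normal(d) for _ in words]  # one random vector per position
tree = greedy_parse(words, vecs)
```

Each merge both builds a parent vector and consumes a score, mirroring the two outputs of the network described above.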
[00:52:40] And this gives us our sentence representation, which is like that. Okay, and so that gives us our simple tree RNN. [00:52:46] Back in 2011 we got some pretty decent results showing that you could use this as a sentence parser that worked pretty well, but beyond that, the representations we calculated for sentences and phrases were good enough that you could use them for tasks like sentence classification and sentiment analysis, and it worked reasonably well. [00:53:21] It only worked reasonably well, because if you start thinking about it further, there are sort of strong limitations to having this single W matrix that's used at all points to combine things: if you have that architecture, you can't have different forms of interaction between the different words, you're just uniformly computing things, and that
sort of stands in distinction to the fact that different kinds of things in natural language seem kind of different. You have different properties with verbs and their objects versus an adjective modifying a noun, just in terms of what the roles of the different words are. [00:54:06] So we started to see limitations of this architecture, and in the following years we started exploring other ways to build tree recursive neural networks which had more flexibility in how things were combined. I'm not going to show you all the details of all of that, but I will show you one more model that we used for building tree recursive neural networks, which was used in some of our sentiment analysis work, called the recursive neural tensor network. [00:54:44] It wasn't actually the final version that we did; after that we started taking LSTM ideas
and extending those to the tree-structured case, and we worked on tree LSTMs, but I'm not going to show that this year. [00:55:00] The idea of recursive neural tensor networks is that when pairs of words or phrases combine together, in linguistic semantics terms, depending on the pair of words, they modify each other in different ways. So if you have an adjective and a noun, like 'a red ball', red is giving attributes of the noun, whereas if you have something like a verb and its object, like 'kick the ball', you've got a very different role for the object on the right-hand side versus 'the red ball'; it's sort of the opposite way around. [00:55:38] So we want more flexibility in the way we calculate meanings of phrases depending on what's in them, and the way we came up with doing that is what we call this neural tensor layer. And so the idea in
the neural tensor layer is that we have the representations of the child words or phrases, and rather than directly concatenating them and putting them through a linear transformation like a regular neural network layer, instead we could learn in-between matrices; and if we put several of those together, we're getting a three-dimensional tensor. [00:56:34] We can multiply a vector by a tensor times a vector, getting a value out for each slice of the tensor, so we end up with multiple such values. [00:56:57] Okay, and the place that we applied this model is the task of sentiment analysis, so let me tell you a little bit more about what we did here, and this is in fact going backwards to the Stanford Sentiment Treebank that was already used in the Yoon Kim work.
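The vector-times-tensor-times-vector composition described above can be sketched like so (toy dimensions and untrained random parameters; the einsum computes the bilinear term a^T V[k] a for each output unit k, added to an ordinary W a + b term):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3            # toy dimension (assumption for illustration)
a_dim = 2 * d    # the two children stacked

# Standard composition parameters plus a third-order tensor V:
# d slices, each a (2d x 2d) matrix, giving one bilinear term per output unit.
W = rng.standard_normal((d, a_dim)) * 0.1
b = np.zeros(d)
V = rng.standard_normal((d, a_dim, a_dim)) * 0.1

def rntn_compose(c1, c2):
    a = np.concatenate([c1, c2])                 # stacked children, shape (2d,)
    bilinear = np.einsum("i,kij,j->k", a, V, a)  # a^T V[k] a for each slice k
    return np.tanh(bilinear + W @ a + b)

p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))
assert p.shape == (d,)
```

The bilinear term is what lets one child's vector directly modulate how the other child is transformed, rather than both being combined only additively.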
[00:57:15] So the goal of sentiment analysis is to see whether a piece of text is positive, negative, or neutral. A lot of the time, doing sentiment analysis is pretty easy. In the 2010s, and probably even today, quite a few people's sentiment analysis systems are essentially just keyword matching: if you see 'great', 'marvelous', 'wonderful', that's positive sentiment; if you see something like 'poor' or 'bad', negative sentiment. So lots of the time you can effectively do a kind of dictionary matching and get pretty good sentiment, especially on longer documents. [00:57:58] But on the other hand, people use language in lots of interesting ways, and it's not always that easy. If you look at something like movie reviews, such as the snippets you get on Rotten Tomatoes, you get snippets like this: 'with this cast and this subject matter the movie should have been funnier and more
entertaining'. If you just think of it as, okay, we're doing dictionary matching, there's the word 'entertaining', that's definitely positive, and 'funnier', that's positive; so there are two positive words, so this should be a positive review. But of course it's not a positive review, this is a negative review, because it's saying, well, I'm just reading it out again: with this cast and subject matter the movie should have been funnier and more entertaining. [00:58:51] So the compositional structure of human language goes together to mean that, because they're buried under 'should have been', the funniness and entertainment are actually lacking, and so it's a negative review. These were the kind of examples we were interested in, asking: could we actually understand the structure of sentences more and do a better job at sentiment analysis? And so
up until this time, people just had pieces of text and a classification judgment of positive or negative. So we decided we were going to do more than that and come up with the Stanford Sentiment Treebank, where what we did was parse up a whole lot of sentences, almost 12,000 of them, and then put sentiment judgments on every linguistic phrase of the sentence. [00:59:56] So for something like this example: 'with this cast' is a phrase with no sentiment, so that would just be neutral; 'entertaining' is a one-word phrase, its sentiment is positive; 'funnier and more entertaining', that's a phrase, very positive. But then by the time we're embedded under 'should have been funnier and more entertaining', that's a bigger phrase, and its sentiment is now negative; and 'the movie should have been funnier and more entertaining', that's an even bigger
phrase, and it's negative. [01:00:36] So we were parsing up trees like that. These examples are very small, I'll show you bigger examples later, but you can see that in the trees there are blue nodes and orange nodes, corresponding to positive and negative sentiment units at the different sizes. [01:00:56] The interesting thing is that this gave us a richer annotated data set, because it's not only whole sentences or whole articles that were annotated for sentiment; we had annotations for the different phrases. And simply the fact that you were annotating phrases meant that you could learn more from the examples: even if you're using something very simple like a Naive Bayes classifier, because there are annotations on words and smaller phrases, you could learn a bit more about which were positive and which were negative.
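The payoff of phrase-level labels can be seen in a tiny sketch (a hypothetical Node structure and made-up labels, not the treebank's actual file format): every node of a labeled tree, not just the root, yields a training pair.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: int                  # 0 = very negative ... 4 = very positive
    word: Optional[str] = None  # set on leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def phrase_text(n):
    # The phrase covered by a node is just its leaves, left to right.
    return n.word if n.word is not None else phrase_text(n.left) + " " + phrase_text(n.right)

def training_examples(n):
    """Every node, not just the root, contributes a (phrase, label) pair."""
    out = [(phrase_text(n), n.label)]
    if n.word is None:
        out += training_examples(n.left) + training_examples(n.right)
    return out

# 'not' alone reads negative, 'dull' is negative, but the combined phrase
# flips positive; labels here are illustrative.
tree = Node(3, left=Node(1, word="not"), right=Node(1, word="dull"))
examples = training_examples(tree)
# three training pairs from one two-word sentence, instead of one root label
```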
[01:01:37] And so that was the first result that we could show people: a baseline method of a bigram Naive Bayes classifier, which is a very common sentiment classifier. If you just trained it with sentence labels, you got 79% on this data set; if you trained it using every node of the treebank, you got 83%, so you got a 4% lift, and that was kind of good. [01:02:05] These other two lines show two of our early tree RNNs, and the negative part of the result is that they weren't really better than a bigram Naive Bayes classifier. They were better than a unigram Naive Bayes classifier, but a lot of the extra information that you want to capture for sentiment analysis you can get from bigrams, because bigrams can already tell you 'not good', 'somewhat interesting', and things like that. [01:02:38] But then the other hope was to have a more powerful model.
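A minimal sketch of why bigram features already capture much of this (a toy Naive Bayes with unigram-plus-bigram features and made-up training snippets, not the lecture's actual baseline): the bigram 'not_good' survives as a single feature, so plain Naive Bayes can learn it is negative even though 'good' alone is positive.

```python
from collections import Counter
import math

def features(text):
    # Unigrams plus adjacent-word bigrams.
    toks = text.lower().split()
    return toks + [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

# Tiny made-up training set: label 1 = positive, 0 = negative.
train = [("a good movie", 1), ("good fun", 1),
         ("not good at all", 0), ("a bad movie", 0)]

counts = {0: Counter(), 1: Counter()}
for text, y in train:
    counts[y].update(features(text))

def log_prob(text, y, alpha=1.0):
    # Multinomial Naive Bayes log-likelihood with add-alpha smoothing.
    total = sum(counts[y].values())
    vocab = len(set(counts[0]) | set(counts[1]))
    return sum(math.log((counts[y][f] + alpha) / (total + alpha * vocab))
               for f in features(text))

def predict(text):
    return max((0, 1), key=lambda y: log_prob(text, y))

# The bigram feature lets the model read "not good" as negative.
assert predict("not good") == 0
assert predict("good movie") == 1
```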
recursive [01:02:46] then led into use of this recursive newal tensor Network which allowed sort [01:02:49] newal tensor Network which allowed sort of the mediated multiplicative [01:02:51] of the mediated multiplicative interactions between word or phrase [01:02:54] interactions between word or phrase vectors [01:02:57] vectors um and so we built that and so then here [01:03:01] um and so we built that and so then here are the results of that model that's [01:03:03] are the results of that model that's shown in red um so by having our [01:03:07] shown in red um so by having our recursive newal tensor Network we were [01:03:10] recursive newal tensor Network we were able to build um a somewhat better newal [01:03:14] able to build um a somewhat better newal Network that performed at least [01:03:18] Network that performed at least reasonably better than a byr naive Bas [01:03:22] reasonably better than a byr naive Bas model rate that we were getting sort of [01:03:24] model rate that we were getting sort of about 22% better than a by byr NA Bas [01:03:28] about 22% better than a by byr NA Bas model so that was progress but I think [01:03:31] model so that was progress but I think perhaps the more interesting thing isn't [01:03:33] perhaps the more interesting thing isn't sort of the aggregate results but the [01:03:36] sort of the aggregate results but the fact that because we were building up [01:03:39] fact that because we were building up this [01:03:40] this model the computed [01:03:43] model the computed representations over a constituency tree [01:03:47] representations over a constituency tree that it actually made judgments of [01:03:49] that it actually made judgments of different parts of sentences and how [01:03:52] different parts of sentences and how they combined so um here's the movie [01:03:55] they combined so um here's the movie review sentence there are slow and rep [01:03:58] review sentence there are slow and rep repetitive Parts but it has 
just enough spice to keep it interesting'. I hope you'll agree with the judgment that overall that's a positive statement about the movie. [01:04:07] And the recursive neural tensor network builds the tree structure over this sentence, and it says, you know, 'slow and repetitive', that's negative; 'there are slow and repetitive parts', it's all negative over here. But for the part over to the right, 'interesting' and 'spice' are both positive; 'spice to keep it interesting', that's positive; 'it has just enough spice to keep it interesting', positive. And it correctly predicts that when you put these two halves of the sentence together, the overall judgment is that this remains a positive review, and it gives a positive judgment overall. So that was kind of cool. [01:04:54] And in particular, the fact that we were building these phrase judgments meant that it seemed like we could
actually do a better job of sentence understanding, in the way that any linguist doing linguistic semantics would like to see sentence understanding. [01:05:14] So one of the things that neural networks looking at language have often been faulted for, and are still faulted for to this day with Transformer models, is that you often find that neural network models just don't pay attention to negation: you can compare the sentence 'a lot of students are studying for their final exams' versus 'a lot of students aren't studying for their final exams', and the negation just gets lost; it doesn't produce the differences in representation and meaning that you'd like it to. [01:05:59] So somewhat interestingly, with this model it seemed like, because we were modeling the recursive
building up of sentence structure, we actually could do interesting things with modeling negation. [01:06:19] So in particular, the result that you'd like to get: if you have something like 'it's just incredibly dull', dull is a very negative word, and incredible is a positive word by itself, but when you're saying 'incredibly dull' it's definitely still negative. And our recursive neural tensor network correctly models that 'it's just incredibly dull' is very negative, despite incredible being a sort of positive word. [01:06:56] Now, actually, in this model there was five-way classification: very negative, somewhat negative, neutral, somewhat positive, very positive. So there's some bouncing around as to whether it's giving the classification very negative versus somewhat negative; I can't really explain why in the
middle it goes to [01:07:18] somewhat negative and then goes back to very negative, but those are the results that came out of the network. At any rate, it all stays negative: the fact that "incredible", or "incredibly", by itself is a positive word, when it's seen in the modification of "dull", that keeps it negative. But on the other hand, if you put a negation in here, "it's definitely not dull", well, then what happens? Now, interestingly, the word "not" by itself is a negative word: if you just do the raw statistics of it, "not" occurs much more often in negative-sentiment sentences than it does in positive-sentiment sentences. So, you know, if you want to be a more positive person, use negation less. So "not" by itself is negative, but if you then combine it together, "not dull", or in this case "definitely not dull", well, in "not dull" you have two negations, so that they
cancel each other out [01:08:27] and you get something that's positive, and so "it's definitely not dull" comes out as a positive sentence. And so the interesting result here is what you see if you compare what happens between these cases. If you have negated positive sentences, you know, "it's definitely not good", various models can model that correctly, because "not" is a negative word and so it weakens the positivity of the positive word; putting a "not" in front of a positive word, into a positive sentence, makes it less positive. Even a naive Bayes model can do that, because "not" by itself is seen as a negative word. But the hard case is what happens if you negate a negative sentence. Well, the result that you should get is that it becomes more positive, and neither a bigram naive Bayes model nor our earlier attempts at recursive models can capture
that, whereas this tree-recursive network structure was able [01:09:42] to correctly capture this sort of semantic modification structure and say, hey, that's made the sentence much more positive. So that was a cool result, and to some extent, you know, this result I think still isn't captured as well by any of the current Transformer models, even though they have many other advantages and are much better than a tree-recursive neural network. So, yeah, this is basically the end; just to say a couple of final remarks about these tree-recursive neural networks: you know, the reason that they became uncompetitive is that they just didn't allow the kind of associations and information flow that you have in a Transformer. Right, these models had a strictly context-free backbone, and the only information flow was tree-structured, following the
context-free backbone, [01:10:51] whereas in the Transformer you've got this attention function where at every position you're looking at every other position, and so you can have much more general information flow. And in general that is just good, and Transformers are much more powerful. But, you know, on the other hand, to the extent that you actually want to model the semantics of human language carefully, sort of what modifies what, and how negation or quantifiers in a sentence behave, in some sense these models were more right. And so one of the things I'm still kind of interested in is: are there any opportunities to combine together some of the benefits of both of these ways of thinking, and have something that's a bit more tree-structured while still more flexible, like a Transformer? Okay, that's it for today. Thanks a lot.
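The tree-recursive composition idea discussed above can be sketched in a few lines. This is a minimal illustration of the general idea only, not the actual recursive neural tensor network from the lecture (which adds a tensor interaction term and trained weights); the toy vocabulary, vector dimensions, and random weights here are invented for the example:

```python
import numpy as np

# Minimal sketch of tree-recursive sentiment composition: each word has a
# vector, a parent node's vector is composed from its two children, and
# every node can be classified into one of five sentiment classes.
# All weights below are random placeholders, not learned parameters.
rng = np.random.default_rng(0)
d, n_classes = 8, 5  # classes: very negative .. very positive

# Hypothetical toy vocabulary; in a real model these vectors are learned.
vocab = {w: rng.normal(size=d)
         for w in ["it's", "definitely", "not", "incredibly", "dull"]}

W = rng.normal(size=(d, 2 * d)) * 0.1       # composition weights
Wc = rng.normal(size=(n_classes, d)) * 0.1  # node classifier weights

def compose(left, right):
    """Parent vector from two child vectors (tanh of a linear map)."""
    return np.tanh(W @ np.concatenate([left, right]))

def classify(v):
    """Softmax over the five sentiment classes for one node."""
    z = Wc @ v
    e = np.exp(z - z.max())
    return e / e.sum()

# Compose following the parse: (definitely (not (incredibly dull)))
inc_dull = compose(vocab["incredibly"], vocab["dull"])
not_inc_dull = compose(vocab["not"], inc_dull)
phrase = compose(vocab["definitely"], not_inc_dull)

probs = classify(phrase)  # a distribution over the five classes
```

With trained weights, this node-level classifier is what lets such a model assign sentiment to every phrase in the parse, so a learned "not" can flip the class of the subtree it attaches to.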
================================================================================ LECTURE 018 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 18 - NLP, Linguistics, Philosophy
Source: https://www.youtube.com/watch?v=NxH0Y78xcF4
---
Transcript
[00:00:05] Okay, hi everyone, I'll get started. The last class! Okay, yeah, well, welcome, congratulations, and thank you for making it to the last real lecture of CS224N. Yeah, so this is the plan for today. The lecture's titled "NLP, Linguistics, and Philosophy", which I took as meaning that I could talk about anything I wanted to, and so that is what I'm going to do. So this is what we're going to go through: talk a bit about the major ideas of CS224N and open problems; some of the more foundational questions of where we are with LLMs; symbolic versus neural systems; meaning, and linguistics and NLP; and then I'll close with some slides on the future risks of AI in the world. Okay, so here is an attempt to sort of
lay out the most major things that we [00:01:05] looked at in CS224N. We started with word vectors; then we developed the idea of neural NLP systems; we expanded from a simple feed-forward network into doing sequence models, language models, RNNs, LSTMs; and then we introduced this powerful new model that's been very influential, the Transformer. And then we built from there to the kind of, well, it's not exactly an architecture, but the model that's been built up in recent years to produce high-performance NLP systems, where we're first doing pre-training and then a post-training phase of various techniques that we talked about, to produce these general foundation models that understand language so well. And then we went on from there and talked about various particular topics like benchmarking and reasoning. So a few of the major ideas that we looked at were: this idea that you could
get a [00:02:04] long way by having dense representations: those are our hidden representations in neural networks. And then looking at distributional semantics, representing words by their context, Firth's slogan of "you shall know a word by the company it keeps", and I'll come back to that a bit later when talking about ideas of meaning. But you know, that's essentially been the idea that has driven most of the successful ideas of modern NLP, whether it's the earlier statistical NLP phase or the more modern neural NLP phase. And in this world we start instantiating that as these models of word vectors, but the same contextual idea is then used in all the models up through Transformers. We looked at both the challenges and opportunities of training large deep neural networks, and how gradually people developed ideas and tricks, such as having residual connections, which made
it much more possible and stable to do [00:03:07] successfully, which took us from a place where a lot of this seemed black magic that was hard to get right, to people being able to very reliably train high-performance Transformer models. We talked about sequence models, what's good about them and some of their problems, and how those problems have been addressed in large measure by adopting this different architecture of Transformers, which gives a form of parallelization. And then we moved into the modern form of pre-training by language modeling, where language modeling seems a simple thing, predicting words in context, but it emerges as what we think of as a universal pre-training task: all kinds of both linguistic and world knowledge help you to do this task of predicting words better. And so this has ended up as just a general method to produce the kind of
powerful, knowledgeable models that we [00:04:09] have today. And up until now there's been this amazing property that we see, this empirical fact that we seem to just get extremely linear improvements in performance as we continue to scale data and compute and model size up by orders of magnitude. That doesn't mean that all problems in NLP are solved; there are lots of things that people still work on and see opportunities to try and make things better, and a few of these are mentioned on the next few slides. So there's a real question of how much these models are good at actually learning to be able to do things generally, rather than just being very good at memorization: a lot of the benefit of what we're getting from these large pre-trained language models is that they've seen a huge amount of stuff
and therefore they know everything: [00:05:19] they've seen every pattern before, and they know how to use things. I've occasionally used the analogy that large language models are sort of like a talking encyclopedia, that they're really in many ways more like a huge knowledge store than necessarily something that is intelligent in the sense of being able to work out how to solve new problems and generalize as human beings do. A kind of interesting fact, actually, is that in some ways Transformer models are actually worse at generalizing than the older LSTMs that preceded them. So here's just one little graph I'm not going to spend a lot of time on, but this was looking at data that's being generated by a finite automaton, and then trying to learn it from a limited amount of data with either an LSTM or a Transformer. And the observation is that, you know, at the
scales that they're working, [00:06:24] even having seen quite limited exemplification, the LSTM is basically at the ceiling of this entire graph, it's just at the one line, because it generalizes in good ways because of its LSTM architecture, whereas the Transformer needs to see a ton more data before it actually learns the patterns well. And so, if we think of one of the prime attributes of human intelligence, it's actually that we're amazing at figuring out and learning things from very limited exposure. Right? You know, there's something that you don't know how to do, and a friend shows you once what you do to make it work, and by and large, you know, you'll improve a few times with practice, but you can learn effectively new skills from these kinds of single-shot examples. And that's not always what we seem to be seeing in our models. There's a lot of interest in what's
going on inside neural networks: [00:07:31] a lot of the time, neural networks still appear as black boxes, where we have no real idea of how they're doing what they're doing, and, as perhaps for your final projects, the main thing you're doing is measuring the final performance number and seeing if it goes up or not. So there's a lot of interest in better understanding: what do they learn, how did they learn it, why do they succeed and fail? And a lot of that work has started to look more closely into what's happening inside neural network computations. There is some work of that sort that actually goes back quite a fair way. So here's an old blog post by Andrej Karpathy, while he was a grad student here, in 2016, and he was looking at LSTMs and how they learn, and he found that one of the neurons in an LSTM cell was effectively measuring position along a line of text, and as the line of
[00:08:33] text got long, its value started to change, because the model was learning that there was sort of a line length to this text and that the line was likely to be ending at that point. And in recent times, there has started to be, with Transformers as well, a lot of work looking at mechanistic interpretability or causal abstraction, trying to understand the internals of models. A problem that's far from solved, and in many respects probably unsolvable, is the multilingual question of dealing with all the other languages of the world. You do have to keep in your head that whatever you see for English, it's worse for every other language in what they're getting out of modern language models. Now, you know, there is a good news story here; I don't want to claim that everything is terrible. So in this graph, which is kind of small, the blue line was the
performance of GPT-3.5 in English, and then [00:09:39] all of the green bars are then the performance of GPT-4. And so, you know, there's a genuine good news story here, which is: look, not just for English but for a lot of other languages, for Greek, Latvian, Arabic, Turkish, all of them in GPT-4 are better than English was in GPT-3.5. So, you know, that's the good news argument: that building these models big is in some sense raising all boats. But, you know, these are still all huge languages, and things are starting to drop off at the bottom of this table, for languages where the performance is worse than English in GPT-3.5. But even those are languages for which much less written data is available yet which are still large languages. So the three at the bottom are actually all Indian languages: they're
Punjabi, Marathi, and Telugu, [00:10:48] which are languages that are each spoken by millions of people; they're not small languages. So the real question is what happens when you actually get to the low-resource languages. The vast majority of languages around the world don't have millions of speakers; they vary from having hundreds of speakers to hundreds of thousands of speakers, and there are thousands of such languages. A lot of those languages are primarily oral and have very limited amounts of written text. Now, many of those languages are likely to go extinct in the coming decades, but many of those language communities would like to preserve their languages, and it's very unclear how the kind of language technologies that we've been talking about in the later parts of the course can be extended to those languages
because there just isn't [00:11:46] sufficient data to build the kind of models that we've been looking at. So, I imagine you've gotten some idea in this course of how evaluation is a huge part of what we do: effectively, a lot of the way that progress is being driven is by defining evaluations of what models should be able to achieve, and then people working to measure systems and improve systems so they do better on what we see as good language understanding or other properties. One of the concerns that many people have about what's happened with the large recent closed models from large companies is a concern that all of the benchmarks are being sullied and not to be trusted. So here's one example that comes from a tweet by Horace He, and he's noting: I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces,
one of the coding benchmarks. [00:12:55] Of the easiest problems on Codeforces, it solved 10 out of 10 pre-2021 problems but zero out of 10 recent problems; this strongly points to contamination. And the worry is that, every time you're seeing these fantastic results of how well the latest, best language model is performing, at this point so much data is on the web that gets included in the pre-training data for these large language models that essentially they're memorizing at least a good share of the questions that are appearing in these challenges. So they're not actually solving them in a fair way as an independent test set at all; they're just memorizing them. And so there are issues then as to, you know, what kind of thoroughly hidden test sets we can have, or dynamic evaluation mechanisms, so we can actually have benchmark integrity. Another huge area, that a number
[00:13:54] Another huge area, one that a number of us are involved in at Stanford and elsewhere, is making NLP work in different technical domains. Domains including biomedical or clinical medical NLP have a lot of differences of vocabulary and usage. They have a lot of potential good uses, but they also have a lot of potential risk of doing harm if the language understanding is incomplete. I myself have been more involved in legal NLP, working with other people at the RegLab, with Dan Ho, in building foundation models for law. There are all kinds of ways, again, in which this kind of technology could be really useful. The biggest problem in most countries (it's bad in the United States, but it's way worse in a place like India) is that most people can't get access to the kind of legal help they need to solve their problems, because of the cost of it and the lack of trained lawyers. So if more could be done to help people via NLP tools, in principle that would be great; but in practice the tools still don't have good enough language understanding. In the RegLab there's a just-completed study, out at the moment, looking at legal NLP systems, and we were finding that the hallucination rate, the rate at which there was made-up stuff in their legal answers, was effectively one question in six, which isn't a very good accuracy rate if you're someone who wants to rely on these systems for legal advice.

[00:15:43] There are also lots of things to work out dealing with the social and cultural aspects of NLP. NLP systems remain very biased against various cultures and religions. They have certain social norms, you could say, that they pick up from somewhere, but those social norms are very biased against certain groups. And, related to the small languages that I mentioned before, there are lots of issues with underrepresented groups not having the kind of NLP that they'd like to have.

[00:16:22] Okay, so that's the summary of that bit. For the next bit I thought I'd give one more piece of perspective on where we are with the best language models, like GPT-4. I think it's a really interesting moment, because on the one hand the performance of these models is just amazing. Even as someone who works in NLP and has worked in it for many, many years, I can tell a sort of story that we do this training to predict the next word, conditioning on a lot of text, and the model comes to know about things; but in some sense these things still seem like magic. It's just kind of hard to believe how this could possibly work.

[00:17:20] So in this example I asked ChatGPT-4o (I did this just this morning) to write a sonnet explaining the Transformer neural net architecture, in which every line begins with the letter T. It still, frankly, blows my mind, and I don't feel I can really explain, even to myself in a way that's convincing, how this large Transformer is able to take all its pre-training text and, reading that instruction as a next-token prediction machine, successfully produce something that is a sonnet in which every line begins with the letter T. I hope you remember from your high school English classes that sonnets are meant to have ABAB rhyming. It's a little bit imperfect here ("net", "set"; some people will complain about that, but it's basically a rhyme), and then "behold", "unfold" and "grasp", "clasp", so it's basically getting the rhyming right. The lines are meant to be in iambic pentameter, and it has basically produced lines in iambic pentameter: "To delve into the world of neural nets", you know, iambic pentameter. Yeah, it's incredible.

[00:18:44] So here's the sonnet it came up with: "To delve into the world of neural nets, / Transformers rise, a marvel to behold; / Through layers deep, the network's path is set, / To learn from data, patterns to unfold. / The tokens feed into attention's grasp, / To weigh their import, context to align; / The queries, keys, and values in a clasp, / To process sequences with power fine. / Through heads of many, multifaceted sight, / To parallelize, capturing the span; / Transforming each, a matrix math delight, / To synthesize with elegance and plan. / The model's might in every task it shows, / To turn raw data into wisdom's prose."
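The "next-token prediction machine" framing above can be made concrete with a toy sketch: generation is nothing more than repeatedly choosing a continuation for the current context and appending it. The bigram table below is an invented stand-in for a real Transformer, purely for illustration.

```python
# Toy bigram "model": maps a token to a single assumed successor.
TOY_BIGRAMS = {"to": "delve", "delve": "into", "into": "the", "the": "world"}

def generate(prompt, steps):
    """Greedy next-token loop: look up a continuation, append, repeat."""
    tokens = prompt.split()
    for _ in range(steps):
        nxt = TOY_BIGRAMS.get(tokens[-1])
        if nxt is None:  # no known continuation: stop generating
            break
        tokens.append(nxt)
    return " ".join(tokens)

line = generate("to", 4)  # grows the prompt one token at a time
```

A real model replaces the lookup table with a probability distribution over the whole vocabulary, but the outer loop, conditioning on everything generated so far, is the same.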
[00:19:31] Now, you could object that I'm not sure this exactly explained the Transformer neural net architecture; it's a little bit abstract, I'll give it that. But in another sense it did, in one place or another, evoke quite a bit of stuff about Transformers, with queries, keys, and values, multi-headed attention, parallelization with matrix math, and whatever else. It still kind of blows my mind how well that works. And indeed, as natural language understanding and sort of world-understanding devices, these models have clearly crossed the threshold at which they're very usable in many contexts. There have now started to be some fairly good studies of how much value people can get out of using LLMs like GPT-4. This study, by Dell'Acqua and a whole lot of colleagues including Ethan Mollick, took a
bunch of consultants from the Boston Consulting Group. And you know what that's like: that means 23-year-olds graduating from universities like this one, but more on the East Coast, who become Boston consultants. Not exactly dummies. In this study, a controlled task setting, there were actually three groups, but the big contrast is that two of the groups were using GPT-4 to do consulting tasks and one of the groups wasn't. The difference between the two that were is that one of them was given more training on how to use GPT-4, but that didn't seem to make much of a difference. Their result was that the groups using GPT-4 completed 12% more tasks on average, did the tasks 25% more quickly, and produced results judged 40% higher in quality than those not using AI. I think that's a pretty stunning success: GPT-4 and similar LLMs are good enough to actually help people get real work done, with whatever asterisks you want to put on the quality of management-consultant work in various instances. An interesting result is that using these LLMs seems to be a big leveler, and you see exactly the same thing for people using coding LLMs: they're a huge assistance for people whose own skills are weaker, and much less of an assistance for people whose own skills are strong.

[00:22:43] Okay, so that's the good-news story. But if, on the other hand, you'd like more of a good-news story for human beings, here's a study that goes in the other direction: can GPT-4 write fiction that matches the quality of New Yorker fiction writers? And the result of that study was: not even close.
[00:23:08] GPT-4 was measured as 3 to 10 times worse at creative writing than a New Yorker fiction writer. So there's still hope for human beings; hang in there. And I think that's the dual-screen picture we have at the moment: in some ways these things are great and useful, and in other ways they're not so great. That's something we're still going to see playing out in the coming years.

[00:23:44] Living in Silicon Valley, we see a lot of the positive hype, so if you want to see a little of the negative on the other side: late last year there was a piece in the Financial Times, titled "Generative AI: highly intelligent", and I won't read all of it, but basically they wanted to express considerable skepticism about the current AI boom: "Investors should keep their heads. Expectations for generative AI are running way ahead of the limitations that apply to it. As investment in generative AI grows, so does pressure to create new use cases. By 2027, IDC thinks enterprise spending on generative AI will reach $143 billion, up from $16 billion this year", so ten times up. "OpenAI hopes for more funding to pursue human-like AI. It is worth remembering, when examining Altman's plan for superintelligence, that models predict; they do not comprehend. That limitation casts doubt on AI achieving even human-like intelligence." And then they start talking about some of the problems, like limited gains for low-skilled workers and inaccuracies in the work they produce, and suggest that the limitations will become more obvious as generative AI tools roll out, which will put pressure on providers to address
costs. "AI could add $4 trillion to profits, says McKinsey, but pricing clarity is lacking. Without it, companies cannot predict what financial gains AI can accomplish; and AI cannot predict that either."

[00:25:32] Okay, that's that topic; I'm chugging through my topics. For the next topic I wanted to return and say a bit more about the symbolic methods that dominated AI from the '60s until about 2010, versus what I've termed here cybernetics, because the original alternative, going back to the '50s and '60s, was called cybernetics. In a very real sense, neural networks are a continuation of the cybernetics tradition rather than of the AI tradition that started in the '50s and '60s. In this context, Stanford is the home of the Symbolic Systems program; at the moment we are unique in having a Symbolic Systems program. The name "symbolic systems" came about because, at the time the program was started, philosophy was an active part of it, and Jon Barwise, shown in this picture (he died young, in 2000), had a very strong belief that you need to be dealing with meaning in the world, and with the connection between people's thinking and the world. So he refused to allow the program to be called cognitive science, as it's called at most other places, and it ended up being called Symbolic Systems. Now, at one point there were two universities that had symbolic systems, because Jon Barwise actually moved away from Stanford and went to Indiana, which is where he originally was from, so Indiana also had a Symbolic Systems program for a number of years; but they've changed theirs to cognitive science since he died, so we are unique in having Symbolic Systems.

[00:27:48] The idea of symbolic systems (this is sort of what's on the website, with a bit of interpretation) is that symbolic systems studies systems of meaningful symbols that represent the world about us, like human languages, logics, and programming languages, and the systems that work with these symbols, like brains, computers, and complex social systems. Contrast that with the typical view of cognitive science, which focuses on the mind and intelligence as a naturally occurring phenomenon: symbolic systems gives equal focus to human-constructed systems that use symbols to communicate and to represent information.

[00:28:29] In AI terms, AI as a field, and the name "AI", arose around arguing for a symbolic approach. John McCarthy, who's the color photo there, and who founded Stanford's artificial intelligence effort and the original
famous Stanford AI Lab: John McCarthy came up with the name "artificial intelligence", and he very explicitly chose a new name to disassociate what he was doing from the cybernetics approach, which had been pursued by people including Norbert Wiener at MIT, who's shown on the right side. Marvin Minsky, the teeny photo down here, sort of founded artificial intelligence at MIT; McCarthy worked with him for a few years and then came to Stanford. Two of the other most prominent early AI people were Newell and Simon, who were at CMU; they're the other two people on the right side.

[00:29:52] So, in particular, Newell and Simon developed... well, actually, let me say a sentence first. McCarthy's own background was as a mathematician and logician, so he wanted to construct an artificial intelligence that looked like math and logic, effectively. That was AI as a symbolic system, and that position was developed in the philosophy of artificial intelligence by Newell and Simon, who formulated what they called the physical symbol system hypothesis: "A physical symbol system has the necessary and sufficient means for general intelligent action." That's a super strong claim: it's not only claiming that having a symbol system allows you to produce artificial general intelligence, but, through the "necessary" clause, that you can't have artificial general intelligence without a symbol system. So that was the basis of classical AI.

[00:31:08] That contrasts with cybernetics, which had its origins in control and communication, so it's much nearer to an electrical-engineering kind of background, and which wanted to unify ideas of control and communication between animals (maybe perhaps more than humans) and machines. "Cybernetics" comes from a Greek word, kubernetes, and it's interesting how many uses that root has: it's exactly the same root that occurs in Kubernetes, if you're familiar with that as distributed containers on modern systems, and it's also the same root that the word "government" comes from, which of course is a control system as well.

[00:32:11] It was under the cybernetics tradition that neural nets first started being explored. The most famous of the very earliest neural nets are Frank Rosenblatt's, which were used for vision; the neural net was actually wired. To say just a teeny bit about this, in
case you think that AI hype is only a thing of the 2020s: there was just as much AI hype in the 1950s, when Rosenblatt unveiled his perceptron. The New York Times article about it was headlined "New Navy Device Learns By Doing: Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser", and began: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." And this hype is all the more incredible when you get to a later paragraph of the article and find out what the demonstration was actually of: the demonstration people were shown was that this device learned to differentiate between right-arrow and left-arrow pictures after 50 exposures. [Laughter] But there you go.
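For a sense of what a machine like Rosenblatt's was doing, here is a minimal perceptron sketch on an invented left-arrow/right-arrow encoding: three "pixels", with which pixel is lit indicating direction. The data, encoding, and epoch count are illustrative assumptions, though the update rule is the classic perceptron rule.

```python
def train_perceptron(data, epochs=50):
    """Classic perceptron rule: on a mistake, nudge weights toward the label."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in data:  # label: +1 = "right", -1 = "left"
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented 3-pixel "arrows": rightmost pixel lit = right, leftmost = left.
data = [([1, 0, 0], -1), ([0, 0, 1], 1), ([1, 1, 0], -1), ([0, 1, 1], 1)]
w, b = train_perceptron(data)
```

On linearly separable data like this, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates, which is roughly what "learned after 50 exposures" amounted to.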
language? [00:33:56] The position I would like to suggest is this: there's just no doubt that language is a symbolic system, that humans developed language as a symbolic system. It's perhaps most obvious if you think about it in writing, where we have symbols for the letters and words that we use. But even where there's no writing (and the majority of human language use over time has been verbal), even though the substrate it's carried on, whether sound waves or, in sign languages, movements of the hands, is a continuous substrate, the structure of human languages is a symbol system. We have symbols, which are the sounds of human languages: for "cat" we have a k, an a, and a t; those are symbols, and they're recognized in a symbolic way by language users. And indeed, all the
pioneering work on categorical perception in cognitive psychology was done with the sounds of human languages, the phones, as linguists call them. So spoken language also has a symbolic structure. But, going against Newell and Simon, the fact that humans use a symbol system for communication doesn't mean that the processing of the symbols in the human brain has to be a physical symbol system, and similarly we don't have to design our NLP systems, our computer processors, as physical symbol systems either. The brain is clearly much more like a neural network model, and probably neural models will scale better and capture language processing better than something that is a symbolic processor. [00:35:58] That sort of leaves behind the question of why humans came up with a symbol system for communication in the first place. After all, we could have just sort
of hummed at different frequencies, and that could have been our system of communication. I think the dominant idea, which seems reasonable to me, but who knows, is that having a symbolic system gives signaling reliability: if you have discrete target points that are separated, then when there's degradation of the signal you have the ability to recover it well.
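That reliability argument, that discrete, well-separated target points let a receiver recover a degraded signal, can be sketched in a few lines of Python. This is a toy illustration, not anything from the lecture; the symbols, spacing, and noise level are all invented:

```python
# Toy model of signaling over a noisy continuous channel: because the
# symbol inventory is discrete and well separated, decoding to the
# nearest symbol undoes any noise smaller than half the spacing.
import random

SYMBOLS = {"k": 0.0, "a": 1.0, "t": 2.0}  # discrete target points

def transmit(symbol, noise):
    """Carry a symbol on a continuous substrate, with degradation."""
    return SYMBOLS[symbol] + random.uniform(-noise, noise)

def decode(value):
    """Recover the nearest discrete symbol from the degraded value."""
    return min(SYMBOLS, key=lambda s: abs(SYMBOLS[s] - value))

msg = list("kat")
received = [decode(transmit(s, noise=0.4)) for s in msg]
assert received == msg  # noise under half the symbol spacing: always recovered
```

With a continuous "hum" there would be no nearest target to snap back to; the separation between discrete symbols is what buys the error correction.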
[00:36:40] So where does that leave linguistics, which has mainly been developed in terms of describing a symbolic system? I think the right way to think about it is that linguistics is good for giving us questions, concepts, and distinctions when thinking about language acquisition, processing, and understanding. And indeed, one of the interesting things that has come about is that as NLP and AI have developed further and become able to do a lot of the low-level stuff, the sort of higher-level concepts that linguists often talk about, things like compositionality and systematic generalization (which I'll come back to in a few minutes), the mapping of stable meanings to symbols, the reference of linguistic expressions in the world, get talked about more and more in artificial intelligence contexts when building neural systems. And one way to think about it is that a lot of the early neural network work, most notably visual processing but also other kinds of sensory stuff like sounds, is sort of what gets you to insect-level intelligence, and if you want to get higher up the chain than insect-level intelligence, then a lot of the kinds of questions and properties of linguistic systems become increasingly relevant. [00:38:20] At a slightly more prosaic level,
I don't think one necessarily wants to believe all the fine details of different linguistic theories, but for how human languages are structured and how they behave, I think most of our broad understanding from linguistics is right. And so when we're thinking about NLP systems, thinking about understanding how they behave, wanting to know whether they have certain properties, thinking up ways to evaluate them, a lot of that is done in terms of linguistic understanding: wanting to see whether they capture facts about sentence structure, discourse structure, semantic properties like natural language inference, whether you can do things like bridging and anaphora (which I did not cover in this year's class, because we skipped the coreference lecture when we sliced one lecture off the class), metaphors, presuppositions. All of these
things are linguistic notions that we try to get our NLP models to capture. [00:39:26] I just want to make a couple more remarks about the role of human language in human intelligence, which I think is kind of interesting. An interesting person in the history of linguistics is this guy Wilhelm von Humboldt, who was a prominent German academic. Really, the American education system was borrowed from Germany: up until the Second World War, the preeminent place of science and learning was Germany, and Germany, essentially via von Humboldt's work, developed the idea of having graduate education, and the US copied graduate education from Germany and started doing its own. But in that context it was still the case that for people in the United States prior to the 1930s, generally people would go to Germany to finish their education, either to get
their PhD or to do a postdoc or something like that. So if you trace back my own academic tree, or most other academic trees of people who got PhDs in the US, they actually go back a few generations and then they go back to Germany. We don't think of that as much in the modern world. [00:41:01] So Humboldt was influential in developing the university system, but he also worked a lot on language, and he's someone that Chomsky always cites, because he's known for this famous statement that human language must make infinite use of finite means: the fact that we have a limited supply of words and sentence structures, but out of those we can recursively build up an infinite number of sentences. And that, in Chomsky's view, supports the kind of symbolic, structured view of language that he has been advocating.
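"Infinite use of finite means" is easy to make concrete: a handful of rewrite rules, one of them recursive, licenses ever more sentences at greater depth. The toy grammar below is invented for illustration and is not from the lecture:

```python
# Toy grammar: finitely many rules, but the recursive NP rule
# ("NP -> NP that VP") licenses unboundedly many sentences.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["dogs"], ["cats"], ["NP", "that", "VP"]],  # recursion here
    "VP": [["sleep"], ["chase", "NP"]],
}

def expand(symbol, depth):
    """All strings derivable from `symbol` using at most `depth` rule steps."""
    if symbol not in GRAMMAR:          # terminal word
        return [[symbol]]
    if depth == 0:                     # out of derivation budget, prune
        return []
    results = []
    for rule in GRAMMAR[symbol]:
        seqs = [[]]
        for sym in rule:
            seqs = [s + t for s in seqs for t in expand(sym, depth - 1)]
        results.extend(seqs)
    return results

shallow = {" ".join(s) for s in expand("S", 3)}
deeper  = {" ".join(s) for s in expand("S", 5)}
assert shallow < deeper  # strictly more sentences from the same finite rules
```

Every extra level of depth yields new sentences ("dogs that chase cats that sleep ..."), with no change to the finite rule set.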
[00:41:43] that he's been advocating but I think there's sort of another interesting take [00:41:47] there's sort of another interesting take of Von Holts which um we can argue [00:41:50] of Von Holts which um we can argue whether it's um right or not but I think [00:41:53] whether it's um right or not but I think is kind of interesting and one of the [00:41:56] is kind of interesting and one of the things he want wants to stress is that [00:41:59] things he want wants to stress is that language isn't just something um used [00:42:04] language isn't just something um used for the purpose of [00:42:06] for the purpose of communication um that he [00:42:10] communication um that he um I should actually introduce something [00:42:12] um I should actually introduce something here so um so caraman and tersi are two [00:42:17] here so um so caraman and tersi are two well-known cognitive psychologists and [00:42:19] well-known cognitive psychologists and they introduced this idea that there are [00:42:21] they introduced this idea that there are two kinds of thinking system one [00:42:24] two kinds of thinking system one cognition and system 2 cognition and [00:42:27] cognition and system 2 cognition and system one is the kind of subconscious [00:42:30] system one is the kind of subconscious thinking that you're not really thinking [00:42:32] thinking that you're not really thinking of just we process stuff when it comes [00:42:34] of just we process stuff when it comes into our heads whether visual signals or [00:42:38] into our heads whether visual signals or um speech and system 2 Thinking is um [00:42:42] um speech and system 2 Thinking is um the conscious let me think about this [00:42:44] the conscious let me think about this and try and figure out what's going on [00:42:46] and try and figure out what's going on I'm solving a math problem style of [00:42:48] I'm solving a math problem style of thinking and um you know I think you can [00:42:53] thinking and um you 
can see in von Humboldt's writings essentially the same kind of distinction between System 1 and System 2 cognition, although he refers to System 1 cognition as something of the spirit, and System 2 cognition as thinking. And basically he argues for a version of the philosophical position of the language of thought, suggesting that effective System 2 thinking requires extension of the mind through the symbols of language. So he argued that having language is absolutely a necessary foundation for the progress of the human mind, and I think that's actually an interesting perspective, which I have some sympathy with. Obviously we can think without language: we can feel afraid, we can think visually about how things fit together. But I think it's fairly plausible that for the sort of more
abstract, larger-scale thinking that humans engage in, which has led them to higher levels of thought than a chimpanzee gets to, language gives a scaffolding inside the mind that makes that possible. [00:44:20] Another version of that is from the philosopher Daniel Dennett, who actually died just a couple of months ago. Dennett wrote a book called From Bacteria to Bach and Back, and the main thing the book was about was the origin of human consciousness. I'm not going to talk about human consciousness today, but he introduced a model of four grades of progressively more competent intelligences. The four levels he outlined: the bottom one was Darwinian. Darwinian intelligence is something that is predesigned and fixed; it doesn't improve during its lifetime, and improvement only happens by evolution through genetic selection. So things like bacteria and
viruses are Darwinian intelligences. Then after that come Skinnerian intelligences, which improve behavior by learning to respond to reinforcement; so something like a lizard, or perhaps a dog (we could argue about how intelligent dogs are), has Skinnerian intelligence. Then the third level up, Popperian intelligence, is things that learn models of the environment, so they can improve performance by thinking through plans and then executing them and seeing how they behave. In a computational sense, Popperian intelligence kind of means that you can do model-based reinforcement learning. Primates like chimpanzees can definitely do the kind of planning and model-based reinforcement learning that gives you a Popperian intelligence, but actually a lot of recent evidence shows that a lot of simpler creatures can also do it.
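In caricature, that computational reading of Popperian intelligence, simulating candidate plans in an internal model and acting only on the best one, looks something like the sketch below; the number-line world, the moves, and the goal are all invented for illustration:

```python
# Toy model-based planning in the Popperian sense: roll candidate plans
# forward in an internal model of the world, execute only the best.
from itertools import product

GOAL = 3                      # target position on a number line
MOVES = {"left": -1, "right": +1}

def simulate(state, plan):
    """Predict where a plan ends up, using the internal model only."""
    for action in plan:
        state += MOVES[action]
    return state

def plan_ahead(state, horizon):
    """Think through every plan of length `horizon`; pick the best."""
    candidates = product(MOVES, repeat=horizon)
    return min(candidates, key=lambda p: abs(GOAL - simulate(state, p)))

best = plan_ahead(state=0, horizon=3)
assert best == ("right", "right", "right")  # the model says this reaches the goal
```

A Skinnerian learner would have to try plans in the world and be reinforced; the Popperian trick is that bad plans die in simulation rather than in reality.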
creatures can also do it um [00:46:31] simp simpler creatures can also do it um so I'm not sure the facts here so all [00:46:34] so I'm not sure the facts here so all these studies you see are about um crows [00:46:38] these studies you see are about um crows from um the South Pacific Australia and [00:46:44] from um the South Pacific Australia and Fiji and places like that so I'm not [00:46:46] Fiji and places like that so I'm not sure if Northern Hemisphere crows are [00:46:48] sure if Northern Hemisphere crows are Dumber but at least southern hemisphere [00:46:50] Dumber but at least southern hemisphere crows um can learn plans so that they [00:46:55] crows um can learn plans so that they can do multi-stage planning to work out [00:46:59] can do multi-stage planning to work out ways to get a piece of meat that's down [00:47:01] ways to get a piece of meat that's down a Hole by learning to pick up a stick [00:47:03] a Hole by learning to pick up a stick and poke it in and um so you know that [00:47:07] and poke it in and um so you know that even crows can be paparian intelligences [00:47:10] even crows can be paparian intelligences um but what Dennis suggests is that [00:47:12] um but what Dennis suggests is that there's a stage Beyond um paparian [00:47:16] there's a stage Beyond um paparian intelligence which he calls Gregorian [00:47:19] intelligence which he calls Gregorian intelligence and the idea of Gregorian [00:47:22] intelligence and the idea of Gregorian intelligence is that you can build [00:47:24] intelligence is that you can build Thinking Tools which allow you to do a [00:47:28] Thinking Tools which allow you to do a higher level of control of mental [00:47:32] higher level of control of mental searches and So He suggests that you [00:47:36] searches and So He suggests that you know things like well mathematics is a [00:47:40] know things like well mathematics is a thinking tool but well also democracy is [00:47:43] thinking tool but well also 
democracy is a thinking tool; but nevertheless, out of the space of thinking tools, human language is the preeminent thinking tool that we have. And so he suggests that the only biological example we have of a Gregorian intelligence is human beings. So I think in that kind of sense you can say that there's a very important role for language. [00:48:09] Okay, two parts to go in my summary. The next one is: what kind of semantics should we use for language? This is getting back to the question I mentioned for word vectors, and it's kind of interesting. The semantics that has been dominant in philosophy of language and in linguistic semantics is a notion of model-theoretic semantics, where the meaning of words is their denotation, what they represent in the world. I mentioned this, I think, in an early lecture: if you have a word
like "computer", the meaning of "computer" is the set of computers: this one, that one, that one; everything else is out. So it's a denotational relationship between a word and its denotation in the world, or in a model of the world, and that was the notion used in most of the history of AI for doing symbolic AI. That contrasts with the sort of distributional semantics where the meaning of a word is understood from the contexts in which it's used, which is effectively what we're using for our neural models.
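The contrast can be put in a few lines of Python. Both the tiny "world" and the tiny "corpus" below are invented purely for illustration; real distributional models (like word2vec) learn dense vectors rather than raw counts, but the idea is the same:

```python
# Caricature of the two semantics: denotational vs. distributional.
from collections import Counter
from math import sqrt

# Denotational: the meaning of "computer" is the set of things it picks out.
DENOTATION = {
    "computer": {"laptop_1", "desktop_7"},
    "table": {"table_2"},
}

# Distributional: the meaning of a word is the contexts it occurs in.
CORPUS = ("i bought a fast computer . i bought a fast laptop . "
          "the red apple is on the table").split()

def context_vector(word, window=2):
    """Count the neighbors of `word` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(CORPUS):
        if tok == word:
            lo, hi = max(0, i - window), min(len(CORPUS), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[CORPUS[j]] += 1
    return counts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm

# Words used in similar contexts come out similar; unrelated words do not.
assert cosine(context_vector("computer"), context_vector("laptop")) > \
       cosine(context_vector("computer"), context_vector("table"))
```

The denotational dictionary needs the world enumerated by hand; the distributional vectors fall out of raw text, which is why they suit neural models so well.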
[00:49:27] If you look at the traditional view of interpreting the meaning of human language, this is what you'll have seen if you did an intro logic class at some point: you have a sentence, "the red apple is on the table", and you get to write it in some logical representation, first-order predicate calculus or whatever, something like ∃x (red(x) ∧ apple(x) ∧ ∃y (table(y) ∧ on(x, y))). (The one on the slide is a bit different from plain first-order predicate calculus, where you only get "for all" and "there exists".) So you have a formal logic, and in weeks one and two of the logic class you have some English sentences which you translate into formal logic, and after that you forget about human languages and just start proving stuff about formal logical systems. To some extent, what you get in a philosophy class represents the tradition of Alfred Tarski. Tarski believed that you couldn't talk about meaning in terms of human languages, because human languages were, quote, "impossibly incoherent". And from about the 1940s until 1980, Tarski was the preeminent logician in the US; he was at Berkeley. So that was very
much the view of the logicians of the world. But during that period one of his students was this guy Richard Montague, and Montague sort of rebelled against that picture, saying "I reject the contention that an important theoretical difference exists between formal and natural languages." And so he set about showing that you could start building up a formal semantics for describing the meaning of natural language sentences, and Richard Montague's work became the foundation of the work that's used in semantics in linguistics as well: for anyone who's done Ling 130 or 230, the picture you saw is sort of a Montague picture of semantics. And so that was the semantics that was taken over and essentially used as the model of doing natural language understanding for most of the history of NLP, roughly 1960 to 2015 or 2017. You
know, and so the picture essentially was that if we wanted to interpret a sentence like "the red apple is on the table", what we would do is first produce a syntactic structure for the sentence, so we would parse it, and then, using ideas roughly along the lines that Montague suggested, we would construct its meaning by looking up meanings of words in a lexicon and then using the compositionality of human languages to work out the meanings of progressively larger phrases and clauses in terms of the meanings of those words and the way that they are combined, slightly reminiscent of my discussion of tree structures to meanings in the last lecture I gave. And so you would build up a meaning representation of a sentence, which could then give you a semantic meaning of the sentence that you could use in a system.
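That parse-lookup-compose pipeline can be caricatured in a few lines of Python. The lexicon entries, schema, and table names below are all invented, and a real system of that era did full syntactic parsing and Montague-style composition rather than this flat word-by-word walk:

```python
# Caricature of the classic pipeline: look up each word's meaning in a
# lexicon, compose the meanings, and emit an executable query (SQL).

LEXICON = {
    "red":   ("constraint", "color = 'red'"),
    "cars":  ("entity", "cars"),
    "Kathy": ("constraint", "liked_by = 'Kathy'"),
}

def compose(words):
    """Each word contributes a table or a constraint; conjoining the
    constraints mirrors how phrase meanings combine compositionally."""
    table, constraints = None, []
    for w in words:
        kind, value = LEXICON.get(w, (None, None))
        if kind == "entity":
            table = value
        elif kind == "constraint":
            constraints.append(value)
    return f"SELECT COUNT(*) FROM {table} WHERE " + " AND ".join(constraints)

sql = compose(["red", "cars", "Kathy"])
# sql: SELECT COUNT(*) FROM cars WHERE color = 'red' AND liked_by = 'Kathy'
```

The crucial property is that the meaning of the whole query is built from the meanings of its parts, which is exactly the Montague-style compositionality the lecture describes.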
[00:53:27] approximately a slide except titled that I actually used to use in [00:53:29] I actually used to use in cs224n um in the [00:53:32] cs224n um in the 2000s decade right so we have a we have [00:53:37] 2000s decade right so we have a we have um or part of a sentence um I get oh no [00:53:40] um or part of a sentence um I get oh no it's a whole sentence here it is how [00:53:42] it's a whole sentence here it is how many red [00:53:44] many red cars what can I get this sentence I [00:53:47] cars what can I get this sentence I think there's a sentence here how many [00:53:50] think there's a sentence here how many oh how many red cars in Palo Alo does [00:53:53] oh how many red cars in Palo Alo does Kathy like how many red cars in pal does [00:53:57] Kathy like how many red cars in pal does Kathy like and sorry yeah the cars sorry [00:53:59] Kathy like and sorry yeah the cars sorry got hidden underneath here um yeah so we [00:54:02] got hidden underneath here um yeah so we have a sentence we pass it we look up [00:54:05] have a sentence we pass it we look up meanings of words in a lexicon we start [00:54:07] meanings of words in a lexicon we start composing them up um we get a semantic [00:54:10] composing them up um we get a semantic form for the whole sentence which we can [00:54:12] form for the whole sentence which we can then convert into SQL and we can run [00:54:15] then convert into SQL and we can run against a database and we can get the [00:54:17] against a database and we can get the answer and this was a in outline the [00:54:21] answer and this was a in outline the kind of technology that was widely used [00:54:23] kind of technology that was widely used for natural language understanding [00:54:25] for natural language understanding systems that were built anywhere from [00:54:28] systems that were built anywhere from the 1960s to the 2s and1s and you know [00:54:33] the 1960s to the 2s and1s and you know in particular um they were used not only 
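As a toy illustration of that pipeline (look up meanings in a lexicon, compose them into a semantic form, convert to SQL, run against a database), here is a minimal sketch. It is not the system on the slide: the lexicon entries, the table schema, and the keyword-based composition are all invented for illustration, and a real semantic parser composes meanings over an actual syntactic parse rather than by keyword matching.

```python
# Toy sketch of the classic NLU pipeline: lexicon lookup -> composition -> SQL -> database.
# The lexicon entries and the schema below are invented for illustration.
import sqlite3

# "Lexicon": each phrase contributes a fragment of meaning (here, a SQL piece).
LEXICON = {
    "how many":  ("op",     "COUNT(*)"),
    "cars":      ("entity", "cars"),
    "red":       ("filter", "color = 'red'"),
    "palo alto": ("filter", "city = 'Palo Alto'"),
    "kathy":     ("filter", "liked_by = 'Kathy'"),
}

def compose(question: str) -> str:
    """Compose word meanings into one SQL query (grossly simplified:
    keyword matching stands in for a real syntactic parse)."""
    q = question.lower()
    op, table, filters = None, None, []
    for phrase, (kind, meaning) in LEXICON.items():
        if phrase in q:
            if kind == "op":
                op = meaning
            elif kind == "entity":
                table = meaning
            else:
                filters.append(meaning)
    return f"SELECT {op} FROM {table} WHERE {' AND '.join(sorted(filters))}"

# Run the composed semantic form against a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (color TEXT, city TEXT, liked_by TEXT)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", [
    ("red",  "Palo Alto", "Kathy"),
    ("red",  "Palo Alto", "Sam"),
    ("blue", "Palo Alto", "Kathy"),
    ("red",  "Oakland",   "Kathy"),
])
sql = compose("How many red cars in Palo Alto does Kathy like?")
(answer,) = conn.execute(sql).fetchone()
print(sql)     # the composed semantic form, rendered as SQL
print(answer)  # -> 1
```

The brittleness mentioned just below is visible even in this sketch: any paraphrase ("crimson automobiles," "in the Palo Alto area") falls outside the hand-built lexicon and the query breaks.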
[00:54:37] In particular, they were used not only in a purely rule-based grammar-and-lexicon way; this same basic technology was incorporated into a machine learning context, where your goal was to learn various of these parts. You could not only learn the parser, but also learn semantic meanings of words and learn composition rules. The acme of that work was what was called semantic parsing, pioneered by Luke Zettlemoyer and Mike Collins in the 2000s decade and then taken up by others, including Percy Liang; Percy's PhD thesis, and also his early work at Stanford before he was convinced to do neural networks, was semantic parsing work. So, you know, these systems could actually work and were used in limited domains, but they were always extremely brittle. And the interesting question is what humans do.
[00:55:38] There is some evidence that humans do something like this, that they work out the structure of sentences and compute meanings in a bottom-up, mostly projective way. There's a lot of controversy as to exactly how human understanding of sentences works, but there are certainly people who have argued in support of human brains doing something similar. That's obviously not what we're getting with current-day Transformers. And so the question is: do our current-day neural language models provide suitable meaning functions? That's a complex question, because in many ways they do an amazing job at understanding whatever sentences you put into them, but there are still some genuine concerns as to whether they are taking shortcuts, or only work to a certain extent, and don't actually have the same kind of compositional understanding with systematic generalization that human beings do.
[00:56:50] Okay, so that's the traditional denotational semantics view, and it contrasts with the kind of use theory of meaning. In the first or second lecture, and at the beginning of this one, I attributed that to the British linguist J.R. Firth: "You shall know a word by the company it keeps." But it's not only a position of Firth's; it has also been a minority position of philosophers. In particular, it was advanced by Wittgenstein in his later work, Philosophical Investigations. In that work he writes: "When I talk about language (words, sentences, etc.) I must speak the language of every day. Is this language somehow too coarse and material for what we want to say? Then how is another one to be constructed? And how strange that we should be able to do anything at all with the one we have!"
[00:57:48] Philosophical Investigations is written in this sort of vaguely poetical, literary style, but the point of it is meant to be saying: look, these logician people are claiming you can't use natural human languages to express meaning and you have to translate into this symbol system; but isn't that a weird concept, that one symbol system is no good but this other symbol system somehow fixes things? And then, about denotational semantics, he writes: "You say: the point isn't the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word. Here the word, there the meaning. The money, and the cow that you can buy with it. (But contrast: money, and its use.)" And he goes on from there to argue that, you know, the meaning of money is the way that money can be used in the world; the meaning of money isn't pointing at pieces of money.
[00:58:51] Okay, so this is what's referred to as a use theory of meaning, and so the question is: is that a good theory of meaning? Some people just don't accept this kind of distributional-semantic, use theory of meaning as a theory of meaning or semantics. Most prominently in recent NLP work, that's the position of Bender and Koller, who just take it as axiomatic that the only thing that counts as having a meaning is that you've got form over here and meaning over there. But I think that's too narrow. I think we have to argue that the meaning of words arises from connecting words to other things, and although in some sense you could say that connecting words to things in the real world is privileged, it's not the only way that you can ground meanings: you can have meanings in a virtual world, but you can also have meanings by connecting one word to other things in human language.
[01:00:07] And the other thing that I think you need to say is that meaning isn't a zero-one thing, where you either know the denotation of a word or you don't. I think meaning is a gradient thing, and you can understand meanings of words and phrases either more or less. So this is an example I gave in a piece that I wrote a couple of years ago. What is the meaning of the word "shehnai"? Well, maybe a few of you know it, but if you don't, what could I do? Well, if you'd seen or held one, you'd have classic grounded meaning; you'd know something about the denotation. If that's not the case, I could at least show you a picture of one (here's a picture of one), so that gives you some information about what a shehnai is. But is that the only thing I can do?
[01:01:07] I mean, suppose... well, sorry, I left out a bullet point: this gives you a partial meaning of a shehnai, but surely you'd have a richer meaning if you'd heard one being played. And is showing you a picture of one the only thing I can do? Suppose you'd never seen, felt, or heard one, but I told you it's a traditional Indian instrument, a bit like an oboe. I think you understand something about the meaning of the word at that point: you know it's sort of connected to India, and it's a wind instrument using reeds that's used for playing music. I could tell you some other things about it; I could say it has holes sort of like a recorder, but has multiple reeds and a flared end, more like an oboe. Then maybe you know a bit more about a shehnai, even though you've never seen one.
[01:02:07] And if you then extend to what we do more in our sort of corpus-based linguistic learning, you could imagine that it's not that I tried to define one for you; instead, I've just shown you a textual use example, or several of those. So here's one textual use example: "From a week before, shehnai players sat in bamboo machans at the entrance to the house, playing their pipes. Bash Babu disliked the shehnai's wail but was determined to fulfill every conventional expectation the groom's family might have." So if that's all you know about a shehnai, in some ways you understand less of the meaning of the word than if you'd seen one, but actually, in other ways, you understand more of the meaning of the word than if you'd just seen one, because from that one textual example you know some things: you've heard a characterization of the sound as wailing, and you know that it's
connected with weddings, which you don't get from just having held or looked at one, or even having had someone stand in front of you and play it; and that's an important part of the meaning of a shehnai to people. And so that's the sense in which I think meaning comes from various kinds of connections. [01:03:44] Okay, last topic: our AI future. So there are different senses of our AI future and lots of things that we can be worried about. One thing we can be worried about is whether we're all going to lose our jobs. Interesting question. Here's a newspaper article from The New York Times: "March of the machine makes idle hands. Prevalence of unemployment with greatly increased industrial output points to the influence of labor-saving devices as an underlying cause." This was published in The New York Times in 1928.
[01:04:25] But, you know, it turns out that quite a few people like labor-saving machines: washing machines and dishwashers and sewing machines, lots of useful labor-saving machines. And this was published in 1928, at a time when a small group of immensely powerful and rich men dominated the United States, just before the Great Depression. But what happened in the decades after that greatly changed policies in the United States and led to boom years that distributed wealth and work much more evenly across the country, and the country boomed. Here's another one: "In the past, new industries hired far more people than those they put out of business. But this is not true of many of today's new industries. Today's new industries have comparatively few jobs for the unskilled or semiskilled, just the class of workers whose jobs are being eliminated by automation."
[01:05:37] This was Time magazine in 1961. So this is a longstanding fear which, at least so far, has not been realized. Here we are in a country in which not everyone might have the work that they wish they had, but overall almost everybody has a job, and many people are working a lot of hours a week; whereas once upon a time the claim was that before the end of the 20th century we would only have to do a three-day work week, because there wouldn't be much work to go around. Imagine. So another fear is: will almost all the money go to five to ten enormous technology giants? I actually think this is a more serious worry; this seems to be the direction that we're headed in at the moment. I think there's no doubt that modern networks and a concentration of AI talent tend to encourage this outcome.
[01:06:39] But, you know, essentially this is the modern analog of what happened in the early decades of the 20th century. The equivalent then was transportation networks, and it was domination of the new transportation networks, like railways, that led to a few people dominating the economic system. But what happens here essentially comes down to a political and social question. As I was mentioning before, after the Great Depression, countries successfully dealt with the monopolistic power of a small number of companies, and with political leadership we could do that again. The problem is that there's not much sign of political leadership right at the moment; but that's a political problem to solve rather than actually being a technological problem to solve. So the next problem is: should we be afraid of an imminent singularity, i.e., when machines have artificial general intelligence beyond the human level?
[01:07:49] In particular, would such an event threaten human survival? This is a concern that has increasingly exploded into the mainstream with discussions of AI existential risk, and quite a few of the discussions that have been leading to the setting up of things like AI safety institutes in the US and UK are motivated by these worries of out-of-control artificial intelligence taking over and deciding to eliminate humanity. So we get article headlines like "Pausing AI developments isn't enough. We need to shut it all down," "How rogue AIs may arise," "'AI godfather' Geoffrey Hinton warns of dangers as he quits Google," and "We must slow down the race to God-like AI." I don't personally give these concerns too much credence, and I think there's started to be increasing pushback against them.
[01:08:58] So, in the other direction, François Chollet, who is the architect of Keras, argues: "There does not exist any AI model or technique that could represent an extinction risk for humanity, not even if you extrapolate capabilities far into the future via scaling laws. Most arguments boil down to: this is a new type of technology; it could happen." Joelle Pineau, a Meta AI leader, refers to existential-risk discourse as unhinged, and points out the flaw in a lot of the utilitarian argumentation that goes along with discussions of these risks: if you say that the elimination of humanity is infinitely bad, that means any nonzero chance multiplied by infinity will be bigger than the badness of anything else that could happen in the world; but that isn't actually a sensible way to have a rational discussion about the outcomes.
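The flaw being pointed out here can be made concrete with a few lines of floating-point arithmetic (the function name and the numbers are made up for illustration): once an outcome is assigned infinite badness, any nonzero probability of it dominates every finite expected harm, so the expected-value comparison tells you nothing about the actual probabilities involved.

```python
# Sketch of the degenerate expected-value argument, using IEEE floating-point infinity.
import math

def expected_badness(prob: float, badness: float) -> float:
    """Probability-weighted badness of an outcome."""
    return prob * badness

# An "infinitely bad" outcome at a vanishingly small probability...
tiny_extinction_risk = expected_badness(1e-15, math.inf)
# ...versus a near-certain, very large but finite harm.
near_certain_harm = expected_badness(0.99, 1e9)

# The infinity swamps everything finite, no matter how small its probability is.
print(tiny_extinction_risk > near_certain_harm)  # -> True
```

Because the comparison comes out the same way for every nonzero probability, the "risk" term carries no information, which is the sense in which the argument is not a basis for rational discussion.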
[01:10:09] And many people, including Timnit Gebru, have argued that a lot of the outcome of this focus on existential risk (and, if you're more cynical, a lot of the purpose of this focus on existential risk) is to distract away from the immediate harms that are arising from companies deploying automated systems, including their biases, worker exploitation, copyright violation, disinformation, growing concentration of power, and regulatory capture by leading AI companies. And that's something that is worth thinking about: behind all the discussions of our amazing AIs and all the things we can do with them, like get our homework done or generate wonderful images, there are lots of things underneath about disinformation, deception, hallucinations, problems of homogeneity of decision-making, violation of copyrights and people's creativity, lots of carbon emissions, and erosion of rich human practices.
[01:11:23] So we need to be conscious of the sort of present-day harms that can come about from AI. And for NLP as well there are various kinds of harms that we've touched on, which include generating offensive content, generating untruthful content, and enabling disinformation. The disinformation one is an interesting one: if models can reason well about texts, can they also be persuasive in communicating incorrect information or opinions to users? Perhaps there are new possibilities for doing very personalized misinformation propagation that persuades human beings more easily than traditional methods of political advertising. And there's starting to be evidence that that's true. It's still being debated in the literature, but there are now
multiple studies suggesting that humans can be [01:12:20] studies suggesting that humans can be influenced by disinformation generated [01:12:22] influenced by disinformation generated by AIS and it seems reasonable to think [01:12:25] by AIS and it seems reasonable to think that we're going to start to see more [01:12:27] that we're going to start to see more use of that in political systems and [01:12:30] use of that in political systems and elsewhere which is potentially um quite [01:12:33] elsewhere which is potentially um quite scary and you know perhaps the worst of [01:12:36] scary and you know perhaps the worst of it isn't going to be text based it's [01:12:38] it isn't going to be text based it's likely that visual [01:12:41] likely that visual um fakes are going to be even more [01:12:45] um fakes are going to be even more compelling in political context and um [01:12:48] compelling in political context and um this sort of seems like whether it [01:12:50] this sort of seems like whether it happens in the US for this election or [01:12:53] happens in the US for this election or in other countries in their election [01:12:55] in other countries in their election that we're likely to see some major [01:12:58] that we're likely to see some major incidents where um AI generated fakes [01:13:01] incidents where um AI generated fakes can be seen of having a major impacts on [01:13:04] can be seen of having a major impacts on political [01:13:05] political systems so I sort of think really um [01:13:09] systems so I sort of think really um what we should be doing is worrying not [01:13:12] what we should be doing is worrying not about existential risks but worrying [01:13:14] about existential risks but worrying about what people and organizations with [01:13:17] about what people and organizations with power will use AI to do um that this is [01:13:21] power will use AI to do um that this is a pattern that we've noticed multiple [01:13:24] a pattern that we've 
[01:13:27] Also with social media: in the early days of social media there was the idea that this was meant to lead to new freedoms for people across the globe, bringing the positives of free political thought and improved human lives. In large measure that isn't what's happened. New technologies get captured by powerful people and organizations who master the new technological options, and AI and machine learning are being increasingly used for surveillance and control; we're seeing that around the world at the moment.

[01:14:04] So my final thought to end with is a thought about Carl Sagan. When I was young, many decades ago, Carl Sagan did the series Cosmos on television, explaining the miracles of the universe, and at the time, when I was a teenager, I loved Cosmos. Now, that was a long time ago, and much more recently there's a new generation of Cosmos, and the book is advertised on the basis of a new foreword by Neil deGrasse Tyson. I think Carl Sagan was a good guy, and he didn't only write Cosmos; he wrote a number of other books, and another of the books he wrote was The Demon-Haunted World, which has a theme that's a little bit closer to some of the things we're dealing with here. In that book he writes: "I have a foreboding of a world in my children's or grandchildren's time, when awesome technological powers are in the hands of a very few, and no one representing the public interest can even grasp the issues; when the people have lost the ability to set their own agendas or knowledgeably question those in authority; when, clutching our crystals and nervously consulting our horoscopes, our critical faculties in decline, unable to distinguish between what feels good and what's true, we slide, almost without noticing, back into superstition and darkness."

[01:15:44] I think if you look around the US and many other parts of the world today, this is actually much more the risk that humanity is facing, and it's why education, which we try to provide at Stanford and other places, is an important thing that should be valued, along with all the other things that go with it, like open source that supports the broad dissemination of learning. Thank you. [Applause]

================================================================================
LECTURE 019
================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
Source: https://www.youtube.com/watch?v=5vfIT5LOkR0
---
Transcript

[00:00:05] So today I'm delighted to introduce our first invited speaker, Douwe Kiela. As well as being invited, and I'll tell you his background, he's also in the Symbolic Systems Program, where he has
been an adjunct professor and has been involved with some students in that role as well. But in his invited role: he's originally from the Netherlands, where he even learned some logic, among other things, back in the old days. In more recent times he's been a prominent deep learning researcher for a number of years. He worked at Facebook, now Meta, in the FAIR unit, and was involved in various ideas including retrieval-augmented generation. After that he spent some time at Hugging Face. He's become interested in looking at multimodal models, which is what he's going to be talking about today, and we welcome Douwe. It's great to have you.

[00:01:07] Thank you very much. Yes, thanks everyone for coming. I understand that you get points for being here, so you're not really here for me, but thanks for coming anyway. So I'm going to talk about multimodal deep learning.
[00:01:31] It's going to have an NLP focus, of course, since this is an NLP course, but also because otherwise I would be talking for many more hours than I have time for here, so I'll try to keep it focused on the things that I think will be most useful for you to learn. The first thing you should understand is that this whole concept of multimodality is kind of ill-defined, actually. If you go to the dictionary, you'll see that it means having or involving several modes, modalities, or maxima. And what "mode" here really means: it could be mode in the very generic sense, or it could be the very precise sense of the mode of a statistical distribution. Depending on the paper you're reading, in some cases people really mean the statistical sense; in other cases people mean the very vague concept of a modality where it really means the type of information you're getting. An example of a modality in that case is an image, or a speech signal, or audio in general, or even olfaction, so smell, or things like that. In this lecture we're just going to focus mostly on text, because this is an NLP course, and we're going to focus on images as the other modality, to keep it simple.

[00:02:46] All right, so why does it matter? Why do we care about multimodality? There are a couple of really good reasons in general for this. The first one is about faithfulness: if you look at how we humans understand the world, how we make sense of what happens in the world, that is very multimodal. We perceive the world not just using vision or just audio; we synthesize information across all of these different modalities, and that's how we understand the world and each other.
[00:03:17] There's also a very practical argument for doing it: the internet is multimodal. If you go to, I don't know, Facebook or something like that, it rarely happens that it's just text or just an image; there's usually a combination of multiple modalities. And then the final good reason, one that we're just starting to hit now if you're really following where the field is going: we're kind of running out of text data for these large language models. So one interesting way to keep scaling on the data side is to make use of all of these other modalities. If you can have your language model also watch all of the videos of cats in the world, it's going to understand the concept of cat much better, and that's what we want in these models: we want them to understand the world in the same way that humans understand it.

[00:04:06] So right now multimodality is really one of the main frontiers of this new foundation-model drive that we're all in. There's a thing called the McGurk effect; let's see if it loads up. What we'll see when this loads is this guy over here, with the same audio being played each time. So the audio is exactly the same, and this man is going to say something like ... and so you're hearing a "ba" there, I think, if you look at my mouth, because that's what I said. But if you then change the video to where he says ..., with exactly the same audio, you're going to hear the other version. Unfortunately I can't really swap in the different audio here, so you have to trust me on it; we might suddenly start hearing a guy saying ... All right. So, multimodal applications: when we have multiple modalities we can do all kinds of interesting things.
[00:05:11] As I said, most of the use cases we have on the internet are multimodal, and there are some really obvious things we would be interested in if we have information from these different data sources, from different modalities. Obviously we might want to do retrieval: maybe given a bit of text we want to find the right image, or given some image we want to find the right text for it, so we can match them up. We can also do this in a generative setting: then we have image captioning, which you've probably heard of, and text-to-image generation, that's image synthesis, so stable diffusion, which everybody in the audience here has probably seen. Then we could do visual question answering, where we have an image and text and then need to generate some new text. We have multimodal classification, where we have image and text and need to produce a label, for example whether something is hate speech or not. And in general we want to have a richer understanding of information, which means that we combine images and text and then use it for downstream applications that require better understanding or better generation.

[00:06:14] So this field really is super hot right now. There's this nice paper title; I predict that this paper is going to do really well in terms of citations just because it has such a citable title, though I think a lot of people are not actually going to read it. I mean, I've been in this field for quite a while now, and people have been saying this for a really long time, I think Chris would agree: for decades people have been saying that multimodal is the next big thing, but now it's really true, I think.
[00:06:42] All right, so the outline for what we're going to be talking about: first I'm going to tell you a little bit about early models, then we're going to do a bit of a deep dive on some of the specifics, then we're going to go over a particular type of fusion, contrastive models, or late fusion, then we're going to go through a little bit of the history of multimodal foundation models, then we're going to talk a little bit about evaluation and a little bit about other modalities, and then I'll make some predictions for the future and hopefully give you some cool research ideas or things to talk or think about.

[00:07:16] All right. So obviously there's a lot of work that happened before deep learning, but I think if you want to start from the deep learning revolution and what was happening in images and text, then
[00:07:31] a good starting point is, for example, WSABIE, or DeViSE; or Richard Socher, who you've probably heard of, has done some really cool early work that pioneered a lot of these ideas. The basic gist is that we have a vision model on the one hand and a language model on the other. The first lecture of this course, I think, was about word embeddings, so that's just your basic word embedding model, and now we need to figure out how to align them in the same multimodal space. The way you do that is you get some sort of similarity metric, a score function, or a kernel function if you're thinking about this from a support vector machine literature perspective, and now you need to figure out, with a max-margin or margin loss, how you want to align these two points in your embedding space: things that are similar you want to bring closer together, and things that are not, you want to push further apart. If you do that in this multimodal embedding space, you can do interesting cross-modal transfer, where you take the word embedding for something like "auto" or "horse" and then find close images in the embedding space to that thing, and now you've solved the retrieval problem. So this is a really nice early application, and for a lot of the stuff in the early slides you're going to see this idea come back over and over again; you're going to see it get reinvented with fancier models, but it's basically all the same stuff. You can do cross-modal transfer where you have images and text, but you can also combine them together so that you get a multimodal word embedding.
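A minimal sketch of the alignment recipe just described: a similarity score between a word vector and image features, a margin loss that pulls a matched image closer than a mismatched one, and nearest-neighbor retrieval in the shared space. All vectors and the margin value here are toy assumptions, not numbers from WSABIE or DeViSE.

```python
# Toy max-margin cross-modal alignment, in the spirit of WSABIE / DeViSE.
# Vectors are assumed to already live in a shared embedding space.

def dot(u, v):
    """Similarity score: plain dot product between two vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_loss(word_vec, pos_img, neg_img, margin=0.5):
    """Hinge loss: matching image must score `margin` higher than a mismatch."""
    return max(0.0, margin - dot(word_vec, pos_img) + dot(word_vec, neg_img))

def retrieve(word_vec, images):
    """Cross-modal retrieval: index of the highest-scoring image."""
    return max(range(len(images)), key=lambda i: dot(word_vec, images[i]))

horse = [0.9, 0.1]                       # toy word embedding for "horse"
images = [[1.0, 0.0], [0.0, 1.0]]        # toy image features
print(margin_loss(horse, images[0], images[1]))  # 0.0: pair already separated
print(retrieve(horse, images))                   # 0: the horse-like image
```

In a trained system the loss would be summed over many sampled negative images and minimized by gradient descent; here only the scoring and ranking logic is shown.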
[00:09:06] And this just gives you a more accurate representation of how humans understand word meaning, because when we think about the word "moon" or "cat" or something, we can go to Wikipedia and read that a cat is a small carnivorous mammal that people like to keep as pets, or we can just go and look at pictures of cats, and now we understand what a cat is. And I would argue, actually, that for a lot of people the picture of the cat is much closer to the meaning of the concept of cat.

[00:09:31] Some early work where people were trying to do this is from Bruni et al., where they did multimodal distributional semantics using a very elegant approach called bag of visual words. Who has heard of bag of visual words? Very few people, okay. It's surprisingly simple, and so I kind of like it; it's nicely elegant. You take a picture, of a moon in this case, I think you can see it in the back too. We use an algorithm like SIFT to find interesting keypoints: where the difference between a pixel and the pixels next to it is big, those are the spots you want to be looking at. For each of these keypoints you get feature descriptors, relatively small vectors, like 32-dimensional, depending on the implementation. What you can do with these feature descriptors is cluster them using k-means, and then you assign every one of these points to a cluster and count how often they occur. So in this picture of the moon there are three red dots, which is why the count for the red-dot visual word is three. What that gives you is a count over the visual words.
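The bag-of-visual-words pipeline just described (keypoint descriptors, a k-means codebook, then counting) can be sketched as follows. The SIFT descriptors and the learned codebook are replaced here by tiny made-up vectors, since the assign-and-count step is the part being illustrated.

```python
# Bag-of-visual-words sketch. Real pipelines extract SIFT descriptors and
# learn the codebook with k-means; both are stand-in toy vectors here.

def nearest(desc, codebook):
    """Index of the closest codebook centroid (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((d - c) ** 2 for d, c in zip(desc, codebook[i])))

def bag_of_visual_words(descriptors, codebook):
    """Histogram: how often each visual word occurs in one image."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        hist[nearest(desc, codebook)] += 1
    return hist

codebook = [[0.0, 0.0], [1.0, 1.0]]                  # two "visual words"
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8]]   # toy keypoint descriptors
print(bag_of_visual_words(descriptors, codebook))    # [1, 2]
```

The resulting histogram plays the same role for an image that a word-count vector plays for a document.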
[00:10:48] This histogram is very similar to the original bag-of-words model that you've hopefully heard about, maybe in the first lecture; it's the visual equivalent of the textual thing. If you do this, and you then concatenate, or apply SVD to fuse the information, what you get is a word embedding that is much more representative of human meaning, as reflected in the datasets that people used to care about at the time. After that there were a couple of people, me included, who tried to take these ideas and really apply deep learning to them. Some of the very early versions of this used convolutional neural networks: you can transfer the features from your convnet, take your word embeddings, which you've seen in the first lecture, and concatenate them, and now you have a multimodal word vector. Or you can do something slightly fancier.
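Before the fancier variant, the simple concatenation fusion just described can be sketched as below. The L2 normalization before concatenating is a common choice I'm assuming here, not something the lecture specifies, and the vectors are toy numbers rather than real word2vec or convnet features.

```python
# Concatenation fusion: glue a word embedding to an image feature vector
# to get a multimodal word vector. Normalizing each modality first (an
# assumed convention) keeps one modality from dominating via larger norms.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(word_vec, image_vec):
    """Multimodal word vector = [normalized text ; normalized image]."""
    return l2_normalize(word_vec) + l2_normalize(image_vec)

cat_text = [3.0, 4.0]    # toy word-embedding for "cat"
cat_image = [0.0, 2.0]   # toy convnet feature for cat pictures
print(fuse(cat_text, cat_image))  # [0.6, 0.8, 0.0, 1.0]
```

The fused vector can then be used anywhere the plain word embedding was, for example in the word-similarity evaluations mentioned above.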
[00:11:39] Or you can do something slightly fancier. You've seen the skip-gram model; you can also do skip-gram predictions onto image features. So when you see a word like "cat" in some context, like "the cute little cat sat on the mat", then when you see "cat" you also want to predict cat pictures. [00:11:54] Super easy ideas, but it turned out this gives you much richer word representations, which is kind of cool. But obviously words are very limited; what we really care about is not words but sentences. So people then started looking into sentence representations: how do we get compositional understanding into the sentence representations, and how do we align those with images? [00:12:19] The loss here is very similar to what we saw with words and pictures, but now we just have a sentence encoder. There are some really cool early papers here from Andrej Karpathy, and
Richard Socher also had some work here. [00:12:31] The basic idea is that instead of word embeddings we now have an LSTM in these papers, or some other kind of recurrent neural network, or in the case of this one a recursive neural network, and then we try to align the features together. [00:12:53] These three or four papers are actually very important; this one by me is less important, but it's still kind of interesting, because we showed that grounded sentence representations work: if you just use this part here as a sentence encoder for NLP tasks, the ability to predict pictures from it already gives you a really good sentence representation. Just by predicting pictures you can sort of imagine what things look like, and that gives you a really good meaning representation, which you can then transfer to, I don't know, sentiment classification or something else.
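The alignment loss in this line of work was typically a max-margin ranking loss over sentence-image pairs; here is a minimal numpy sketch (the margin value, dimensions, and random embeddings are illustrative, not taken from any particular paper):

```python
import numpy as np

def ranking_loss(sent_emb, img_emb, margin=0.2):
    """Max-margin ranking loss: a sentence should score higher with its own
    image than with any other image in the batch (and vice versa)."""
    # cosine similarity matrix: entry [i, j] = sim(sentence i, image j)
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sim = s @ v.T
    pos = np.diag(sim)  # matched pairs sit on the diagonal
    cost_s = np.maximum(0, margin + sim - pos[:, None])  # wrong image per sentence
    cost_i = np.maximum(0, margin + sim - pos[None, :])  # wrong sentence per image
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_s.sum() + cost_i.sum()

# with one-hot embeddings every matched pair wins by the full margin
print(ranking_loss(np.eye(4), np.eye(4)))  # 0.0
```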
[00:13:26] And then of course, once we have sentence encoders, we also have decoders. When the sequence-to-sequence architecture came out, which you've probably also heard about in this course, what you can do, instead of having a text encoder for your source language as in machine translation, is plug in a convnet in place of the LSTM encoder, and now you can generate captions. That's exactly what people did. [00:13:52] We used to have all these fancy diagrams in our papers where we explained the LSTM and how it works; people probably don't learn that anymore these days. They do? Very good. They might make a comeback; I think at some point Transformers are going to go away, we'll see.
[00:14:17] One of the things that people figured out in machine translation very early on is that you can do alignment of words between your source language and your target language, and you can do the same thing with images: if you want to align a word in your generated sequence with something in your picture, you can use the same approach, and that approach of course is called attention. [00:14:37] You've probably learned a lot about attention in this course, and it was one of the building blocks of these systems as well. You can do very interesting things and really see that when the model has to generate "stop" for the stop sign, it is actually looking at the stop sign; there's a really cool alignment going on in these models.
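Schematically, one attention step in such a captioning model scores every image region against the current decoder state and takes a softmax; a toy sketch (dot-product scoring is one common choice, and the shapes are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, region_feats):
    """One attention step: score each image region against the decoder state,
    then return the weights and the weighted sum of region features."""
    scores = region_feats @ decoder_state  # dot-product scoring, one per region
    weights = softmax(scores)              # attention distribution over regions
    context = weights @ region_feats       # blended visual context vector
    return weights, context

rng = np.random.default_rng(0)
regions = rng.normal(size=(9, 16))  # e.g. a 3x3 grid of region features
state = rng.normal(size=16)         # decoder state while generating "stop"
w, ctx = attend(state, regions)
print(round(w.sum(), 6))  # 1.0: the weights form a distribution over regions
```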
[00:15:06] The final kind of early model we should talk about a little is GANs. Who here knows GANs? OK, that's a lot more than bag of visual words; I guess that makes sense. [00:15:14] The basic idea of a GAN is that you have a generator and a discriminator, and you want the generator to generate images that the discriminator cannot distinguish from real ones, so it cannot tell fake and real images apart. [00:15:27] If you do that, you can condition the whole thing on a piece of text, and then you can generate images from a text prompt. That's what the first versions of text-to-image generation were doing, and something like Stable Diffusion is a natural progression from that kind of model.
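A text-conditional GAN of the kind just described can be sketched like this; the tiny random-weight networks and all dimensions are made-up stand-ins, and only the forward losses are computed, not a training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# made-up weights for a toy generator and discriminator
Wg = rng.normal(size=(24, 32), scale=0.1)      # (noise + text) -> fake "image"
Wd = rng.normal(size=(32 + 8, 1), scale=0.1)   # (image + text) -> real/fake score

def generate(noise, text_emb):
    """Generator: map noise concatenated with a text embedding to an image vector."""
    return np.tanh(np.concatenate([noise, text_emb]) @ Wg)

def discriminate(image, text_emb):
    """Discriminator: score how 'real' this image looks for this text."""
    return sigmoid(np.concatenate([image, text_emb]) @ Wd)[0]

text = rng.normal(size=8)   # stand-in embedding of e.g. "a photo of a cat"
fake = generate(rng.normal(size=16), text)
real = rng.normal(size=32)  # stand-in for a real image's features
# discriminator wants real -> 1 and fake -> 0; generator wants fake -> 1
d_loss = -np.log(discriminate(real, text)) - np.log(1 - discriminate(fake, text))
g_loss = -np.log(discriminate(fake, text))
print(d_loss > 0, g_loss > 0)  # True True: both losses are strictly positive here
```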
[00:15:44] So those were the early models. Do people have any burning questions about this, or does this all make sense? All right. [00:15:56] So let's do a bit of a deeper dive, in particular on features and fusion, since those are really the core building blocks for all of this multimodal stuff. But before we go there, very briefly: if all of this multimodal stuff is cool and useful and doesn't look that difficult, why aren't we all doing multimodal things? Why do we focus on specific modalities? I think there are a couple of problems to be aware of. [00:16:26] One is that modalities can dominate: text in particular is much more dominant than vision or audio in many use cases, so you can get a model that picks up on the text signal and basically learns to ignore the image completely. That actually happened, embarrassingly, for visual question answering, which we'll get to: you could do visual question answering without actually looking at the picture.
[00:16:51] Another problem is that the additional modalities can add a lot of noise, which makes your machine learning problem more difficult. You also don't always have full coverage: as I said, if you look at Facebook posts, sometimes you have text, sometimes you have pictures, sometimes you have both, but there's no guarantee you always have both, so how do you deal with that? [00:17:05] And in many cases we just weren't ready; it was too complicated to implement things, and in general, designing your model to combine all the information is actually quite hard. [00:17:24] To drive that point home a little: featurizing text, I guess we all know how to do by now, especially in the age of Transformers, and before that LSTMs. You have batch size by sequence length by embedding size, so it's always a 3D tensor, and that's how you encode your textual information when you pump it through your neural net.
[00:17:46] With images it's trickier. You can just look at patches, but if you do convolutions you're shifting over the image and aggregating, and in many cases you don't want to be that uniform; you want something that actually looks at the things in the picture. [00:18:05] That's what region features are: you use an object detector as a first step for processing the image, and then a convnet backbone encodes the features for each particular sub-image, so this guy's skateboard, say, gets its own vector representation. [00:18:22] And then, in terms of dense features, we now also have Vision Transformers.
[00:18:28] Let's very quickly go over these to make sure we're on the same page. For detection there are all these models; YOLO is a really good one if you haven't heard of it yet, and we're at YOLOv7 now, I think, with a new one coming out every other year or so. The basic idea is that we get bounding boxes for things in the images, or actually segmentations along with the bounding boxes is what people tend to use, and they have labels, so this one is labeled "backpacker" or something. [00:18:55] You can run this as a pre-processing step on your image to get a much richer representation of what is really in that image, which you can then pump into your system, as we'll see later. [00:19:06] As for encoding the information in these little bounding boxes, or in the image itself in general,
we just use a standard convnet for that. [00:19:17] This probably feels super obvious now, but in 2014, when people were starting to discover it, it was really very surprising that you could just use off-the-shelf convnet features to replace the entire computer vision pipeline. People had spent decades refining all of this very fancy, sophisticated machinery, and then it was all thrown away and replaced by a convnet that does all of that for free. [00:19:43] The cool thing you get in return is that you can transfer very easily across tasks: you can take one very generic convnet and use it for all kinds of very specialized things, like spotting buildings in Paris, or flowers, or other stuff.
[00:20:00] And then of course we come to the age of Transformers. We're already quite a while in, and this is only the first Transformer in the slide deck, so we're making good progress. [00:20:12] Vision Transformers are what we would use these days to encode images. You have these flattened patches, and then you run basically the standard BERT architecture, as you know it from this course, and then you do classification. [00:20:26] Everything is a standard Transformer, except that the input is not words or tokens but patches of an image, and then you classify that.
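The "flattened patches" step can be sketched as follows; the patch size, image size, and random projection matrix are illustrative (a trained ViT would learn the projection):

```python
import numpy as np

def patchify(image, patch=8):
    """Cut an image into non-overlapping patches and flatten each one."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)          # group the two patch axes together
            .reshape(rows * cols, patch * patch * c))

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))        # toy 32x32 RGB image
patches = patchify(img)              # 16 patches, each 8*8*3 = 192 values
W = rng.normal(size=(192, 64))       # stand-in for the learned linear projection
tokens = patches @ W                 # the Transformer's input "tokens"
print(patches.shape, tokens.shape)   # (16, 192) (16, 64)
```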
[00:20:37] All right, so then we have a bunch of features; now how do we combine the information? Let's say we have two vectors u and v. It sounds easy, but it turns out there are very many ways to combine them, and I don't think it's really useful to go over every one here. [00:20:55] You can do very simple things: an inner product, or similarity, is what you would use for cross-modal purposes, where you want to embed things in the same vector space. You can put fancier projections on top, or take various linear combinations; you can do multiplicative things, where you multiply the components elementwise, or apply some sort of gating over the features; you can do attention; you can do bilinear things, or very fancy compact bilinear things. [00:21:26] There's really a wealth of literature on all the different ways you can combine two vectors, and this is called multimodal fusion.
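A few of the operators just listed, side by side in numpy (dimensions are arbitrary, and the gate uses a random matrix standing in for a learned one):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=32), rng.normal(size=32)  # one feature vector per modality

# similarity: the cross-modal option, if u and v share a vector space
sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

concat = np.concatenate([u, v])   # simplest linear-style combination
had = u * v                       # multiplicative, elementwise

# gating: let one modality decide how much of the other to pass through
Wg = rng.normal(size=(32, 32), scale=0.1)  # stand-in for a learned gate matrix
gate = 1 / (1 + np.exp(-(Wg @ u)))
gated = gate * v

# bilinear: every pairwise interaction between components of u and v;
# this blows up to 32*32 = 1024 dims, hence the "compact" bilinear variants
bilinear = np.outer(u, v).ravel()

print(concat.shape, had.shape, gated.shape, bilinear.shape)
```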
[00:21:37] Most of the literature on multimodality is essentially about this one question: what is the best way to do fusion? That's it. [00:21:44] Within that discussion it's maybe useful to distinguish between different levels of fusion. You can fuse very early: you get the different features and then, in the modern sense of attention, you attend to everything in all of the features from the beginning. You can first treat the modalities separately and combine them in the middle. Or you can keep them completely separate and only combine the final scores. [00:22:09] The first is what we'd call early fusion; my own term for the middle option is middle fusion; and then there's late fusion, where you really just combine the scores or the logits, with no interaction between the information from the different modalities.
[00:22:26] You can do really fun stuff with multimodal fusion. This is a paper I really like, FiLM, where you have these feature maps, this F here, and each one gets modulated by a multiplicative factor, this gamma, and an additive bias vector, this beta. You have a different pair for every layer of a ResNet, conditioned on some encoding of the thing you're after, in this case "are there more cubes than yellow things?". So we have a vector representation of that question, and we use it to modulate the ResNet blocks at every layer of the convnet. [00:23:08] You can do really fun things this way, modulating one network with the other and trying to have them learn as much as possible from each other.
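The modulation itself is just a per-channel scale and shift, gamma * F + beta; a minimal sketch, with a random linear map standing in for the real network that predicts gamma and beta from the question encoding:

```python
import numpy as np

def film(feature_maps, question_emb, Wg, Wb):
    """FiLM: scale and shift each channel of a conv feature map, conditioned
    on an encoding of the question (one gamma and one beta per channel)."""
    gamma = question_emb @ Wg  # (channels,)
    beta = question_emb @ Wb   # (channels,)
    # broadcast over the spatial dimensions of (channels, H, W) feature maps
    return gamma[:, None, None] * feature_maps + beta[:, None, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 7, 7))  # one ResNet block's output: 16 channels, 7x7
q = rng.normal(size=32)              # encoding of "are there more cubes than yellow things?"
Wg, Wb = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
out = film(feats, q, Wg, Wb)
print(out.shape)  # (16, 7, 7): same shape as the input, but question-modulated
```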
[00:23:20] All right, let's talk about late fusion then. Late fusion is what we would now call contrastive models. The basic idea is that we have a similarity score: we process the modalities completely independently, and only at the very end do we do some combination. The most famous instance of that these days is CLIP. Who's heard of CLIP? [00:23:43] OK. So CLIP, from OpenAI, uses again exactly the same contrastive loss that we've seen in all these early approaches. It does a kind of negative sampling, but in-batch: within a batch, the first piece of text and the first image are aligned, so that's the right answer, and I just want to make sure I rank that image higher than all the alternative images for that text, and rank that text higher than all the alternative texts for that image.
[00:24:16] It's a very, very simple idea; there's really nothing special about the architecture that was invented here. What made it so cool was, first of all, that it was Transformers all the way: the text encoder is a Transformer, and the image encoder is a ViT, so also a Transformer. [00:24:35] And it was trained on lots and lots of web data. Alec Radford is really a genius at creating very high-quality datasets; he created, I think, 300 million image-text pairs for this dataset, trained a bigger model on it than people used to, and we got this amazing model out of it.
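The in-batch contrastive loss described here amounts to a symmetric cross-entropy over a text-image similarity matrix; a numpy sketch (the temperature value and dimensions are illustrative):

```python
import numpy as np

def softmax_xent_diag(logits):
    """Cross-entropy where the correct class for row i is column i."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """In-batch contrastive loss: pair i is the only positive in row/column i,
    and every other item in the batch serves as a negative."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (batch, batch) similarity matrix
    # symmetric: rank the right image per text, and the right text per image
    return (softmax_xent_diag(logits) + softmax_xent_diag(logits.T)) / 2

rng = np.random.default_rng(0)
texts = rng.normal(size=(8, 64))
# perfectly aligned pairs give a near-zero loss; unrelated pairs do not
print(clip_style_loss(texts, texts) < clip_style_loss(texts, rng.normal(size=(8, 64))))
```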
[00:24:59] CLIP also moves away from single-word labels to the sort of text you would actually see on the internet: the caption for an image on the web is not going to say "dog" or "cat", it's going to say "a photo of a cat doing something". [00:25:12] That means you can do zero-shot label prediction, where you take a prompt like "a photo of ..." and figure out the right label for a given image. You probably all know about prompting large language models; you can prompt vision-and-language models in very much the same way and do zero-shot generalization. [00:25:37] If you want a really, really good paper, I recommend you read this one; it's going to teach you how to write really good papers. It's thorough, and it's really worth a very close read if you're interested in this area.
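Zero-shot labeling as described above amounts to embedding one prompt per candidate label and picking the most similar one; a schematic sketch in which random vectors stand in for the trained text and image encoders:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["dog", "cat", "plane"]
# stand-ins for the trained encoders: each label's prompt embedding is just a
# fixed random vector, and the "image" already lives in the same shared space
prompt_emb = {lab: rng.normal(size=64) for lab in labels}

def zero_shot_classify(image_emb, prompt_emb):
    """Pick the label whose prompt (e.g. "a photo of a {label}") embeds
    closest to the image, by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(prompt_emb, key=lambda lab: cos(image_emb, prompt_emb[lab]))

# an image whose embedding happens to sit right next to the "cat" prompt
image = prompt_emb["cat"] + 0.1 * rng.normal(size=64)
print(zero_shot_classify(image, prompt_emb))  # cat
```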
[00:26:01] But what really made it special was that it generalized much better to these other datasets. So this ResNet thing here is pretty terrible at some of these kind of adversarial versions of ImageNet, and CLIP is super robust to that, so it's just a way better image encoder in general. Very, very quickly after CLIP there was this paper from Google called ALIGN, which was basically exactly the same idea. You know, the field is not really that creative at all: it's the same idea, but then you just keep throwing more data and more compute at it, and it often works much better. That's what they found here too, and 1.8 billion image-text pairs instead of 300 million gives you a better model. Surprise. But so it's still very cool, and what is really cool, I think, is that there's this organization called LAION,
[00:26:55] where they've started this open-source collective to create really high-quality datasets. And so the initial LAION dataset was, how many examples in the initial LAION? 400 million, right. He knows; I know that he knows. And so now there's a much bigger version of LAION that's even multilingual, and it has five billion examples. So Stable Diffusion was trained on sort of the English subset of this thing, and that's one of the reasons that it's so awesome: it's just seen a ton of data, and that really makes your system a lot better. So if you're looking for, like, the ultimate dataset to play around with with your own ideas, if you have enough compute obviously, then you should really look at this dataset. All right, any questions up until this point? Nope, all right.
[00:27:54] So then we'll move on from late fusion to kind of middle fusion and early fusion, and this really is kind of the core of what I think a lot of people in the field are doing right now. If you're interested in getting into this field, or if you're going to go into industry and you're going to be using this stuff, this is what you should really understand. And again, the ideas sort of stack onto each other, so I've kind of sequenced the slides to give you an idea of how the scientists came up with the next step, and you can really see the architecture just get slightly more and more advanced, but basically a lot of it is just more data and more compute, again. So, who knows how BERT works? Everybody should raise their hands. So, yeah, BERT is kind of so canonical, I think everybody kind of gets how BERT works, right, so I don't think
[00:28:53] we need a real refresher. But the reason I have this slide is because I want you to think about: if you have a BERT model and you have a bunch of images, how are you going to turn that BERT model into something multimodal? Right, so there are a bunch of, like, obvious things you could do, given the kind of features I told you about and the sort of fusion process. So how are you going to do that? Does anybody want to, like, say something? [Student] Like, if you're doing classification, then just concatenate it to whatever encoder, like maybe an ANN or whatever, you're training on the data. Concatenating, okay, exactly, yeah. So you can take the ConvNet features and the classifier token from BERT, concatenate them, and then classify, for, like, a cat detector or something like that, or whatever the thing is you're interested in. Yeah.
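That suggestion, concatenate pooled ConvNet features with BERT's classifier token and classify on top, can be sketched like this; the shapes and the random "classifier" are illustrative stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real encoder outputs:
cls_token = rng.standard_normal(768)    # BERT [CLS] embedding of the text
conv_feat = rng.standard_normal(2048)   # pooled ConvNet features of the image

# Late fusion: concatenate the two modalities, then classify.
fused = np.concatenate([cls_token, conv_feat])       # shape (2816,)
W = rng.standard_normal((2, fused.shape[0])) * 0.01  # untrained 2-way classifier head
logits = W @ fused
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                          # softmax over the 2 classes
```

In practice `W` would be trained end-to-end on the task's labels; the fusion itself is just this concatenation.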
[00:29:52] So that's one thing. You could also, like, take the ConvNet features and, like, give them to the BERT model in lots of different ways, right? We can use the region features. And I think a lot of people who were working in vision and language processing when BERT came out were thinking exactly about, okay, so do we do middle fusion, late fusion, do we do early fusion, how do we do the fusion? And so there were a lot of papers all coming out basically at around the same time where people were doing versions of this, because BERT was really kind of the innovation, and then everybody sort of just plugged it into their own thing, because of Hugging Face Transformers and things like that. So the first thing is VisualBERT. This was one of the very early ones, where you have this image and people would do object detection on it, so you get, like, a hat and a racket and a shirt
[00:30:43] and things like that. So you can just really take these features and then plug them into your Transformer model, and then you try to, like, recover the features. And so this really is probably, like, the simplest way to do it, right? So this is what we call a single-stream architecture, where you concatenate all of the original input features and then put them through the same Transformer. What you can also do, and that's something that this model called ViLBERT did, is have two different streams. So you essentially have these two parallel Transformers, but at every layer you kind of give them a cross-attention, right, or co-attention as they call it. But it's basically, like, you just make sure you have an attention map that spans both, and then you just do your full normal Transformer layer again.
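The co-attention exchange can be sketched as two plain attention calls, one in each direction; this is a bare single-head version, leaving out the learned projections, residuals, and feed-forward parts of a real ViLBERT layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, image):
    # Each stream queries the other stream's keys/values, so the
    # attention map "spans both" modalities.
    d = text.shape[-1]
    t2i = softmax(text @ image.T / np.sqrt(d)) @ image   # text attends to image
    i2t = softmax(image @ text.T / np.sqrt(d)) @ text    # image attends to text
    return t2i, i2t

rng = np.random.default_rng(0)
text_tokens   = rng.standard_normal((5, 32))   # 5 text-token vectors
image_regions = rng.standard_normal((7, 32))   # 7 region-feature vectors
t_out, i_out = co_attention(text_tokens, image_regions)
```

Each stream keeps its own length but is now a mixture of the other stream's features, which is exactly what lets the two parallel Transformers talk to each other.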
[00:31:33] And then this you can train just like your regular BERT, right? So you have your masked language model here, and here you do sort of some equivalent of that, and then you also have your next-sentence prediction, which you probably remember from your BERT lecture, but instead here we're saying, okay, is this image aligned with this piece of text or not? There's also LXMERT. I mean, I could go on forever; there are like a hundred papers that came out that did this all at the same time. So LXMERT had a different cross-modal output encoder, and a bunch of different ways of encoding the positional information. So you could say, okay, I just have a bunch of bounding boxes that are featurized, but I don't care about where they are in the image, so it's just kind of like a bag of bounding boxes. Or you could say,
[00:32:22] I found it here, like this is the particular top-left and bottom-right coordinate, and that's what you featurize into your network. You can also do something even dumber, and I can say that because this is my paper, where you just take the image itself, you put it through a ResNet, and then you do a little bit of pooling on the final feature maps, and you just give those feature maps to BERT. And so you then need to distinguish between, like, your text segment embeddings and your vision segment embeddings. But so this actually works surprisingly well. You don't have to do any additional training: you can just take BERT out of the box. Initially you freeze it, you learn to project into BERT token space, then you unfreeze your ResNet, and then finally you unfreeze your BERT, and now you have a very good multimodal classifier on the problem you care about.
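A rough sketch of that pool-and-project step: the grid size, token count, and untrained projection below are all illustrative stand-ins (the real model learns the projection and then goes through the staged unfreezing just described):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ResNet output: a 2048-channel feature map on a 7x7 grid.
feature_map = rng.standard_normal((2048, 7, 7))
positions = feature_map.reshape(2048, -1).T   # (49, 2048): one vector per grid cell

# Pool the 49 cells down to 3 "visual tokens" by averaging contiguous chunks
# (a stand-in for the pooling; the exact scheme is a detail of the paper).
chunks = np.array_split(positions, 3)
visual_tokens = np.stack([c.mean(axis=0) for c in chunks])   # (3, 2048)

# Project into BERT's 768-dim token space; here the projection is random,
# in the real setup it is the first (and initially only) thing you train.
W_proj = rng.standard_normal((2048, 768)) * 0.01
bert_input_tokens = visual_tokens @ W_proj                   # (3, 768)
```

These three projected vectors are then fed to BERT alongside the text tokens, distinguished only by a vision segment embedding.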
[00:33:14] So a lot of these other papers, they're doing what they call multimodal pre-training, where first you have a BERT model and a ResNet, so they're kind of unimodally pre-trained, and then you couple them together, and then you have a multimodal sort of intermediate pre-training step before you fine-tune it on the problem you care about. And what we showed here is that you don't really need that, actually, in many cases, so it's a very strong baseline. You can also go to the pixel level completely, so that's what they did in this other paper called PixelBERT, where it's basically exactly MMBT, so the previous supervised one, but here they do do the multimodal pre-training step and show that, I think for VQA, it helps a little bit. So there are many of these BERTs doing sort of visual things.
[00:34:07] People really tried everything. Here's another one called UNITER, where they added a bunch of different losses. We could really talk about this for a very long time; we're not going to do that, I'm just going to kind of talk you through some of the more interesting ones. So this one, ViLT, I think is quite interesting, because here this is really the first instance where we've completely gone away from ConvNet features. So we don't do any pre-processing on the image: no region features, no backbone that featurizes the parts of the image we care about. We just have these patches of the image, so really we flatten those patches and we just pump them into the Transformer straight away. So this really is, like, sort of BERT and ViT together in one model, and this worked really very well. So that's been the trend.
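The patch step ViLT relies on can be sketched directly: cut the image into non-overlapping patches and flatten each one into a vector (a learned linear projection and the text tokens would follow in the real model; the sizes here are the usual ViT defaults, used as an assumption):

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Flatten non-overlapping patches into token vectors (ViT/ViLT-style).
    `image` is (H, W, C); returns (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    return patches

img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img)   # 14*14 = 196 patch tokens of dim 16*16*3 = 768
```

No detector, no ResNet: the Transformer sees the flattened pixels of each patch directly, which is what makes the pipeline so simple.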
[00:34:54] So here's a nice, very long list of all of these different models and what they do. And so really the distinctions are just in: what is the text encoder that you use, so do you use BERT or something fancier or better, like RoBERTa; what is your vision encoder, so in many cases you have these region features, so you would do an R-CNN-style thing, or you could just do a ResNet or a ViT; you have different kinds of fusion, so either single-stream or dual-stream as we talked about, right, so VisualBERT or ViLBERT; different pre-training tasks, so masked language modeling, image-text matching, and a bunch of, like, funkier ones you can do; and then finally you can do multimodal pre-training on all of these different datasets that have aligned data. So you're probably wondering, okay, so what is really the interesting difference between a lot of these?
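The design axes in that list can be summarized as a small config record; the field names and example values here are just a reading aid, an assumption of mine rather than any library's API:

```python
from dataclasses import dataclass

@dataclass
class VLModelConfig:
    text_encoder: str         # e.g. "bert", "roberta"
    vision_encoder: str       # e.g. "region-rcnn", "resnet", "vit-patches"
    fusion: str               # "single-stream" or "dual-stream"
    pretraining_tasks: tuple  # e.g. ("mlm", "itm")

# Rough placements of three of the models discussed above:
visual_bert = VLModelConfig("bert", "region-rcnn", "single-stream", ("mlm", "itm"))
vilbert     = VLModelConfig("bert", "region-rcnn", "dual-stream",   ("mlm", "itm"))
vilt        = VLModelConfig("bert", "vit-patches", "single-stream", ("mlm", "itm"))
```

Seen this way, most of the hundred-odd papers differ only in which cell of this grid they occupy, which is exactly the point made next.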
[00:35:49] And so I have another recommended paper that, if you're interested in this space, you should really take a look at. It's also a really well-done paper, where they unmask multimodal pre-training. So basically they say: if you take all of these little model inventions and you train these different models on exactly the same data in exactly the same way, it turns out that they're all basically the same. So that's a lot of, you know, wasted effort on the part of the field, because everybody is saying, like, oh, my model is better, but it's actually just because you trained it on different data, and there's no real sort of model innovation going on in a lot of these things. So I don't mean to sound discouraging or anything like that, but, you know, I think that's why this paper is really nice and really important: it just shows us what really matters.
[00:36:42] So this is also work that I did myself, called FLAVA, with my team, where we wanted to take these ideas really to the limit. So a lot of the things that you've seen now, the VisualBERTs and the ViLBERTs and things like that, they're all about multimodal questions: how can we do visual question answering, something like that, where we just have these two modalities, and we only care about problems that always involve these two modalities. And where we want to go, and this is kind of the basic premise, I think, of foundation models in general, is that we have one model to rule them all, right? So this one model can consume data from all of these different modalities, and you can synthesize across all of these different modalities and then do useful things with that information. So with FLAVA that's exactly what we tried to
[00:37:30] build. So we wanted to have one foundation model that is good at vision-and-language, and computer vision, and natural language processing, jointly pre-trained on all of these different data sources. So it's also trained on just text, CC News, so CommonCrawl news, and BookCorpus, so it's very good at the sort of things you would expect BERT to be good at. It's trained on ImageNet for image data, so it's good at the things that you would expect a kind of basic image model to be good at. And then you have this PMD dataset that we created out of publicly available image-text pairs that we also train it on. So this PMD dataset is really just: take all the datasets that were ever created that have image-text pairs and that are publicly available. So unfortunately the CLIP data and the Google ALIGN data and all of those datasets, they haven't been open-sourced. This is before LAION, so now
[00:38:16] there's a good alternative to this. But so this PMD dataset, if you combine all of these image-text pairs, you get 70 million of them, so that's still a pretty decent size. And then you can take all of this data basically to solve all of these problems that we know we care about in these different fields: you can do multimodal reasoning, you can do language understanding, you can do visual recognition, all with exactly the same model, and that's a very powerful idea, I think. Like, if you work at a company like Facebook, you don't want to have different models for all kinds of different things; you want to have one model that you can really use for everything, and that's going to really make your life a lot easier. So the exact architecture here is that on the one hand we have this image encoder, where we take the image, we encode it as patches, and we just do what
[00:39:03] we call masked image modeling, but it's basically masked language modeling, just on the image tokens, right? And then on the other side we have the masked language modeling on the language, so your regular sort of BERT thing, and then we have a multimodal part where all of this information gets combined. So we have a masked multimodal modeling loss term, where you can also do image-text matching, so this is like your BERT next-sentence-prediction thing, and then we also have a global contrastive loss, which is exactly like CLIP. So if you do all of this stuff, it's just all Transformers all the way down, and it's sort of a very elegant way, I think, to combine a lot of this information. And when you do that, you get something that can really do a lot of things very well.
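The global contrastive piece, the CLIP-like loss in that mix, can be sketched as a symmetric cross-entropy over a batch of image-text pairs; the embeddings and temperature below are toy values, not FLAVA's actual setup:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """CLIP-style contrastive loss over a batch of aligned pairs:
    matched (image i, text i) pairs should outscore all mismatched pairs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature               # (B, B) similarity matrix
    n = len(logits)
    diag = np.arange(n)
    p_i2t = softmax(logits, axis=1)[diag, diag]      # image -> matching text
    p_t2i = softmax(logits, axis=0)[diag, diag]      # text -> matching image
    return -(np.log(p_i2t).mean() + np.log(p_t2i).mean()) / 2

rng = np.random.default_rng(0)
B, d = 4, 32
txt = rng.standard_normal((B, d))
# Aligned batch (images nearly equal to their texts) vs. a shuffled, unrelated batch:
aligned_loss = contrastive_loss(txt + 0.01 * rng.standard_normal((B, d)), txt)
shuffled_loss = contrastive_loss(rng.standard_normal((B, d)), txt)
```

The loss is near zero when each image's best match in the batch is its own caption, and grows when it is not, which is what pushes the two encoders into a shared space.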
[00:39:54] So we're not going to talk about that table, it's just way too many numbers, but just trust me, we were pretty thorough generating the table here. And so over 35 different tasks, if you compare FLAVA to all kinds of different ablations in terms of CLIP models, then this is just a much better way to get to this information. So I think this is a nice example of where we're probably going to go with the field in the near future. So the other trend that we see very obviously in the field right now is that everybody cares about generative models, right? So, you know, language models and image generative models: there's just a trend where we want to be generative, we want to move away from this contrastive, discriminative stuff to the more interesting, maybe richer representations that you get out of generating sequences or images. So this SimVLM paper
was one of the first ones where they really had this separate decoder that was trying to generate, or complete, captions, which they showed gives you a lot richer representations. I think the current state of the art now is actually called CoCa. A lot of these models all look very similar again, but in this case we're now starting to really see these text decoders. Initially with CLIP I think that's also what they were trying to go for, OpenAI being a company that really likes generative models, but they couldn't really get it to work, and I think it took us a while as a field to really figure out how to do this the right way.
[00:41:20] And so right now we're really in the age of language models, right? One of the interesting things you can do with language models is to just keep them frozen and then learn how to project into them. So the MMBT architecture I talked about, where we had this BERT model: we kept it frozen and we learned to project into the BERT token space. You can do exactly the same thing with a much fancier model, or something like T5 even, where you have an encoder-decoder or some kind of generative part. You keep that thing frozen, and then you learn to project into the token space of that frozen language model, and then you can do lots of fun stuff, it turns out. [00:42:05] What they show in this paper is that you then get few-shot learners: all of the things you see with GPT-3, where you can just give it some in-context examples and it's going to figure out the binding on the fly. So it says something like "this is a dax, this is a blicket, so what is this?", and then it gives you the answer: that it's the dax.
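The recipe described here, keeping the language model frozen and learning only a projection from image features into its token-embedding space, might look roughly like this in NumPy; every name and dimension is a made-up toy stand-in for the real frozen vision encoder and frozen transformer LM:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, n_prefix, d_model = 512, 2, 64   # toy sizes, not from any real model

# The ONLY trainable parameter: a projection into the LM's embedding space.
W_proj = rng.normal(size=(d_img, n_prefix * d_model)) * 0.02

img_feat = rng.normal(size=(d_img,))    # stand-in for a frozen vision encoder's output

# Map the image feature to a short "prefix" of pseudo-token embeddings.
prefix = (img_feat @ W_proj).reshape(n_prefix, d_model)

# Stand-in for the frozen LM's embeddings of the text prompt tokens.
text_tokens = rng.normal(size=(5, d_model))

# The frozen LM consumes the image prefix followed by the text, as if the
# image were just extra tokens; only W_proj would receive gradient updates.
lm_input = np.concatenate([prefix, text_tokens], axis=0)
```

Because only the projection is trained, the LM's pretrained abilities, including in-context learning, are preserved, which is what makes the few-shot behavior possible.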
[00:42:26] So it really learns in context how you decide the feature mappings, which is really solving the grounding problem that a lot of this multimodal stuff started with. So I think that's very cool.
[00:42:38] And then probably one of the coolest models right now, that you might have heard of if you follow the field, is Flamingo, out of DeepMind, where they take a Chinchilla language model, so a compute-optimal language model, and now you have this vision encoder that encodes multiple different images that you can then do reasoning over and then kind of autocomplete. What this gets you is just a much more powerful model, because you can be generative over lots of different images. So it's really stepwise, you can see it, right: we started off with very simple Transformers, and now we're at something that is starting to get pretty complicated, because we have these building blocks like a Perceiver Resampler, where we have a bunch of different images that we featurize, and now we need to compress the information. Sometimes we have three images, sometimes we have five, so we want to make sure we can compress it so that it's always ready for consumption by the next layer of the language model.
[00:43:39] And then this paper, again, is a really good paper to read, because (this is not me, this is not my code, this comes from the actual paper) they have the diagram together with the code, so that you can really understand what it's doing, which I think is really great. [00:43:56] And so once you have your Perceiver Resampler step, what you then do is a gated cross-attention. This is how you implement it.
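The gated cross-attention idea can be sketched as a single-head toy in NumPy (the shapes, the single head, and the scalar gate are simplifying assumptions, not Flamingo's exact implementation): a tanh gate initialized at zero means the block initially passes the language states through unchanged, so training starts from the intact frozen LM.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(x, visual, Wq, Wk, Wv, alpha):
    """Language states x attend to visual tokens; tanh(alpha) gates how much
    visual signal is added. With alpha = 0 the block is the identity."""
    q, k, v = x @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return x + np.tanh(alpha) * attn            # gated residual connection

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))        # 4 text hidden states
visual = rng.normal(size=(6, d))   # 6 resampled visual tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out_init = gated_cross_attention(x, visual, Wq, Wk, Wv, alpha=0.0)
out_open = gated_cross_attention(x, visual, Wq, Wk, Wv, alpha=1.0)
```

As the gate parameter is learned, the model gradually lets visual information modulate the frozen language model's states.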
[00:44:06] And this gated cross-attention, you do it before your frozen language-model layer. So you really just have a frozen Chinchilla language model, and you learn to modulate the information that goes into that language model. You propagate the gradients all the way back; you just don't update the language model. So you're really trying to figure out: how am I going to design my signal so that my language model can do the most with it, right? How am I going to combine the information? You'll notice that now we do it before the layer; in a lot of other work you would do the attention after the layer, but here you do it before.
[00:44:41] So Karpathy, I think more than ten years ago, had this image: it's Barack Obama kind of setting his foot on the scale to make somebody think they're
a lot heavier than they really are. [00:44:54] So this is obviously funny to us, but not to an AI system, I think, unless it really understands the scene. And that's why Karpathy at the time said this would be a really good visual Turing test: if a system can figure this out, then it's actually really smart. And so it's obviously been a bit of a challenge for everybody working in the field since then to get something that actually works on this, and Flamingo, as it turns out, kind of gets the joke. [00:45:21] Though it's a bit unclear whether it really gets the joke, because if you read this conversation, it's sort of getting steered in the right direction, right? But at least we're making progress, let's put it that way.
[00:45:33] And then, so in Flamingo you still have a lot of moving parts, but you can really take this almost to the full extreme, where you
try to freeze almost everything, and you just want to learn this mapping between your image encoder and your language model, or your image encoder and your encoder-decoder architecture, and all you really do is just the projection between the two. So there's this nice model called BLIP-2, where they experiment with OPT for the language model and Flan-T5 for the encoder-decoder architecture, and this just gives you amazing results. It gives you really complex captions and things like that, without any direct supervision on the captions themselves, which is pretty impressive, I think. So that just shows you the power of language models in general.
[00:46:19] So here are some examples: it can really do different things, from captioning to reasoning to visual question answering to location detection. You can have a long conversation with this system.
[00:46:33] This really is kind of the future of where we're going, right: we're going to have a ChatGPT, but it's also going to be able to see the world, in a way. [00:46:43] And I think an interesting thing: you've probably heard of Chain-of-Thought prompting, where you ask the language model "let's think step by step." You can tell a vision-and-language model to generate a rationale for why something might be the case, so you generate a potential explanation for what your answer might be, and after that you ask it to answer the question. It turns out that if you do that sort of multimodal Chain-of-Thought prompting, the system gets much better, and this was the new state of the art on ScienceQA, a benchmark like that, just because it learns to unpack the information. [00:47:20] I think, as a field, we're really just starting to figure out what the potential of this is. I think this is the paper where they also showed that multimodal Chain-of-Thought prompting gets you pretty amazing results: they show very nice results on Raven's matrices, very complicated kinds of IQ tests, the things that humans are supposed to be really good at; you have to be a pretty smart human to be good at this, and this system just nails it.
[00:47:48] So you know, we're making super fast progress. We started off from a very simple BERT model that was able to look at some pictures, and now we're getting to these very sophisticated foundation models. So that was my short history of multimodal foundation models.
[00:48:06] So, how much time do I have left? All right, okay, plenty of time. Yeah, please, questions.
[00:48:24] [Student] ...one of the images, they just looked like they were
[00:48:27] boxes passed through, with kind of no sense of shape in them? Yeah, so I think the history of computer vision has been very similar to the history of natural language processing, where we thought we needed all of this structure and all of these different things, and it turns out you can just throw it all away and just have a big Transformer over the patches.
[00:48:51] Sorry, yes? [Laughter]
[00:48:59] [Student] You mentioned a couple of times that a model is frozen; what does that mean? Yeah, sorry, I should have explained that better. It just means that we are not updating the weights. So, I think this era is a nice example: we have frozen self-attention, and that just means that when we do a forward pass, we go all the way to whatever we want to predict, we get some gradients, and we take them all the way down, but we only update the non-frozen layers.
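That forward/backward description can be made concrete with a scalar toy model (the numbers are entirely hypothetical): the gradient for the frozen weight is still computed, because the chain rule has to pass through it, but only the non-frozen parameter is updated.

```python
# Toy two-weight "network": y = b * (a * x), where `a` plays the role of a
# frozen pretrained layer and `b` a trainable one.
a, b = 2.0, 3.0
x, target, lr = 1.0, 0.0, 0.1

y = b * (a * x)                 # forward pass
loss = (y - target) ** 2
grad_y = 2 * (y - target)
grad_b = grad_y * (a * x)       # gradient reaches b *through* frozen a
grad_a = grad_y * b * x         # computed during backprop, but never applied

b = b - lr * grad_b             # only the non-frozen weight changes
# a stays exactly 2.0: frozen
```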
[00:49:29] Right, so here the gradients actually do get computed, but these layers just never change. The reason you want to do that is that otherwise you're going to drift way too far, and then you're going to destroy all of the cool stuff your language model has learned, because you're just going to focus on the small dataset that you're training it on. So you want to preserve the abilities of the language model, but you want it to become good at the thing you care about.
[00:49:58] Other questions?
[00:50:02] [Student] Is there a benefit to doing earlier or mid fusion, as opposed to only doing late fusion? Yeah, so we're going to talk about evaluation next, but it really depends on the tasks that you care about. I would say earlier is always better if you can afford it. CLIP is very efficient to train; it's very late fusion,
right at the very end, so there's no interaction between the different modalities. That's really good if you want to be very efficient; for training it's much nicer. But if you want a richer understanding of the multimodal signal, then you want to do earlier fusion. So yeah, it's always a trade-off, right?
[00:50:50] [Student] Images are just a lot more data than text, so how much more difficult are these to train, and how much bigger does the image-processing part have to be compared to the language model? [00:51:05] Yeah, so images are more complex in a way, but they're also higher-bandwidth representations, right? There are a lot of pixels that our brains just abstract away; it's really about the scene that you're seeing, and you're not really thinking too much about the
pixels themselves. [00:51:26] So Yann LeCun likes to say that language is just a kind of low-bandwidth proxy for a language of thought, which is much richer and much higher-bandwidth, and he thinks that's probably visual; I'm not so sure. But I don't think there's necessarily a difference between the scaling laws that you see in these systems, or at least we still have to figure that out; we'll talk about that towards the end as well.
[00:52:00] [Student question about bias, partly inaudible] Oh yeah, they have terrible biases, yeah. [00:52:09] Some people are actually working on this who are in this very room, but these models can be very racist in what they generate or in the kind of predictions they make. If you have an Asian basketball player standing sort of like this, with a basketball
very obviously there, then the model will think that he's playing ping pong, because he's Asian. I'm not joking. [00:52:34] So these models, just like all neural networks, right, this is really a big problem, and one of the most interesting problems you should be working on, if you're a student and you want to make a difference, is how we get these systems to be much better at these sorts of things.
[00:52:54] [Student] In the examples you showed, the model interprets the content of the image; if we really want to understand the content of a video, what challenges do you see, and what improvements could be made? [00:53:08] Yeah, so you're asking about the attention mask, sort of, right? You can use the same idea for videos: you just look at the video, and these systems are so good now, the object detectors are so good, that you can really track objects in close to real time as
they go through your video, and so you can try to check how that aligns with the attention mask in your model. [00:53:31] Videos, I think, are sort of interesting, but they're also not really that interesting, because you can very often just sub-sample images and solve the images rather than having to deal with the complex video.
[00:53:49] All right, maybe one more question, and then we'll go do some evaluation. [00:53:52] [Student] With these multimodal models, when you only provide a single source of media, let's say only text or only vision, how do they perform in that case? Because they're obviously more geared toward multimodal cases. [00:54:05] Yeah, so that's one of the giant shortcomings of a lot of these models: they're really just built for multimodal stuff. So what if I don't have an image, right?
[00:54:17] And so, I mean, that's why we did FLAVA, because we want to have one model that can do all of that stuff, and that's why in MMBT, the supervised multimodal bitransformer, we actually have an analysis of how robust the model is to missing images or missing text. But I think a lot of the folks working on these early visual BERT models were kind of myopically focused on VQA, which is actually a great segue to what I want to talk about next. [00:54:48] It really depends on the tasks that you care about, as I said, and I think if I'm going to tell you about multimodality, I also have to tell you how you're going to check that a multimodal system is actually good at multimodal things. That's the topic of evaluation, which is actually a super important topic; a lot of people want to be
cool and build big models, but I think it should be way cooler to do proper evaluation of these models, especially if you're in academia, because you only have limited GPUs anyway, right? So what can you do? [00:55:22] Sorry, I don't want to rub it in. No.
[00:55:26] So how do you check? Well, there's this amazing project. ImageNet really changed the history of deep learning, I think, and this other dataset, COCO, I think also really changed especially vision-and-language, but also vision in general. They cover a bunch of the main sort of multimodal tasks, and these images are very richly annotated with all kinds of different things: the segmentation of the objects, the bounding boxes, the labels of the bounding boxes; they come at different pixel granularities; it's a huge dataset, and it's very fine-grained,
[00:56:09] annotated in terms of the categories that it has, and then you have five captions for each of the images. This really was the first dataset that unlocked vision and language processing at scale, because you had your picture and you had your caption, and now you need to figure out: how do I give the right caption for this image? That's image captioning. Or: given some piece of text, can I retrieve the right image, or given an image, the right piece of text? There's a bunch of very impactful datasets that do this; we already talked about LAION, but COCO really is still the main one, the canonical instance of this dataset category that a lot of people use. And then the other thing that people really care about in vision and language processing is visual question answering.
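To make the retrieval setup concrete: caption-to-image retrieval on COCO is usually scored with Recall@K, i.e. for each caption, does the matching image land in the top K retrieved? A minimal sketch, where the similarity matrix is made up and would in practice come from a trained model:

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of caption i to image j; the matching
    image for caption i is image i. Returns the fraction of captions whose
    matching image appears among the top-k retrieved images."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank images for caption i by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)

# Toy 3x3 similarity matrix: captions 0 and 2 rank their own image first;
# caption 1 ranks its image second, so it only counts from k=2 onward.
sim = [[0.9, 0.2, 0.1],
       [0.8, 0.6, 0.3],
       [0.1, 0.4, 0.7]]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

Image-to-text retrieval is the same computation with the matrix transposed.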
[00:56:59] There really are a bunch of academic groups who are, or have been, so focused on this task that they didn't really care about anything else, and that's why you see a lot of models that are optimized just for VQA and nothing else. You can see that reflected in the citation counts, as of last night 3 a.m., where the VQA dataset just has way more citations than even the image captioning datasets. What you do here is you just have an image, and then people ask very simple questions: annotators ask these simple questions, they give the answers, and now we want to be able to answer the questions with machines. And as I alluded to earlier, one of the kind of embarrassing backstories of this dataset is that in the initial version the images were found to not really matter at all.
[00:57:52] You could just look at the question. It could be something like "how many slices of pizza are there?"; well, not in that particular case, but across almost all of the dataset the right answer to how-much or how-many questions was "two". So if you just predicted "two" for every how-much or how-many question, you got around 70% accuracy on the counting category. So careful dataset design, or evaluation benchmark design, is really a skill, and you really need to think about what you're doing: you can't just set some data aside and evaluate on it. There's GQA, by Chris actually, which I think is a better-designed version of this dataset, so you might want to use that these days. There are also very targeted datasets that really try to measure one particular thing.
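That "always answer two" shortcut is exactly what a blind majority-class baseline exploits. A sketch with invented toy data (not the real VQA evaluation code), keying each question on its first two words as a crude question type:

```python
from collections import Counter

def majority_baseline(train_qas, test_qas):
    """Predict, for each test question, the most frequent training answer
    for its crude question 'type' (here: the first two words)."""
    by_type = {}
    for q, a in train_qas:
        key = " ".join(q.lower().split()[:2])
        by_type.setdefault(key, Counter())[a] += 1
    correct = 0
    for q, a in test_qas:
        key = " ".join(q.lower().split()[:2])
        pred = by_type[key].most_common(1)[0][0] if key in by_type else None
        correct += pred == a
    return correct / len(test_qas)

# Invented toy data: the blind baseline answers "2" to every "how many"
# question without ever looking at an image.
train = [("How many slices of pizza are there?", "2"),
         ("How many dogs are in the photo?", "2"),
         ("How many people are smiling?", "3")]
test = [("How many cats are there?", "2"),
        ("How many chairs are there?", "4")]
print(majority_baseline(train, test))
```

If a baseline like this scores highly, the benchmark is not actually measuring multimodal ability.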
[00:58:45] And I think one of the things we really want to get at with these models is what we would call compositionality: we want to be able to take the parts, reason about the whole, and understand the relationships between the different concepts. So CLEVR was a very clever dataset that was designed to measure compositionality both on the language side and on the vision side: you have to understand the relationships between all of the different objects in the images. That's been a pretty impactful dataset, I think, for really forcing people to think about compositionality. But a lot of these datasets had big problems. One of the problems is that they were too easy; VQA is sort of plateauing out, and we can talk about that a little bit too. It also wasn't really realistic: you could solve VQA
[00:59:32] and that's probably going to make some people's lives better. You're all trying to process the memes, I can see everything. Okay, let's get to the memes first then. So obviously these memes are not actually in the dataset; I could put up some really hateful memes, about Hitler or something, which are in the dataset, but that would be less fun. These are mean meme examples meant to demonstrate how the dataset was constructed. One of the problems, as I said, is that in VQA the V didn't really matter. If we care about multimodality specifically, what we want is a dataset that you can only get right if you are good at multimodal reasoning, and otherwise you're just going to screw it up. And so this is what we came up with: if you have a meme like
[01:00:24] this one, "love the way you smell today", I mean, that's not very nice if you send it to your friend, right? But it turns out that if you just swap out the background, now it's a very nice thing to say. And with this one, you know, you're maybe a bit weird if you like this, but there's nothing wrong with it, right? And it's the same for this one here, "look how many people love you", with the tumbleweed: that's really sad, but if you change just one word, suddenly it's a really nice thing to say. So if you want to classify these correctly for meanness, then you have to really understand multimodal reasoning: you have to understand the relationship between the image and the text in order to get to the right label.
[01:01:13] And so it was really constructed by design to do that. How we did it, exactly: we used some really highly trained annotators. One of the big problems with a lot of these datasets is that nobody really knows who owns a meme; somebody makes a meme, and now they technically own the copyright. When I made this dataset I was working at Facebook, and they were very afraid of copyright issues, so what we actually had to do was pay people to make new memes. And not from scratch: we could show them the actual examples, and then they had to find images that roughly corresponded to the original source image and recreate the meme, but now with an image that we could buy from Getty. And so we gave a lot of money to Getty so that we could then
[01:02:09] release the dataset to the public, so that people could actually do research on it and understand whether their multimodal models are good or not. And we really tried to make it so that we had these benign co-founders... benign confounders, sorry; it's a startup world, with the co-founders. So the confounder here is that you have your original meme, and then you have a confounder where you swap out one of the modalities and keep the other one. We had our annotators do that as well. And this led to a really nice dataset, I think, because it confirmed an intuition that a lot of people in the field had, which is that multimodal pretraining doesn't really work. Is that an alarm? So: multimodal pretraining doesn't really work.
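One way to see why benign confounders bite is to score models in pairs: credit a model only when it classifies both the original meme and its confounder correctly. This is an illustrative sketch, not the official Hateful Memes metric (the benchmark reports accuracy and AUROC), and the filenames and labels are invented:

```python
def paired_accuracy(pairs, predict):
    """pairs: ((image, text, label), (image2, text2, label2)) tuples, where
    the second item swaps one modality and flips the label. A model is
    credited only if it classifies BOTH items in a pair correctly, so a
    unimodal shortcut that ignores the swapped modality never scores."""
    both = 0
    for (x1, t1, y1), (x2, t2, y2) in pairs:
        both += (predict(x1, t1) == y1) and (predict(x2, t2) == y2)
    return both / len(pairs)

# Invented example pair: same text, image swapped, label flipped.
pairs = [(("skunk.jpg", "love the way you smell today", "mean"),
          ("roses.jpg", "love the way you smell today", "benign"))]

# A text-only model assigns both items the same label, so it scores 0 here.
text_only = lambda image, text: "mean"
print(paired_accuracy(pairs, text_only))
```

A model that actually uses the image can separate the pair; the text-only shortcut cannot.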
[01:03:01] And so all of this stuff that people had been doing with their fancy visual BERT models actually turned out to maybe not be that useful: it got you maybe one point extra, from VisualBERT to a different visual BERT variant, less than a point, just from the multimodal pretraining. So that means we still have to figure this stuff out. This dataset is far from solved, and we still have a long way to go, despite all of these fancy models and a new paper coming out every week that does something new. We're not there yet, and I think that's encouraging, especially for you: you can go out and solve it. So what we did with this dataset is we organized a competition, with 100K in prize money, to see what people could come up with. And there was a lot of nice work coming out of that,
[01:03:53] and we really managed to crank the numbers up by quite a lot. But the solutions were slightly disappointing. I don't know if you've ever used Kaggle, but if you want to win on Kaggle, you just have to ensemble the hell out of all of the different models in the current state of the art, and then you're very likely to win. And that's what happened here: there wasn't really the fundamental breakthrough we had maybe been hoping for, so that still needs to be built, I think. So there's this other dataset I want to briefly talk about. The theme of this section is: if you make a dataset, think about it very carefully, because you can be very creative with it and really measure the things you're trying to get at.
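The ensembling trick he describes is typically just soft voting: average the class probabilities of several models and take the argmax. A minimal sketch with made-up model outputs:

```python
def soft_vote(prob_lists):
    """prob_lists: one probability distribution over classes per model.
    Averages the distributions and returns the argmax class index."""
    n_models, n_classes = len(prob_lists), len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Made-up outputs from three models on a binary task: model 0 leans toward
# class 0, but the averaged distribution favors class 1.
models = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(soft_vote(models))
```

Averaging washes out the idiosyncratic errors of individual models, which reliably buys leaderboard points without any new idea, which is exactly the disappointment being described.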
[01:04:41] So with this dataset, Winoground, we were trying to figure out: okay, how good is CLIP actually? It looks really amazing, and it's way better than what was there previously, but does it understand compositional relationships in the same way that humans would, or is it just fitting the data distribution, so it can be very good at the head of the distribution but terrible at the tail? And you can probably already guess where this is going. Just to give you an illustration of what is in this dataset: you would have some plants surrounding a lightbulb, or you would have a lightbulb surrounding some plants. Notice that the words here are exactly the same words,
the Visio semantic [01:05:30] good at understanding the Visio semantic or the yeah visual linguistic [01:05:33] or the yeah visual linguistic compositionality uh of these these uh [01:05:36] compositionality uh of these these uh these uh examples then then you can get [01:05:39] these uh examples then then you can get it right but again if it's actually just [01:05:41] it right but again if it's actually just overfitting on the data distribution [01:05:42] overfitting on the data distribution that is seen and it just kind of is [01:05:44] that is seen and it just kind of is biased toward what it sees often then it [01:05:47] biased toward what it sees often then it doesn't really get it right and so one [01:05:49] doesn't really get it right and so one paper uh that we use as a source of [01:05:52] paper uh that we use as a source of inspiration for this work is uh this [01:05:55] inspiration for this work is uh this paper here Order word matters [01:05:57] paper here Order word matters pre-training for little uh so we [01:05:59] pre-training for little uh so we actually found that the order of words [01:06:01] actually found that the order of words doesn't even matter that much for [01:06:03] doesn't even matter that much for General pre-training very often uh which [01:06:06] General pre-training very often uh which is also kind of a scary thing right so [01:06:07] is also kind of a scary thing right so this is deep learning for NLP we think [01:06:09] this is deep learning for NLP we think that you know language is really [01:06:10] that you know language is really important but these models can can [01:06:12] important but these models can can reason about language even if you [01:06:14] reason about language even if you shuffle all the words [01:06:16] shuffle all the words um and so that's that's probably not [01:06:18] um and so that's that's probably not what we want to have and so that that [01:06:20] what we want to have and so that that doesn't tell you 
[01:06:23] something about how great we are as researchers; it tells you something about how terrible our evaluation benchmarks are, and that's what we need to fix. So, what we did with this dataset: there are some other nice examples, like "there's a mug in some grass" or "there's some grass in a mug". These are very different pictures, and for us they're trivial. Like, what's the difference between a truck fire and a fire truck? Pretty important, I think, to get that distinction right. So, guess what: state-of-the-art models often perform below random chance. As I said, we still have a lot of work to do, which is good. And when this paper came out, I think the reaction was really nice.
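Winoground's scoring makes that "below random chance" claim precise. Each example has two images and two captions built from the same words; following the paper's definitions, a model earns the text score if each image prefers its own caption, the image score if each caption prefers its own image, and the group score only if both hold. The similarity numbers below are invented stand-ins for a real model's image-text scores:

```python
def winoground_scores(sim):
    """sim[(i, c)]: similarity between image i and caption c, i, c in {0, 1}.
    Image i and caption i are the matching pair."""
    # Text score: for each image, the matching caption must win.
    text_ok = sim[(0, 0)] > sim[(0, 1)] and sim[(1, 1)] > sim[(1, 0)]
    # Image score: for each caption, the matching image must win.
    image_ok = sim[(0, 0)] > sim[(1, 0)] and sim[(1, 1)] > sim[(0, 1)]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# Made-up similarities for "plants surrounding a lightbulb" (pair 0) vs
# "a lightbulb surrounding plants" (pair 1): image 1 slightly prefers the
# wrong caption, so the model loses the text and group scores.
sim = {(0, 0): 0.31, (0, 1): 0.27, (1, 0): 0.30, (1, 1): 0.29}
print(winoground_scores(sim))
```

Random guessing gets the group score 1/6 of the time, which is the chance level the models fall below.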
[01:07:18] So when DALL-E 2 came out... you've probably heard of DALL-E 2, right? It's sort of like Stable Diffusion, but from before Stable Diffusion. This was really the first model that showed just how impressive these generative models can be at creating images. So this is "there's a mug in some grass". You do have to cheat a little bit, because you have to add "digital art" to the prompt; if you don't add that, it breaks down completely. So it's sort of prompt hacking, I think, or sort of tuning on the test set, but okay. This is pretty good, definitely better than I think a lot of people would have expected even a couple of years ago. But it's not perfect, because people on the internet like to take more pictures of spoons than forks: if you say "there are fewer spoons than forks" or "there are fewer forks than spoons", it
[01:08:16] just really likes spoons more. You know, maybe it's like the Matrix or something, I don't know; spoons are just nicer. So again, what you can see here is that these models really are just reflections of the data they're trained on. And yes, models are getting better, but if you've looked at Stable Diffusion, it still can't count fingers and things like that. So again, there's still a lot of cool work to be done. Any questions on evaluation? No? Okay, so let's talk about other modalities then, because we've really just been focused on images. Images are great: there are lots of images on the internet, which makes them an obvious thing to focus on. Also, if you look at our brain, vision is a very dominant modality; how we understand the world is very vision-driven. But it
[01:09:18] doesn't have to be the case. There are all these other interesting problems that involve different modalities, and the most obvious one is speech, or audio: after seeing comes hearing. We could really do another lecture just like this one purely on speech and audio, and there's lots of interesting stuff to talk about; obviously we don't have time, but I'll give you another nice example of how amazing Alec Radford is at creating datasets. So there's this Whisper model that came out of OpenAI not too long ago, which was trained on 680,000 hours of multilingual, multitask speech data, that is, speech with transcriptions. And they trained this very fancy thing on it, which is actually not very fancy at all: it's just the log-mel spectrogram, which is how you represent the audio signal, and then you feed that into a big Transformer.
[01:10:09] So this is sort of your encoder, with self-attention, and then you have your decoder, with cross-attention, and then you just generate the sequence. It's a basic encoder-decoder Transformer model, but the input is one-dimensional convolutions over the log-mel spectrogram. There are lots of papers that do very similar things: there are models like wav2vec that try to turn the wave signal into vectors, or you can discretize it in lots of different ways, so there's a wealth of literature. Then one of the funny observations, actually, is that you can just reduce audio to vision anyway, which is sort of what you could argue the log-mel spectrogram does. Not to toot my own horn, but in 2017 I did a paper where we showed that you can just take a raw audio sample and turn it into a spectrogram.
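A log-mel spectrogram can be computed in a few lines. This is a simplified numpy sketch (Whisper's front end uses 80 mel bins, 25 ms windows, and a 10 ms hop at 16 kHz, which the defaults below mirror, but its exact filterbank and normalization differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Slice the waveform into overlapping frames and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filters spaced evenly on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return np.log10(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)

# One second of a 440 Hz tone at 16 kHz -> a (frames x mel bins) "image".
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = log_mel_spectrogram(tone)
print(spec.shape)
```

The result is a 2-D array, which is exactly why "reduce audio to vision" works: this is an image you can hand to a convnet.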
Really just a spectrogram [01:11:03] so, what does the spectrum of the audio file look like, feed that to a regular convnet, like an AlexNet even, and that gives you amazing auditory features [01:11:12] so now you can use this to distinguish between violins or guitars and things like that [01:11:18] so maybe you can just reduce all of this to vision [01:11:20] and one question you could ask is, can we also reduce language to vision, or vision to language [01:11:27] that's sort of what people are thinking about today [01:11:32] so there was a question about video, and a lot of these ideas also extend pretty directly to video, but now you just have more data [01:11:40] right, so Flamingo already had a bunch of different images in it, you can do Flamingo over videos, but probably a lot of the images are pretty useless for what you're trying to do with this video 
[01:11:51] model, right, they're too similar, it doesn't really add all that much information, so you want to subsample the frames so that you get the most useful information out of your video [01:12:00] and there's a bunch of approaches that take the keyframes, and then you just do a standard joint vision-and-language Transformer encoder on top of that [01:12:11] so this is hopefully by now a very familiar recipe, right [01:12:16] and Merlot is a nice architecture that does this, and then they came up with Merlot Reserve, kind of a silly name, where they also added audio to this model, so this is now a tri-modal model [01:12:28] and so we're going towards this foundation model that can consume all of these different modalities all in one go, and that's really a clear trend in the field.
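The keyframe subsampling described here can be sketched greedily: keep a frame only when it differs enough from the last kept frame. The mean-pixel-difference criterion, threshold, and toy video are illustrative assumptions, not taken from Merlot:

```python
import numpy as np

def select_keyframes(frames, threshold=10.0):
    """Greedy keyframe selection over an array of shape (num_frames, H, W).

    A frame is kept only if its mean absolute pixel difference from the
    last kept frame exceeds the threshold (an illustrative choice).
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float)
                      - frames[kept[-1]].astype(float)).mean()
        if diff > threshold:
            kept.append(i)
    return kept

# Toy video: five near-identical frames, then a scene change at frame 5.
video = np.zeros((10, 4, 4))
video[5:] = 100.0
print(select_keyframes(video))  # -> [0, 5]
```

Only the selected frames would then be fed into the joint vision-and-language encoder.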
Another very interesting direction, I think, is one where the field was very excited for a while [01:12:48] but I think it's sort of gone now, because it's too difficult to create lots of high-quality data in this setting [01:12:56] what you can do is have simulated environments [01:12:59] so this is a paper from DeepMind from 2017 where they had this agent walk around in a maze, and it could follow natural language instructions [01:13:06] they could also generalize to things like daxes and blickets and different sorts of groundings and assignments that you could do in that environment [01:13:12] so this is a super interesting direction in the long term, because this is how humans learn language, right [01:13:17] we walk around in the world, we interact with our environments, we have all of these different perceptual observations, we synthesize them in our brain, we manipulate objects, we change our own viewpoint, and 
that's how we learn everything we know about the world [01:13:32] and so our language is very intricately connected to that world and how we observe it [01:13:38] so I think that might make a comeback at some point in the future [01:13:43] you can also do other stuff, especially with this kind of conditioning on text that we're seeing a lot of [01:13:51] so you know, DALL-E 2 and Stable Diffusion and all of these different things, and the original we talked about at the beginning [01:13:56] you can do the same thing, but now you're generating 3D point clouds [01:14:01] right, so this is a 3D corgi [01:14:05] and this prompt can probably become much more complex over time, you can do sort of AutoCAD design and just say, give me a house, and it's just going to design the whole house for you [01:14:17] so you can just tweak the prompt and things like that, that's 
all coming [01:14:21] or even already here in many cases [01:14:24] so the final modality I briefly wanted to talk about is olfactory embeddings [01:14:33] and olfaction means smell, if you didn't know [01:14:37] it turns out my PhD thesis was about grounding semantics in different perceptual modalities [01:14:47] a lot of my work started in vision, and then audio is sort of the obvious next one, right [01:14:52] so you can learn the meaning of violin, what a violin looks like and what it is and what it sounds like, and that's going to give you a richer representation [01:15:02] but for a lot of these words, what's actually very primitive to their meaning is what they smell like [01:15:06] because in our brains that's really one of the core areas, one of the oldest areas in your brain.
So what you can try to do, if you want to complete all of your perceptual modalities, is try to build olfactory embeddings [01:15:20] so it was kind of a joke paper I did, but the funny thing is it actually worked [01:15:28] so there's this Sigma-Aldrich Fine Flavors and Fragrances catalog where you can look up words like melon and pineapple, and it's going to give you all of the chemical compounds that produce this smell or taste [01:15:44] so if you do that, then you can count the occurrences, and then you can do SVD or something like that on it, to get it to be a bit more of a real embedding model [01:15:54] so now you get smell embeddings, smell vectors, and then you can compute similarity judgments between these smells [01:16:04] so it turns out apple smells like pear, and you know, chocolate and cocoa and sweet and coffee are sort of related [01:16:12] right, so you get these 
clusters of different smells just based off of their chemical compounds [01:16:15] so this bag-of-chemical-compounds model gives you a very rich representation [01:16:20] and so you look at all of the words that are concrete enough to have a smell [01:16:28] right, so if you have a word like democracy in there, that doesn't really smell like anything [01:16:31] so you ignore democracy and you just focus on the things that smell, or that smell good, I guess [01:16:43] and then the really interesting thing to me is that this is much more correlated with human similarity judgments than the linguistic vectors we had at the time [01:16:54] right, so for a word like apple you can just get a word vector like you learned in your first lecture [01:17:01] and you can do skip-gram and things like that, but that thing is not going to be as correlated with human similarity judgments as this bag-of-chemical-compounds model.
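The count-then-SVD recipe can be sketched as follows; the word list, compound names, and counts are invented for illustration (the real counts would come from the catalog entries):

```python
import numpy as np

# Toy word-by-compound count matrix. The compound names are made up;
# in the paper these would be compounds listed in the catalog.
words = ["apple", "pear", "coffee", "chocolate"]
compounds = ["ester_A", "ester_B", "pyrazine_A", "pyrazine_B"]
counts = np.array([
    [3.0, 2.0, 0.0, 0.0],   # apple: fruity esters
    [2.0, 3.0, 0.0, 0.0],   # pear: fruity esters
    [0.0, 0.0, 3.0, 2.0],   # coffee: roasty pyrazines
    [0.0, 1.0, 2.0, 3.0],   # chocolate: mostly roasty, a bit fruity
])

# Truncated SVD turns the raw counts into dense "smell vectors".
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
embeddings = U[:, :2] * s[:2]   # keep the top two components

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple = embeddings[words.index("apple")]
pear = embeddings[words.index("pear")]
coffee = embeddings[words.index("coffee")]
print(cosine(apple, pear) > cosine(apple, coffee))  # apple smells more like pear
```

The cosine similarities between these vectors are what get compared against human similarity judgments.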
So that's pretty interesting [01:17:14] right, so even something like smell, where maybe we think this doesn't really matter: if you really want to understand how humans understand language, then maybe you want to include this in your foundation model too [01:17:27] but I would start with other modalities [01:17:31] all right [01:17:33] okay, yeah, sorry, so where to next [01:17:36] I think I've already said most of this actually [01:17:38] so, one foundation model is going to rule them all [01:17:43] I mean, there will be many of these, but a lot of them are going to have very similar traits, I think [01:17:49] we're going to be looking at scaling laws and trying to understand really what is the relationship between the different modalities, which one do we want more of, that sort of stuff [01:17:58] we're going to have retrieval 
augmentation [01:18:01] this thing is going to be really huge, if you've heard of RAG, or if you haven't, you should look it up [01:18:06] so all of these parts of these models can also be multimodal [01:18:08] we need way better evaluation and better measurements, we already talked about that too [01:18:14] and that's all I have, thank you
[Applause]

================================================================================
LECTURE 020
================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
Source: https://www.youtube.com/watch?v=cd3pRpEtjLs
---
Transcript

[00:00:05] today I'm delighted to introduce our final guest speaker [00:00:10] Been Kim [00:00:13] Been Kim is a staff research scientist at Google Brain [00:00:16] if you're really into googleology you know those funny words at the beginning, like staff, which sort of says how senior you are [00:00:22] and that means that Been's a good research scientist [00:00:27] so I discovered at lunch today that Been started out studying mechanical engineering at Seoul National University, but she 
moved [00:00:37] on to, I don't know if it's better things or not, but she moved on to computer science and did a PhD at MIT [00:00:44] and there she started working on the interpretability and explainability of machine learning models [00:00:52] I think she'll be talking about some different parts of her work, but a theme that she's had in some of her recent work, which I find especially appealing as an NLP person, is the idea that we should be using higher-level human-interpretable languages for communication between people and machines [00:01:14] so welcome, Been, looking forward to your talk, go for it [00:01:18] thank you, thank you [00:01:25] thanks for having me, it's an honor to be here [00:01:27] it's the rainiest Stanford I've ever seen [00:01:31] I got here last night, but I live in Seattle, so this is pretty common [00:01:37] I still was able to see the blue sky today, I was like, this works, I 
really like it here [00:01:42] so today I'm going to share some of my dreams, chasing my dreams to communicate with machines [00:01:48] so if you're in this class you probably agree, you don't have to, that large language models and generative models are pretty cool, they're impressive [00:01:59] but you may also agree that they're a little bit frightening [00:02:02] not just because they're impressive, they're doing a really good job, but also we're not quite sure where we're going with this technology [00:02:10] in 10 years, will we look back and say that technology was net positive, or will we say, ah, that was catastrophic, we didn't know that would happen [00:02:22] and ultimately what I would like to do, or maybe hopefully what we all want to do, is to have this technology benefit us humans [00:02:31] I know in 10 years' time, or maybe 20 years or earlier, he's gonna ask me, he's gonna be like, Mom, did you work on this AI stuff, I 
watched some of your talks [00:02:43] and did you know how this would profoundly change our lives, and what did you do about that [00:02:51] and I have to answer that question, and I really hope that I have some good things to say to him [00:02:59] so my initial thought, or current thought, is that if we want our ultimate goal to be to benefit humanity, why not directly optimize for it, why wait [00:03:13] so how can we benefit, there's lots of different ways we can benefit, but one way is to treat this like a colleague, you know, a colleague who is really good at something [00:03:26] it's not perfect, but it's good at something, enough that you want to learn something from them [00:03:32] one difference though, in this case, is that this colleague is kind of weird [00:03:36] this colleague might have very different values, it might have very different experiences in the world, it may not care about surviving as much as we do, maybe 
[00:03:48] mortality isn't really a thing for this colleague [00:03:52] so you have to navigate that in our conversation [00:03:57] so what do you do when you first meet somebody, someone so different, what do you do [00:04:01] you try to have a conversation [00:04:03] to figure out, how do you do what you do, how are you solving the decades-old protein-folding problem [00:04:11] how are you beating the world Go champion so easily, it seems [00:04:17] are you using the same language, the science knowledge, the language that we use, atoms, molecules, or do you think about the world in a very different way [00:04:27] and more importantly, how can we work together [00:04:32] there's one area I really want to talk about, and it's AlphaGo [00:04:36] so AlphaGo beat the world Go champion Lee Sedol in 2016.
Lee Sedol is from South Korea, I'm from South Korea [00:04:43] I watched every single match, it was such a big deal in South Korea, and worldwide I hope [00:04:48] and in one of the matches AlphaGo played this move called move 37 [00:04:54] how many people watched the AlphaGo matches, and how many people remember move 37 [00:05:00] yeah, a few people, right [00:05:03] and I remember the nine-dan commentator, who had been talking a lot throughout the matches, suddenly got really quiet [00:05:09] and he said, hmm, that's a very strange move [00:05:14] and I knew then that something really interesting had just happened, in my eyes [00:05:20] that this was gonna change something, that AlphaGo had made something we're gonna remember forever [00:05:25] and sure enough, this move turned around the game for AlphaGo, leading AlphaGo to win one of the matches [00:05:33] so Go players today continue to analyze this move and still discuss it, people talk 
about how this is not a move a human would fathom [00:05:42] so the question is, how did AlphaGo know this was a good move [00:05:49] my dream is to learn something new by communicating with machines, having a conversation [00:05:56] such that humanity will gain some new angle on our important problems, like medicine and science and many others [00:06:04] and this is not just about discovering new things [00:06:07] if you think about reward hacking, you have to have a meaningful conversation with somebody to truly figure out what their true goal is [00:06:18] so in a way, solving this problem is a superset of solving AI safety too [00:06:26] so how do we have this conversation [00:06:28] conversation assumes that we share some common vocabulary, that we exchange meaning and ultimately knowledge, and naturally representation plays a key role in this conversation [00:06:42] we can visualize this: on the left is the representational space of 
what humans know, and on the right what machines know [00:06:50] here in the left circle there will be something like "this dog is fluffy", and you know what that means because we all share a somewhat similar vocabulary [00:06:59] but on the right we have something like move 37, which we humans don't yet have a representation for [00:07:10] so how do we have this conversation: our representation spaces need to overlap, and the more overlap we have, the better conversation we're going to have [00:07:17] humans are all good at learning new things, like here, everyone is learning something new [00:07:24] so we can expand what we know by learning new concepts and vocabularies [00:07:30] and doing so, I believe, will help us build machines that can better align with our values and our goals [00:07:39] so this is a talk that I gave, if you're curious about some of the work we're doing towards this direction I highly recommend it, it's a YouTube video, an ICLR keynote, half an hour 
you can fast-forward it [00:07:50] but today I'm going to talk more about my hopes and dreams [00:07:54] and hopefully, at the end of the day, your hopes and dreams too [00:07:59] so first of all, I'm just gonna set the expectation: at the end of this talk we still won't know how move 37 was made, okay, sorry [00:08:09] that's going to take a while [00:08:12] in fact, the first part of this talk is going to be about how we have moved backwards in making this progress [00:08:23] and we are still at a very, very small portion of our entire journey towards understanding move 37 [00:08:31] and of course this journey won't be a singular path, there will be lots of different branches coming in, core ideas like the Transformer helped many domains, and it will be similar here [00:08:43] so in part two I'm going to talk about some of our work on understanding emerging behaviors in reinforcement learning 
[00:08:51] and all the techniques that I'm going to talk about are, in principle, applicable to NLP.
[00:08:59] So coming back to our hopes and dreams, move 37. Let's first think about how we might realize this dream. Taking a step back, we have to ask: do we even have tools to estimate what machines know? There have been many developments in machine learning over the last decade to build tools to understand and estimate this purple circle. So, is that estimate accurate? Unfortunately, a lot of recent research has shown that there's a huge gap between what machines actually know and what we think the machines know. And identifying and bridging this gap is important, because these tools will form the basis for understanding that move 37.
[00:09:50] So what are these tools? How many people are familiar with saliency maps? A lot, but for those who aren't, I'll explain what it is. A saliency map is one of the popular interpretability methods. For simplicity, let's say we're on ImageNet. You have an image like this, of a bird, and the explanation is going to take the form of the same image, but where each pixel is associated with a number that is supposed to imply some importance of that pixel for the prediction on this image. And one definition of that importance is that the number indicates what the function looks like around this pixel. So for example, if I have a pixel x_j, maybe around x_j the function moves up like the yellow curve, or the function is flat, or the function goes down like the green curve. And so if it's flat, like the blue curve or the red curve, maybe that feature is irrelevant to predicting "bird."
If it's going up, then maybe it's more important, because as the value of x increases, the function value, the prediction value here, goes up.
[00:11:01] So let's think about a few reasons why this gap might exist. These are a few; this isn't exhaustive, and they overlap a little bit, but they're helpful for us to think about. Maybe our assumptions are wrong: this alien, again, these machines that we train, works in a perhaps completely different representational space, with very different experiences of the world. So assuming that it sees the world just like we do, like with the gestalt phenomenon, where there are a few dots and humans have a tendency to connect them, maybe machines have that too, maybe not. So maybe our assumptions about these machines were wrong. Maybe our expectations are mismatched: we thought it was doing X, but it was actually doing Y. Or maybe it's beyond us: maybe it's showing something superhuman that humans just can't understand.
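The "importance as local function shape" idea from a moment ago can be sketched in a few lines. This is a toy illustration of that definition, not code from the talk: the model f, the inputs, and the finite-difference approximation are all made up here.

```python
# Toy sketch of gradient-style saliency, assuming importance = local slope
# of the model's output with respect to each input feature.

def saliency(f, x, eps=1e-6):
    """Finite-difference estimate of df/dx_i at the point x: one number per feature."""
    base = f(x)
    scores = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps  # nudge one feature, hold the rest fixed
        scores.append((f(bumped) - base) / eps)
    return scores

# Made-up "bird score": feature 0 matters a lot, feature 2 is a flat direction.
def f(x):
    return 3.0 * x[0] + 0.5 * x[1]

print(saliency(f, [1.0, 2.0, 3.0]))  # ≈ [3.0, 0.5, 0.0]
```

A large score means the function rises steeply around that feature (the "yellow curve" case); a score near zero is the flat, "irrelevant" case.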
[00:11:55] I'm going to take a deeper dive into some of this, our own work; this is more recent work. So, coming back to the earlier story about saliency maps, we're going to play with some of these methods. Now, in 2018 we stumbled upon a phenomenon that was quite shocking. We were actually trying to do something different, but we were testing something, and we realized that a trained network and an untrained network have very similar saliency maps. In other words, a random prediction and a meaningful prediction were giving me the same explanation. So that was puzzling; we thought we had a bug, but it turned out we didn't. They actually are indistinguishable, qualitatively and quantitatively. So that was shocking.
[00:12:47] But then we wondered: maybe this is a one-off case; maybe it still works somehow in practice. So we tested that in a follow-up paper. Okay, what if the model had an error, one of these errors: maybe it has a labeling error, maybe it has a spurious correlation, maybe it hits out-of-distribution data at test time. If we intentionally insert these bugs, can the explanation tell us that there's something wrong with the model? It turns out that that's also not quite true. You might think, oh, maybe at least for spurious correlation; another follow-up work showed that this is also not the case. So we were disappointed. But then we still said, you know, there's no theoretical proof of this; maybe this is again a lab-setting test, where we had grad students test the system; maybe there's still some hope.
[00:13:48] So this is more recent work where we theoretically prove that some of these very popular methods
cannot do better than random. So I'm going to talk a little bit about that. I'm missing one person, oh, I'm missing Pang Wei in the author list, I just realized; this is also work with Pang Wei.
[00:14:07] So let's first talk about our expectation. What is our expectation about this tool? Now, the original papers that developed these methods, IG and SHAP, talk about how IG can be used for accounting for the contributions of each feature. So what that means is that when the tool assigns zero attribution to a pixel, we're going to say, okay, that pixel is unused by the function, and that means that f will be insensitive if I perturb this x.
[00:14:40] And in fact this is how it's been used in practice. This is a paper published in Nature; they used SHAP to figure out the eligibility criteria in a medical trial.
[00:14:53] What we show in this work is that none of these inferences, which seemed pretty
natural, were true. In fact, just because a popular attribution method tells you the attribution is x, you cannot conclude anything about the actual model behavior.
[00:15:12] So how does that work? How many people here do theory proofs? A few, great. I'll tell you, I learned about theory proving from this project as well. The way that we pursued this particular work was to first think about the problem, and then formulate it as some other problem that we know how to solve. In this case we formulated it as hypothesis testing, because once you formulate it as hypothesis testing, yes or no, there are lots of tools in statistics you can use to prove things.
[00:15:48] So what is the hypothesis? The hypothesis is: I'm a user, I got an attribution value from one of these tools, and I have a mental model of, ah, this feature is important, or maybe not important.
Then the hypothesis is whether that's true or not. And what we showed is that, given whatever hypothesis you may have, you cannot do better than random guessing at validating or invalidating it. And that means, yes, sometimes it's right, but you don't do hypothesis testing if you cannot validate yes or no. You just don't, because what's the point of doing it if it's no better than random guessing?
[00:16:32] And the result is: yes. This graph is just a visualization of our results. If you plot true negative rate against true positive rate, the diagonal line is random guessing; one corner is the worst method, the other corner is the best method. The methods that we know, SHAP and IG, all fall on or under this line of random guessing. That's bad news.
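To make the random-guessing line concrete, here is a small sketch (my construction for illustration, not the paper's experiment): a guesser that flags a feature as "important" with some fixed probability p always lands near the line TPR + TNR = 1, whatever p is. The claim is that popular attribution methods do no better than points on that line.

```python
import random

def rates(guess_p, labels, seed=0):
    """True-positive and true-negative rates of a coin-flip 'attribution method'
    that calls a feature important with probability guess_p."""
    rng = random.Random(seed)
    tp = fn = tn = fp = 0
    for important in labels:
        guess = rng.random() < guess_p
        if important and guess:
            tp += 1
        elif important:
            fn += 1
        elif guess:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Made-up ground truth: every third feature is actually important.
labels = [i % 3 == 0 for i in range(3000)]
tpr, tnr = rates(0.7, labels)
print(tpr + tnr)  # close to 1.0, for any choice of guess_p
```

Changing `guess_p` just slides the guesser along the line, trading TPR for TNR; it never gets above it.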
[00:16:59] But maybe this still works in practice for some reason; maybe there were some assumptions that we had that weren't quite met in practice. So does this phenomenon hold in practice? The answer is yes. We now have more experiments and bigger models, but here we tested two concrete end tasks that people care about in interpretability and use these methods for: recourse and spurious correlation. Recourse, for those who are not familiar: you're getting a loan, and you wonder whether, if I were older, I would have a higher chance of getting the loan. So you tweak this one feature and see if your value goes up or down. A very reasonable task that people do all the time, with pretty significant implications socially.
[00:17:45] For these two concrete end tasks, both of them boil down to the hypothesis testing framework that I talked about, and they're all around the random guessing line, or worse than random guessing.
[00:18:01] So you might say, oh no, this is not good, a lot of people are using these tools, what do we do? We have a very simple idea about this.
[00:18:10] So, people like developing complex tools, and I really hope you're not one of those people, because a lot of times simple methods work: Occam's razor. But also, simple methods are elegant. There's a reason, perhaps, why a lot of times they work: they're simple enough that you can understand them, they make sense. So let's try that idea here. Again, your goal is to estimate a function's shape. What do you do? Well, the simplest thing you can do is: you have a point of interest, you sample around that point, and you evaluate the function around that point. If it goes up, maybe the function is going up; if it goes down, maybe the function is coming down. That's the simplest way; you can kind of brute-force it.
[00:18:58] But then the question is, how many samples do we need? So here, this is the equation
that shows how you lift yourself above that random-guessing line, by adding that additional term. It's proportional to the number of samples: the more samples you have, the better estimation you have, which makes sense. And the difference in output: how much resolution do you care about? Do you care about a slope of 0.1 versus 0.2, or do you only care about zero slope versus slope one? That's the resolution you care about. And the number of features, of course. So if you worry about making some conclusion based on function shape: sample. Easy.
[00:19:42] So, can we infer model behavior using these popular methods? The answer is no, and this holds in both theory and practice. We're currently working on even bigger models to show, again, empirical evidence that yes, it just really doesn't work. Please, you know, think twice and three times before using these methods.
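The sample-around-the-point recipe above can be sketched like this. This is a toy version, not the paper's estimator; the wiggly test function and the sample count are made up. Draw random perturbations near the point of interest, evaluate f, and fit a local slope; more samples (and a coarser resolution requirement) make the estimate easier.

```python
import math
import random

def local_slope(f, x, radius=0.1, n_samples=500, seed=0):
    """Estimate the slope of f near x: sample offsets d, then fit a
    least-squares line through the points (d, f(x + d) - f(x))."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        d = rng.uniform(-radius, radius)
        num += (f(x + d) - f(x)) * d
        den += d * d
    return num / den

# Made-up test function: global slope 2 with small local wiggles.
f = lambda x: 2.0 * x + 0.01 * math.sin(40.0 * x)
print(local_slope(f, 1.0))  # close to 2.0; more samples -> steadier estimate
```

The brute-force flavor is the point: no gradients or attribution machinery, just evaluate the function where you actually care about it.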
[00:20:06] There's also a model-dependent sample complexity: if your function is kind of crazy, of course you're going to need more samples. So what is the definition, how do we characterize those functions?
[00:20:17] And finally, we haven't quite given up yet, because these methods have pretty good roots in economics, Shapley values and all that. So maybe there's a much narrower condition under which these methods work, and we believe such a condition does exist; we just have to figure out when. Once we figure out what that condition is, then given a function I can test it and say: yes, I can use SHAP here, yes, I can use IG here, or no, I can't. That would still be very useful. So, ongoing work.
[00:20:51] Before I go to the next one, any questions?
[00:20:54] Yes? Do the findings you have about these models only apply to computer vision models, or do they apply to any model that has a function?
[00:21:08] Yeah, it's actually a very simple proof; the simplest proof shows that this holds for any function. Any other questions?
[00:21:20] This relates a lot to you: it seems like for the last couple of years there have been at least dozens, maybe hundreds, of papers written using Shapley values. Would you guess that most of that work is invalid, or that a lot of it might be okay because the condition where it's all right happened to hold?
[00:21:51] So, two answers to that question. My hypothesis testing result shows that it's random, right? So maybe in the optimistic case, the optimistic case, 50% of those papers got it right by chance.
[00:22:06] And on the other side, on the second note: even if maybe SHAP wasn't perfect, maybe it was kind of wrong, but even so, if it helped the human at the end task, whatever that
might be, helped doctors be more efficient, identified bugs and whatnot, and if they did the validation correctly, with the right controlled testing setup, then I think it's good: you figured out somehow how to make these noisy tools work together with a human in the loop, maybe, and that's also good. And I personally really like the SHAP paper, and I'm a good friend of Scott's and I love all his work. It's just that I think we need to narrow down our expectations so that our expectations are better aligned.
[00:22:49] All right, I'm going to talk about another work of a kind of similar flavor, now in NLP. So this is one of those papers, just like many other papers that we ended up writing, one of those serendipity papers. So initially Peter came as an intern, and we thought we were going to locate ethical knowledge in these large language models, and then
maybe we were going to edit them to make them a little more ethical. So that was the goal. And then we thought, oh, the ROME paper from David Bau, and I also love David's work, let's use that. So that's the start of this work. But then we started digging in and implementing ROME, and things didn't quite line up. So we did sanity-check experiment after sanity check, and we ended up writing a completely different paper, which I'm about to talk to you about.
[00:23:39] So, this paper, ROME, for those who are not familiar, and I'm going into a little more detail in a bit, is about editing a model. You first locate a piece of knowledge in a model, like "the Space Needle is in Seattle," that's a piece of factual knowledge; you locate it, and you edit it. Because you can locate it, you can mess with it to edit that fact. That's the whole promise of it; in fact, that's a lot of
times how localization and editing methods were motivated in the literature. But what we show is that this assumption is actually not true. And to be quite honest with you, I still don't quite get why these are not related, and I'll talk more about this, because this is a big question to us; this is pretty active work.
[00:24:29] So, a substantial fraction of factual knowledge is stored outside of the layers that are identified as having the knowledge. And you will see this in a little more detail in a bit. In fact, the correlation between the location where the facts are located and how well you edit if we edit that location is completely, uncorrelated; they have nothing to do with each other.
[00:25:00] So we thought, well, maybe it's a problem with the definition of editing. What we mean by editing
can mean a lot of different things, so let's think about different ways to edit a thing. So we tried a bunch of things, with little success: we couldn't find an editing definition that actually relates really well with localization methods, in particular with ROME.
[00:25:26] So let's talk a little bit about ROME, how ROME works, super briefly; there are a lot of details missing on this slide, but you'll roughly get the idea. So ROME is Meng et al., 2022. They have what's called the causal tracing algorithm, and the way it works is that you're going to run a model on a particular dataset, the CounterFact dataset, which has tuples of subject, relation, and object: "The Space Needle is located in Seattle." So you're going to have a clean run of "The Space Needle is in Seattle" one time, and you store every single module's activations. And then in the second run, which they
then in the second run which they call corrupted run you're going to add [00:26:10] call corrupted run you're going to add noise in those Space Needle is or or the [00:26:14] noise in those Space Needle is or or the space [00:26:15] space then then you're going to intervene at [00:26:19] then then you're going to intervene at every single one of those modules [00:26:21] every single one of those modules as if from by copying this module to the [00:26:25] as if from by copying this module to the corrupted run so as if that particular [00:26:27] corrupted run so as if that particular model was never [00:26:29] model was never interrupted never a noise was never [00:26:32] interrupted never a noise was never added to that module [00:26:34] added to that module so it's a typical like intervention case [00:26:36] so it's a typical like intervention case where you pretend everything else being [00:26:39] where you pretend everything else being equal if I change just this one module [00:26:43] equal if I change just this one module what is the probability of having the [00:26:45] what is the probability of having the right answer so in this case probability [00:26:47] right answer so in this case probability of the right answer Seattle given that I [00:26:50] of the right answer Seattle given that I know it's the model and I intervened on [00:26:53] know it's the model and I intervened on it [00:26:54] it so at the end of the day you'll find [00:26:57] so at the end of the day you'll find graph like that where each layer and [00:27:00] graph like that where each layer and each token has a score How likely it is [00:27:03] each token has a score How likely it is if I intervene on that token in that [00:27:06] if I intervene on that token in that layer how How likely is it that I will [00:27:09] layer how How likely is it that I will recover the right answer because if I [00:27:11] recover the right answer because if I recover right answer that's the model [00:27:13] 
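To make the loop concrete, here is a toy sketch of causal tracing in the spirit just described: a stand-in "model" (a stack of random tanh layers, nothing like GPT-J), a clean run whose activations are cached, a corrupted run, and then restoring one clean activation at a time. All names, dimensions, and the model itself are invented for illustration; the real method patches per (token, layer) and reads off the model's actual answer probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the transformer: a stack of tanh layers plus a linear
# readout whose output we treat as the score of the right answer ("Seattle").
W = [rng.normal(size=(8, 8)) for _ in range(4)]   # four "layers"
readout = rng.normal(size=8)

def forward(x, patch_layer=None, patch_value=None):
    """Run the stack; optionally overwrite one layer's output (the patch)."""
    h, states = x, []
    for i, w in enumerate(W):
        h = np.tanh(w @ h)
        if i == patch_layer:
            h = patch_value            # restore the cached clean activation
        states.append(h)
    return states, float(readout @ h)

clean_x = rng.normal(size=8)                       # "The Space Needle is in ..."
clean_states, clean_score = forward(clean_x)       # clean run: cache everything

noisy_x = clean_x + rng.normal(scale=3.0, size=8)  # corrupted run: noised subject
_, corrupt_score = forward(noisy_x)

# Causal tracing: rerun the corrupted input, restoring one clean activation
# at a time; the layer whose restoration recovers the answer "stores" the fact.
effects = [forward(noisy_x, patch_layer=i, patch_value=clean_states[i])[1]
           - corrupt_score
           for i in range(len(W))]
best_layer = int(np.argmax(effects))
```

With a real model you would register forward hooks to cache and overwrite activations rather than threading a patch argument through a toy stack like this.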
recover right answer that's the model that's the module that's stored on [00:27:14] that's the module that's stored on knowledge [00:27:16] knowledge really reasonable algorithm I couldn't [00:27:18] really reasonable algorithm I couldn't find technical flow in this algorithm I [00:27:20] find technical flow in this algorithm I quite like it actually [00:27:24] so but but when we start looking at this [00:27:27] so but but when we start looking at this using the same model that they use GPT [00:27:29] using the same model that they use GPT gptj we realized that a lot of these [00:27:34] gptj we realized that a lot of these facts so so Rome uses just layer 6 to [00:27:37] facts so so Rome uses just layer 6 to edit because that was the supposedly the [00:27:39] edit because that was the supposedly the best layer across this data set to add [00:27:42] best layer across this data set to add in most of the factual knowledge is [00:27:44] in most of the factual knowledge is stored in layer 6 and they showed uh [00:27:46] stored in layer 6 and they showed uh editing success and whatnot [00:27:49] editing success and whatnot but we realized the truth looks like the [00:27:51] but we realized the truth looks like the graph on the right so the red line is [00:27:54] graph on the right so the red line is the layer 6 their extension paper called [00:27:56] the layer 6 their extension paper called memet and it's multiple layers that's [00:27:59] memet and it's multiple layers that's the Blue Line blue region [00:28:01] the Blue Line blue region the black bars are histogram of where [00:28:04] the black bars are histogram of where the knowledge was actually peaked if you [00:28:06] the knowledge was actually peaked if you test every single layer and as you can [00:28:09] test every single layer and as you can see not a lot of facts fall into that [00:28:11] see not a lot of facts fall into that region so in fact every single fact has [00:28:13] region so in fact every single 
fact has like different regions that where it [00:28:15] like different regions that where it peaked so layer six for a lot of facts [00:28:18] peaked so layer six for a lot of facts weren't the best layer [00:28:20] weren't the best layer what the editing really worked it really [00:28:22] what the editing really worked it really works and we did we were able to [00:28:24] works and we did we were able to duplicate that results so we thought [00:28:26] duplicate that results so we thought what do we do to find this ethics [00:28:29] what do we do to find this ethics ethical knowledge how do we find the [00:28:31] ethical knowledge how do we find the best layer to edit so that's where we [00:28:33] best layer to edit so that's where we started but then we thought you know [00:28:36] started but then we thought you know what take a step back we're going to [00:28:38] what take a step back we're going to actually do alternative check first to [00:28:40] actually do alternative check first to make sure that tracing effect the the [00:28:43] make sure that tracing effect the the tracing effect is the localization [00:28:46] tracing effect is the localization rip implies better editing results and [00:28:49] rip implies better editing results and that's when everything started to [00:28:51] that's when everything started to falling apart [00:28:53] falling apart so let's define some metrics first the [00:28:56] so let's define some metrics first the edit success this is the rewrite score [00:28:59] edit success this is the rewrite score same score as roam paper used that's [00:29:01] same score as roam paper used that's what we use and the tracing effect this [00:29:04] what we use and the tracing effect this is localization [00:29:05] is localization is probably you can beat the due to the [00:29:08] is probably you can beat the due to the slide [00:29:09] slide so when we plotted the relation between [00:29:12] so when we plotted the relation between tracing effect 
[00:29:15] The red line is what the editing method implies: perfect correlation. That was our assumption, that they would be perfectly correlated, which is why we do localization to begin with. The actual line was the yellow one. It's close to zero; it's actually negative in this particular dataset. That's not even uncorrelated, it's anti-correlated.
[00:29:39] And we didn't stop there; we were so puzzled. We did this for every single layer, and we computed the R-squared value: how much does the choice of layer, versus the localization, the tracing effect, explain the variance of successful edits? If you're not familiar with R-squared, think about it as the importance of a factor. And it turns out that layer takes 0.94, and the tracing effect is 0.016. So we were really puzzled; we were scratching our heads: why is this true?
[00:30:15] But it was true across layers. We tried all sorts of different things: we tried a different model, we tried a different dataset, and it was all roughly the same. At this point we contacted David, and we started talking about it, and they acknowledged that this is a phenomenon that exists.
[00:30:35] There was a question: apart from the layer, the other way in which localization can happen is the token; are you looking at the correct token, the other axis in this graph? Could localization at least help you find the correct subject token? Yeah, yeah; looking at any of the subject tokens sort of works. But layer, layer is the biggest thing; that's the only thing you should care about if you care about editing.
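The variance decomposition just described (choice of layer versus tracing effect) can be sketched on synthetic data. The numbers below are fabricated so that layer drives edit success and the tracing effect is independent noise; only the R-squared bookkeeping mirrors the analysis, not the actual measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated edit records: success depends only on which layer was edited,
# while the causal-tracing effect is unrelated noise.
n = 2000
layer = rng.integers(0, 28, size=n)                  # layer chosen for the edit
tracing = rng.uniform(0.0, 1.0, size=n)              # tracing effect at that layer
success = -0.01 * (layer - 6) ** 2 + 0.02 * rng.normal(size=n)  # rewrite score

def r_squared(features, y):
    """OLS R^2 with an intercept: fraction of variance explained."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_layer = r_squared(np.eye(28)[layer], success)     # one-hot: full layer effect
r2_tracing = r_squared(tracing.reshape(-1, 1), success)
```

On this synthetic data the layer R-squared comes out near 1 and the tracing R-squared near 0, the same qualitative picture as the 0.94 versus 0.016 split in the talk.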
[00:31:06] In fact, don't worry about localization at all; it's extra wasted carbon, a climate effect, yeah. So that was our conclusion.
[00:31:16] But then we thought, you know, maybe the particular definition of edit that they used in ROME was the issue. Maybe there exists a definition of editing that correlates a lot better with localization, because there must be; I'm still puzzled why this isn't correlated. So we tried a bunch of different definitions of edits: you might inject an error, you might reverse the tracing, you might want to erase the fact, you might want to amplify the fact. All these things; maybe one of these would work. It didn't.
[00:31:53] So the graph you're seeing down here is the R-squared value for four different methods, and this wasn't just the case for ROME and MEMIT; it was also the case for fine-tuning methods. You want to look at the difference between the blue and orange bars; that represents how much the tracing effect influenced the R-squared value. As you can see, it's negligible; they're all the same. You might feel that fact forcing, the last one, has a little bit of hope, but still, compared to the impact of the choice of layer, it's negligible.
[00:32:28] So at this point we said, okay, well, we can't locate the factual knowledge in this project; we're going to have to switch directions, and we ended up doing a lot more in-depth analysis on this.
[00:32:44] So in summary: does localization help editing? No. The relationship is actually zero for this particular editing method, which from what I know is pretty state of the art, on the CounterFact dataset. It's not true. Are there any other editing methods that correlate better? No.
But if somebody can answer this question for me, that will be very satisfying, because I feel like there should still be something there that we're missing.
[00:33:12] But causal tracing, I think what it does is reveal the factual information while the transformer is passing it forward; I think it represents where the fact is during that pass. What we found here is that this has nothing to do with editing success. Those two things are different, and we have to resolve that somehow.
[00:33:35] But a lot of the insights they found in their paper are still useful, like that the early-to-mid-layer MLP representations at the last subject token represent the factual information; that's something we didn't know before. But it is important not to validate localization methods using editing methods, now we know, and maybe not to motivate editing methods via localization. Those are the two things we now know we shouldn't do, because we couldn't find a relationship.
[00:34:04] Any questions on this one before I move on to the next one?
[00:34:15] You're not shocked by this? I am shocked by this. I'm still so puzzled; there should be something, I don't know.
[00:34:26] All right. So in summary of this first part, we talked about why the gap might exist between what machines know and what we think machines know. There are three ideas: maybe our assumptions are wrong, maybe our expectations are wrong, maybe it's beyond us. There's a good quote that says good artists steal; I think good researchers doubt. We have to be really suspicious of everything that we do, and that's maybe the biggest lesson I've learned over many years: once you like your results too much, that's a bad sign. Go home, have a beer, go to sleep, and the next day you come back and put your
paper on your desk and think, okay, now I'm going to review this paper. How do I criticize it? What do I not like about this paper? That's one way to look at it: criticize your own research, and that will improve your thinking a lot.
[00:35:26] So let's bring our attention back to our hopes and dreams; it keeps coming back. Here I came to realize that maybe, instead of just building tools to understand, perhaps we need to do some groundwork. What do I mean? Well, this alien that we've been dealing with, trying to generate explanations for, seems to be a different kind. So maybe we should study them as if they're a new species in the wild.
[00:35:54] So what do you do when you observe a new species in the wild? You have a couple of ways, but one of them is an observational study: you saw some species in the wild, far away, and first you just kind of watch them, and see what they're like, what their habitat is, what their values are and whatnot. The second way, you can actually intervene and do a controlled study. We did something like this with a reinforcement learning setup.
[00:36:25] I'm going to talk about these two papers. First paper: emergent behaviors in multi-agent systems have been so cool. Who saw this hide-and-seek video by OpenAI? Yeah, it's so cool. If you haven't seen it, just Google it and watch it; it's so fascinating. I'm only covering the tip of the iceberg here, but at the end of this hide-and-seek episode, at some point the agents discover a bug in the physics system and start anti-gravity flying in the air, shooting past hiders everywhere. A super interesting video, you must watch it. So lots of that, and also humanoid football and capture the flag from DeepMind; lots of interesting behaviors emerging that we observed.
[00:37:11] Here's my favorite one. These labels here are labels provided by OpenAI: running and chasing, fort building, ramp use. And these were produced by humans who painstakingly, one by one, watched all these videos and labeled them manually.
[00:37:31] So our question is: is there a better way to discover these emergent behaviors? Perhaps some nice visualization can help us explore this complex domain a little better. That's our goal. So in this work we're going to again treat the agents like a new species and do an observational study, and what that means is that we only get to observe state and action pairs: where they are and what they're doing. And we're going to discover agent behavior by basically clustering that data; that's all we're going to do.
[00:38:12] And how do we do it? Pretty simple: a generative model. Have you covered Bayesian generative graphical models in this class? No? Gotcha, okay. So this is a graphical model; think about it as a fake or hypothetical data-generation process. How does it work? Say I'm generating the data, I created this system. I'm going to first generate a joint latent embedding space, numbers that represent all the behaviors in the system. Then, for each agent, I'm going to generate another embedding, and each embedding, when it's conditioned on state, is going to generate a policy: it decides what action to take given the state and embedding pair. And what that whole thing generates is what you see: the state and action pairs.
[00:39:09] Given this, you build a model and do inference to learn all these parameters; it's kind of the same business as a neural network, it just has a little more structure. This is completely made up, right? This is my idea of how these new species might work, and our goal is to try it and see if anything useful comes up. The way you do this, or one of the ways, is to optimize a variational lower bound. You don't need to know that; it's very cool actually, if one gets into this exponential-family business. CS 228.
[00:39:49] Okay, so here's one of the results we had. It's a domain called MuJoCo. Here we pretend that we have two agents, one controlling the back leg and one controlling the front leg, and on the right we're showing the joint embedding space, z-omega and z-alpha, while the video is running.
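A minimal sketch of the hypothetical data-generating process just described: a joint latent z_omega for the whole system, per-agent latents z_alpha drawn around it, and a policy mapping (state, z_alpha) to action probabilities. The dimensions and the policy form are invented; the actual work fits this model in the reverse direction with a variational lower bound, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

D, A, N_AGENTS, T = 4, 3, 2, 5   # latent dim, actions, agents, timesteps

def generate_episode():
    z_omega = rng.normal(size=D)                        # joint behaviour latent
    data = []
    for _ in range(N_AGENTS):
        z_alpha = z_omega + 0.1 * rng.normal(size=D)    # agent-specific latent
        W = rng.normal(size=(A, D))                     # this agent's policy head
        state = rng.normal(size=D)
        for _ in range(T):
            logits = W @ (state * z_alpha)              # policy(state, z_alpha)
            p = np.exp(logits - logits.max())
            action = int(rng.choice(A, p=p / p.sum()))  # sample an action
            data.append((state.copy(), action))         # the observed pair
            state = state + 0.1 * rng.normal(size=D)    # toy dynamics
    return data

episode = generate_episode()   # what the analyst sees: (state, action) pairs
```

Inference would recover z_omega and the z_alpha values from many such episodes; clustering and plotting those inferred embeddings is what surfaces the behavior groups in the visualization.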
[00:40:05] I'm going to try to put the video back. Okay, so now I'm going to select; this is a visualization that we built, it's online, you can go check it out. You can select a little region in agent one's space, and you see it maps to a pretty tight region in agent zero's space, and it shows pretty decent running ability, so that's cool. And now I'm going to select somewhere else in agent one that maps to kind of a dispersed area in agent zero; it looks like it's not doing as well. This is just an insight we gained for this data only, but I was quickly able to identify that this tight-mapping business kind of represents the good running behaviors versus the bad running behaviors. That's something you can do pretty efficiently.
[00:40:55] And now I'm going to show you something more interesting. Of course we had to do this, because the data is there and it's so cool: we applied this framework to OpenAI's hide and seek. This has four agents; it looks like a simple game, but it has pretty complex structure, 100-dimensional observations and a five-dimensional action space.
[00:41:15] In this work, remember, we pretend that we don't know the labels given by OpenAI; we just shuffle them into the mix. But we can color our results with respect to their labels afterward. So again, this is the result, z-omega and z-alpha for the individual agents, but the coloring is something we didn't know beforehand; we just did it after the fact.
[00:41:39] You can see in the z-omega there's a nice pattern where we can roughly separate what makes sense to humans. But remember, the green and gray are kind of everywhere; they're mixed. So in this particular run of OpenAI's hide and seek, it seemed that those two representations were kind of entangled. The running and chasing, the blue dots, seem to be pretty separate and distinguishable from all the other colors, and that kind of makes sense, because that's the basis of playing this game; if you don't have that representation, you're in big trouble. But in the case of orange, which is fort building, it's a lot more distinguishable in hiders, and that makes sense, because hiders are the ones building the fort; seekers don't build the fort, so it's a little more entangled in seekers. Perhaps if seekers had built a more separate fort-building representation, maybe they would have won this game.
[00:42:44] So with this work, can we learn something interesting about emergent behaviors by simply observing the system? The answer seems to be yes, at least for the domains that we tested; a lot more complex domains should be tested, but these are the ones we had.
the ones we had but remember that these methods don't [00:43:03] but remember that these methods don't give you names of these clusters so you [00:43:05] give you names of these clusters so you would have to go and investigate and [00:43:07] would have to go and investigate and click through and explore [00:43:10] click through and explore and if the cluster represents super [00:43:12] and if the cluster represents super superhuman concept this is not going to [00:43:15] superhuman concept this is not going to help you and I'll talk a little more [00:43:17] help you and I'll talk a little more about the work that that we do try to [00:43:19] about the work that that we do try to help them but this is not for you this [00:43:21] help them but this is not for you this is not going to help you there [00:43:22] is not going to help you there and also if you have access to the model [00:43:26] and also if you have access to the model and the reward signal you should use it [00:43:28] and the reward signal you should use it why why dump it [00:43:31] why why dump it so next part we do use it I'm going to [00:43:33] so next part we do use it I'm going to talk about let's work with Nico and [00:43:36] talk about let's work with Nico and Natasha and Shay again [00:43:39] Natasha and Shay again so here this time we're going to [00:43:41] so here this time we're going to intervene we're going to be a little [00:43:43] intervene we're going to be a little intrusive but hopefully we'll learn a [00:43:45] intrusive but hopefully we'll learn a little more [00:43:46] little more so problem is that we're going to build [00:43:48] so problem is that we're going to build a new multi-agent system we're going to [00:43:50] a new multi-agent system we're going to build it from scratch such that we can [00:43:52] build it from scratch such that we can do control testing but at the same time [00:43:54] do control testing but at the same time we shouldn't sacrifice the performance 
[00:43:56] So we're going to try to match the performance of the original system, and we do succeed. I had a paper in collaboration with folks at Stanford, actually here, in 2020, where we proposed a pretty simple idea: you have a neural network, so why don't we embed concepts in a bottleneck in the middle, where one neuron represents "tree," another represents "stripes," and just train the model end to end? Why are we doing this? Because then, at inference time, you can actually intervene. You can say: we're predicting "zebra," and I don't think "tree" should matter, so I'm going to zero out that neuron, feed forward, and see what happens. It's particularly useful in medical settings, where there are some features that doctors don't want: we can cancel them and test. So this is the work extending that idea to the RL setting.
[00:44:50] It turned out not to be as simple an extension as we thought, and it came out pretty complex, but essentially we're doing that: we build a concept bottleneck for each agent, and at the end of the day what you optimize is what you usually do, typical PPO. Think of it as "make the multi-agent system work," plus minimizing the difference between the true concepts and the estimated concepts. That's all you do. Why are we doing this? Because you can intervene: now, agent 2, pretend that you can't see agent 1.
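The recipe just described (the usual objective plus a concept-matching penalty, with intervention at inference time) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the network shapes, the weight `lam`, and the NaN-masking convention for interventions are all made up for the example, and the PPO part is reduced to an opaque `task_loss` term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy concept-bottleneck head: obs -> predicted concepts -> output.
# In the talk this sits inside each agent's policy, trained with PPO;
# the shapes and weights here are invented for illustration.
W1 = rng.normal(size=(4, 3))  # obs (4 dims) -> concept predictions (3)
W2 = rng.normal(size=(3, 1))  # concepts (3) -> scalar head output

def forward(obs, concept_override=None):
    """Feed forward; optionally overwrite bottleneck units (intervention).

    concept_override uses NaN to mean "leave this concept alone"."""
    concepts = obs @ W1
    if concept_override is not None:
        keep = np.isnan(concept_override)
        concepts = np.where(keep, concepts, concept_override)
    return concepts, concepts @ W2

def total_loss(obs, true_concepts, task_loss, lam=1.0):
    """Talk's recipe: the usual objective ("make the system work") plus
    the squared difference between true and estimated concepts."""
    concepts, _ = forward(obs)
    return task_loss + lam * np.mean((concepts - true_concepts) ** 2)

obs = rng.normal(size=(1, 4))
c, y = forward(obs)
# Intervene: zero concept 0 ("pretend you can't see agent 1"), keep the rest.
c_int, y_int = forward(obs, concept_override=np.array([[0.0, np.nan, np.nan]]))
```

Zeroing one bottleneck unit and feeding forward is the whole intervention; comparing `y` and `y_int` (or, in the RL setting, the resulting rewards) tells you how much the agent relied on that concept.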
What happens now? That's what we're doing here. [00:45:29] We're going to do this in two domains. First domain: how many people have seen this cooking game before? Yeah, it's a pretty commonly used cooking domain in reinforcement learning, very simple. We have two agents, yellow and blue, and they're going to make soup. They bring three tomatoes to the pot, wait for the soup to cook, bring a dish to the cooking pot, and they get a reward. Their goal is to deliver as many soups as possible in a given amount of time.

[00:46:01] The concepts we use here are agent position, orientation, "agent has tomato," "agent has dish," etc.: things that are immediately available to you already. And you can of course tweak the environment to make it more fun. You can make it so that they have to collaborate, for example by building a wall between them so that they have to work together in order to serve any tomato soup, or you can leave them free to work independently or together, whatever your choice.

[00:46:31] First, just as a sanity check: you can detect the emergent behavior of coordination versus non-coordination. In the impassable environment, and supposing the RL system we trained worked, they were able to deliver some soups. Then you see what happens when we intervene. Let me explain this graph: on the left is the reward of agent 1 when there's no intervention, so that's the perfectly good world; on the right is when there was an intervention, the average over intervening on all concepts (I'll also show you each concept soon). If you compare left and right, you can tell that on the right, when we intervene, reward deteriorated quite a lot for both of them, and that's one way
to see that they are coordinating: [00:47:21] intervening on these concepts impacted a lot of their performance. But here's what was really interesting to me, and I'm curious whether anyone can guess. This is the same graph as the one you saw before, except that I'm plotting the intervention for each concept separately: intervening on teammate position, teammate orientation, "teammate has tomato," etc. It turns out that when we intervene on teammate orientation, the degradation of performance is the biggest, to the extent that we believe orientation had to do with their coordination. Can anyone guess why this might be?

[00:48:12] [Audience: the position... the orientation...]

[00:48:19] [Audience: Just a clarification question on orientation: is that the direction the teammate is facing?]

Yes.

[Audience: Then it seems like orientation would let you...]

Yes, yes, that's exactly right. Where were you when I was pulling my hair out over this question? Initially I was really puzzled: why not position? I expected it to be position. But that's exactly right: orientation is the first signal an agent can get about the other agent's next move. If they're facing the pot, they're going to the pot; if they're facing the tomato, they're going to get the tomato. A really interesting intuition, maybe too obvious to some, but I needed this graph to work it out.

[00:49:05] And of course you can use this to identify lazy agents. Look at the rightmost yellow agent, our friend just chilling in the background: he's lazy. If you train RL systems, there are always some agents just hanging out, not doing anything, and you can easily identify them using this graph: if I intervene, it just doesn't impact any of their rewards.
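The lazy-agent check comes down to a few lines: compare each agent's reward with and without intervening on its concepts, and flag agents whose reward barely moves. The agent names, numbers, and tolerance below are all invented for illustration.

```python
# Hypothetical per-agent mean rewards, without and with intervention on
# that agent's concepts (all numbers invented for illustration).
baseline   = {"agent1": 10.2, "agent2": 9.8, "agent3": 10.0, "yellow": 0.4}
intervened = {"agent1": 3.1,  "agent2": 2.9, "agent3": 4.0,  "yellow": 0.38}

def lazy_agents(baseline, intervened, tol=0.5):
    """An agent whose reward barely changes under intervention is doing
    so little that its concepts don't matter: it's just hanging out."""
    return [a for a in baseline
            if abs(baseline[a] - intervened[a]) < tol]

print(lazy_agents(baseline, intervened))  # ['yellow']
```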
[00:49:34] For the second domain, we're going to look at a little more complex domain, studying inter-agent social dynamics. In this domain there's a little bit of tension. It's called Cleanup. We have four agents, and they only get rewards if they eat apples, the green things. But if nobody cleans the river, the apples stop growing, so somebody has to clean the river. And you can see that with four agents trying to collect apples, you can just wait until someone else cleans the river and then collect the apples, and in fact that's sometimes what happens.

[00:50:15] The concepts here are again pretty common things: position, orientation, pollution positions, etc. We first plotted the same graph as in the previous domain, and it tells a story.
[00:50:36] The story here is that when I intervene on agent 1, it seems to influence agent 2 quite a lot. If you look at these three different graphs, showing how reward, idle time, and inter-agent distance were impacted when I intervened on agent 1: agents 3 and 4 are fine, but agent 2 is clearly influenced. So we thought, maybe that's true, but we kept wondering: there's a lot going on in this domain, so how do we know this is the case?

[00:51:12] So we decided to take another step. We're going to do a little more work here, but not a lot: we're going to build a graph to discover inter-agent relationships. This is the simplest, dumbest way to build a graph, but again, I like simple things. So how do you build a graph? Suppose you're building a graph between movies (this is not what we actually do, it's just to describe the idea). To build a matrix, each row is a movie, and the columns are features of these movies: length, genre of the movie, and so on. The simplest way to build a graph is to do a regression: exclude row i, then regress it on everyone else. That gives you betas, coefficients for each of the other rows, and each beta represents the strength of an edge: this movie is more related to that movie and not the other one, and ta-da, you have a graph. It's a toy story, and there are a lot of caveats (often you shouldn't do this), but it's the simplest way. So we did the same thing here: instead of a movie, we use an intervention on concept c on agent n as our node.
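The leave-one-out regression just described can be sketched directly. This is the "simplest, dumbest" version, plain least squares with none of the caveats handled, on synthetic data where one row is deliberately made a near-copy of another so the recovered edge is easy to check.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows = items (movies here; interventions in the actual work),
# columns = features. Synthetic data: row 1 is nearly a copy of row 0.
X = rng.normal(size=(5, 8))
X[1] = X[0] + 0.01 * rng.normal(size=8)

def loo_graph(X):
    """beta[i, j] = coefficient of row j when regressing row i on all
    the other rows; used as the strength of the edge i -> j."""
    n = X.shape[0]
    beta = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0)           # exclude row i
        coef, *_ = np.linalg.lstsq(others.T, X[i], rcond=None)
        beta[i, np.arange(n) != i] = coef
    return beta

B = loo_graph(X)
# The strongest edge out of item 1 points back at its near-copy, item 0.
```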
[00:52:34] To build this matrix, we use intervention outcomes, which wouldn't happen to be available without our framework: reward, resources collected, and many other things. When you build this graph, at the end of the day you get betas that represent the relationships between these interventions. Okay. So I had a figure of that matrix; apparently I removed it before I came over, but imagine there was a matrix, with a nicely highlighted edge between agent 1 and agent 4, and only that, contradicting our original hypothesis. This is the video of it. When we stared at that matrix, it turned out there were no strong edges between agents 1 and 2, so we thought, that's weird; but there were strong edges between agents 1 and 4. So we dug deeper and watched a lot of sessions to validate what was happening, and it turned out the story was a lot more complicated.
[00:53:34] Agent 1's orientation was important for agent 4, but when that fails, agents 1 and 2 kind of get cornered. You can see it in the video: agents 1 and 2, the blue and yellow agents, get into the corner together and get stuck. And this was simply accidental, a consequence of the way we built this environment; it just happened. But the raw statistics wouldn't have told us this story, that it was completely accidental; in fact there was no correlation, no coordination, between agents 1 and 2, and only after building the graph did we realize this was the case. Now, this might be a one-off case, but you know what, a lot of the emergent behaviors we want to detect will be one-off cases, and we really want to get to the truth of those rather than settling for surface-level statistics.

[00:54:31] So: can we build a multi-agent system that enables intervention and performs as well? The answer is yes. There's a graph showing the red line and the blue line roughly aligned, and that's good news: we perform as well. But remember, you need to label these concepts, or have some way of getting them (positions, orientation, and so on); removing that requirement is something we would love to do in the future. Before I go on, any questions? You shy? Cool, all right.

[00:55:13] So, I did tell you that we're not going to get the solution to move 37, and I still don't have it, okay? But I'll tell you a little bit about work I'm currently doing that I'm really excited about. We started thinking: will this understanding of move 37 happen within my lifetime? And I thought, oh, maybe not, but I kind of want it to happen. This is what research is all about, right? You start carving out a
you started carving out a [00:55:41] research right you started carving out a space where things are a little [00:55:43] space where things are a little resolvable and you try to attack that [00:55:45] resolvable and you try to attack that problem so this is our attempt to do [00:55:47] problem so this is our attempt to do exactly that to get a little closer to [00:55:50] exactly that to get a little closer to our ultimate goal or my ultimate goal of [00:55:53] our ultimate goal or my ultimate goal of understanding that move 37. [00:55:56] understanding that move 37. so before that how many people here know [00:55:57] so before that how many people here know Alpha Zero from T my yes Alpha zero is a [00:56:02] Alpha Zero from T my yes Alpha zero is a self-trained uh self-trained chess [00:56:05] self-trained uh self-trained chess playing machine that beats that has [00:56:07] playing machine that beats that has higher yellow rating than any other [00:56:09] higher yellow rating than any other humans and beats stockfish which is [00:56:11] humans and beats stockfish which is arguably no existing human can beat [00:56:13] arguably no existing human can beat stock fish so in the previous paper we [00:56:17] stock fish so in the previous paper we try to discover human chess Concepts in [00:56:21] try to discover human chess Concepts in this network so when does concept like [00:56:24] this network so when does concept like material imbalance appear in its Network [00:56:27] material imbalance appear in its Network which layer and when in the training [00:56:30] which layer and when in the training time [00:56:31] time and which we call what when and where [00:56:33] and which we call what when and where plots [00:56:34] plots and we also compare the evolution of [00:56:37] and we also compare the evolution of opening moves between humans and Alpha [00:56:39] opening moves between humans and Alpha zero these are the first couple moves [00:56:42] zero these are the first 
[00:56:42] These are the first couple of moves you make when you play chess, and as you can see there's a pretty huge difference: the left is humans, the right is AlphaZero. It turns out that AlphaZero masters, or supposedly masters, a wide variety of types of openings. Openings can be very aggressive, openings can be very boring; they can target a long-range strategy or a short-range one, very different. So that begs the question: what does AlphaZero know that humans don't? Don't you want to learn what that might be?

[00:57:16] So that's what we're doing right now; we're actually just about to evaluate. The goal of this work is: please teach the world chess champion a new, superhuman chess strategy. And we just got a yes from Magnus Carlsen, who is the world chess champion. He just lost a match, I know, but he's still champion in my mind; actually, he's still champion in two
[00:57:42] categories, actually. So the way we're doing this is that we're going to discover new chess strategies by explicitly forgetting existing chess strategies, which we have a lot of data for. And then we're going to learn a graph, a little more complicated this time, using the relationships between existing concepts, so that we can get a little more of an idea of what a new concept might look like. And my favorite part of this work (I talked about carving out) is that the evaluation is going to be pretty clear. It's not just Magnus coming in and saying, oh, your work is kind of nice, and saying nice things about our work; no, Magnus actually has to solve some puzzles, and we will be able to evaluate whether he solved them or not, so it's a kind of success-or-fail setup.

[00:58:35] I'm extremely excited. This kind of work I can only do because of Lisa, who is a chess champion herself and also a PhD student at Oxford; she has played against Magnus in the past, and against many other chess players in the world, and she's going to be the ultimate pre-superhuman filter for these concepts before they eventually get to Magnus. So I'm super excited about this. I have no results yet, but it's coming up. Yes?

[00:59:07] [Audience question, partially audible: ...generator, because there are already so many puzzles out there, so I'm assuming that there's probably something new... what are the puzzles?]

The puzzles are actually pretty simple. The way we generate concepts is within the embedding space of AlphaZero, and AlphaZero has a really weird architecture: every single latent layer in AlphaZero has the exact same spatial layout as a chessboard.
do it. So because of that, we can actually [00:59:37] identify or generate the board positions [00:59:40] that correspond to that concept, and [00:59:43] because we have MCTS, we can predict what [00:59:47] move it's going to make given that board [00:59:49] position, because at inference time it's [00:59:51] actually deterministic, the whole [00:59:53] AlphaZero thing. So we have a lot [00:59:55] of board positions, and that's all you [00:59:57] need for puzzles: you give a board [00:59:59] position and then ask Magnus to make a [01:00:01] move, we explain the concept, and then [01:00:03] give Magnus more board positions and see [01:00:05] if he can apply that concept that he [01:00:08] just learned, [01:00:12] for example. [01:00:14] Right, but it seems like you're kind of [01:00:17] underneath... [01:00:21] Yeah, so if I were to ask Stockfish [01:00:25] to [01:00:26] solve those puzzles, that would be a [01:00:27] different question, because we're [01:00:29] interested in whether we can teach a human, [01:00:31] not Stockfish. Stockfish might be able to [01:00:33] do it, that's actually an interesting, uh, thing [01:00:36] that we could do
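The puzzle-generation pipeline described in this answer (find board positions whose embeddings align with a concept direction, then read off the move the deterministic policy would play) could be sketched roughly as below. This is a toy illustration, not the actual AlphaZero code: the position table, embeddings, moves, and the function names are all invented stand-ins.

```python
# Toy sketch of concept-based puzzle generation (illustrative assumptions only).
# Each position gets a made-up latent embedding; a "concept" is a direction in
# that space. The stored "best move" stands in for AlphaZero+MCTS, which the
# speaker notes is deterministic at inference time.

def dot(u, v):
    # Plain dot product over two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

# Hypothetical table: position id -> (latent embedding, policy's move).
POSITIONS = {
    "pos_a": ([0.9, 0.1, 0.0], "Qh5"),
    "pos_b": ([0.1, 0.8, 0.2], "Nf3"),
    "pos_c": ([0.8, 0.2, 0.1], "Bxf7"),
    "pos_d": ([0.0, 0.1, 0.9], "O-O"),
}

def concept_puzzles(concept_vec, k=2):
    """Return the k position ids best aligned with the concept direction,
    each paired with the deterministic policy's move (the puzzle answer)."""
    ranked = sorted(POSITIONS,
                    key=lambda p: dot(POSITIONS[p][0], concept_vec),
                    reverse=True)
    return [(p, POSITIONS[p][1]) for p in ranked[:k]]

print(concept_puzzles([1.0, 0.0, 0.0]))  # [('pos_a', 'Qh5'), ('pos_c', 'Bxf7')]
```

The student would then be shown the top positions, told the concept, and asked to reproduce the policy's move on held-out positions, as described above.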
now that I think about it. But [01:00:38] our goal is to just teach one superhuman concept. Like, if I have, for example, 10,000 [01:00:43] superhuman concepts and only three of [01:00:46] them are digestible by Magnus, that's a [01:00:49] win. That would be a big win for this [01:00:52] type of research. [01:00:56] Questions? [01:00:59] All right, yeah, so, to wrap up: small steps towards our hopes and dreams. We [01:01:07] talked about the gap between what [01:01:09] machines know versus what we think [01:01:10] machines know, three ideas for why that might [01:01:14] be true, and the three different, maybe, angles [01:01:16] we can try to attack and answer those [01:01:18] questions from, and bridge that gap. We [01:01:21] talked about studying aliens, these [01:01:24] machines, in an observational study or a controlled [01:01:26] study; there are many other ways to study [01:01:28] a new species, uh, and I'm not an expert, [01:01:31] but anthropology and other humanities [01:01:33] studies would know a lot more [01:01:35] about this. [01:01:36] And maybe, just maybe, we can try to [01:01:40] understand move 37 at some point, [01:01:42] hopefully within my
lifetime, through [01:01:44] this chess, uh, project that I'm very [01:01:47] excited about. Thank you. [01:01:49] [Applause] [01:02:01] You talked about interpretability research [01:02:04] that crosses NLP, vision, and RL. [01:02:07] Um, do you think there's much value in [01:02:09] taking certain interpretability [01:02:10] techniques from one modality into other [01:02:12] modalities? [01:02:17] All right. [01:02:19] So it depends on your goal. I think, like, [01:02:22] think about fairness research, which, uh, [01:02:25] builds on a strong mathematical foundation, [01:02:27] and that's applicable for any [01:02:30] questions around fairness, or hopefully [01:02:32] applicable. But then, if your [01:02:36] goal is to actually solve a fairness [01:02:38] issue at hand for somebody, a real [01:02:41] person in the world, that's a completely [01:02:43] different question; you would have to [01:02:45] customize it for a particular [01:02:46] application. So there are two avenues, and [01:02:48] I think something similar is true for interpretability: [01:02:50] like, the theory work that I talked about, [01:02:52] SHAP and IG, are used across [01:02:56]
domains like vision and text, so that theory paper would [01:02:58] be applicable across domains. Things [01:03:01] like RL and the way that we built that [01:03:03] generative model, you would need to test [01:03:05] a little bit more to make sure that it [01:03:07] works in NLP. Uh, I don't even know how to [01:03:10] think about agents in NLP yet, so it will [01:03:13] need a little bit of tweaking, but both [01:03:14] directions are fruitful. [01:03:20] John has a question. [01:03:23] I saw the recent work in which [01:03:26] some amateur Go players found a very [01:03:29] tricky strategy to trip up, I think it [01:03:31] was, AlphaGo, and that seemed like a [01:03:34] concept that humans know that machines [01:03:36] don't, in that Venn [01:03:38] diagram. Any thoughts about that? Yeah, actually, it's [01:03:40] funny you mention that: Lisa can beat [01:03:44] AlphaZero pretty easily, and it's a [01:03:47] similar idea, because, uh, if you kind [01:03:50] of know what the most unseen, out-of-distribution [01:03:52] moves are, uh, she [01:03:55] can break AlphaZero pretty easily. At [01:03:56] least, I guess that if Lee Sedol had [01:03:59]
known something more about AI, then maybe he [01:04:01] would have tried to confuse AlphaGo. But [01:04:03] the truth is, you know, it takes a lot; [01:04:05] it's a high-stakes game. Like, he said, oh, [01:04:07] he's, like, a famous star worldwide, so [01:04:10] he wouldn't want to make a move that [01:04:12] would be seen as a complete mistake, like [01:04:15] the one that Magnus made a couple of days [01:04:17] ago that got on news feeds everywhere, [01:04:19] that he made this, like, century-wide [01:04:21] mistake, and that probably hurts. [01:04:26] Any other questions? [01:04:33] ...zero, for example. I just like building [01:04:36] machine learning that [01:04:38] plays these games really well. [01:04:40] Um, [01:04:52] well, these works that I've presented are [01:04:54] pretty new, [01:04:56] um, but there has been a bit of [01:04:57] discussion in robotics about applying [01:05:00] these to robotics, and of [01:05:02] course I can't talk about details, but, [01:05:05] um, [01:05:06] uh, with things like [01:05:08] reinforcement learning in the wild, [01:05:10] people worry about, uh, some of the [01:05:11] surprises, right? If you have a test for [01:05:14] it, like if you have a unit test for it,
[01:05:16] you're never going to fail, because [01:05:18] you're going to test before you deploy. I [01:05:21] think the biggest risk for any of these [01:05:22] deployed systems is the surprises that [01:05:26] you didn't expect. [01:05:27] So my work around visualization and [01:05:30] others aims to help you with that. So we [01:05:34] may not know the names of these surprises, [01:05:36] but here's a tool that helps you better [01:05:38] discover those surprises before someone [01:05:41] else does, or someone else gets harmed. [01:05:51] Um, this is kind of an open-ended [01:05:52] question, but I was wondering: we're [01:05:54] talking about a lot of ways in which we [01:05:56] try to kind of visualize or understand [01:05:59] what's going on in the representations [01:06:00] inside the machine, but I was wondering [01:06:02] whether we could turn it around and try [01:06:05] to teach machines to tell us, like, [01:06:07] using our language, what they're doing, [01:06:10] and align their representations [01:06:12] with ours, and then get the [01:06:14] machine to do the translation for us [01:06:16] instead of us going into the
English. [01:06:18] Yeah, great question. It's a really [01:06:21] interesting question, because, um, that's [01:06:22] something that I kind of [01:06:25] tried in my previous work called [01:06:27] Testing with Concept Activation Vectors. [01:06:30] So that was to map human language into [01:06:33] machine space, so that they can only [01:06:35] speak our language, because I understand [01:06:36] my language, so just talk to me in my [01:06:38] language. The challenge is, how would [01:06:41] you do that for something like AlphaZero? [01:06:43] Like, we don't have a vocabulary for [01:06:46] it, like move 37. Then there's going to be [01:06:49] a lot of missing valuable knowledge that [01:06:52] we might not get from the [01:06:54] machine. So I think the approach has to [01:06:56] be both ways: we should leverage as much [01:06:58] as we can, while acknowledging that even [01:07:01] that mapping, trying to map our [01:07:04] language to machines, is not [01:07:06] going to be perfect, because it's a kind [01:07:09] of proxy [01:07:11]
for what we think, like, a penguin is. [01:07:13] There's psychology research [01:07:15] that says everyone thinks very [01:07:16] differently about what a penguin is. Like, [01:07:19] if I show a picture of a penguin, everyone [01:07:21] is thinking of a different penguin right now, [01:07:24] right? Australia has the cutest penguin, [01:07:26] the fairy penguin; I'm thinking of that, [01:07:27] right? I don't know how many people are [01:07:30] thinking that. So given that we're [01:07:31] so different, the machine's going to think [01:07:33] something else, so how do you bridge that [01:07:36] gap? Extend that to 100 concepts, and [01:07:38] composing those concepts, it's going to get [01:07:40] out of hand very soon. So there are pros [01:07:43] and cons. I'm into both of them. I think for [01:07:45] some applications, [01:07:48] exclusively just using human [01:07:50] concepts is still very helpful; it gets [01:07:55] you, uh, halfway. But my ambition is that [01:07:56] we shouldn't stop there. We should [01:07:59] benefit from them by having [01:08:01] them teach us new things that we didn't know before. [01:08:19]
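The concept-activation-vector idea mentioned in this answer (from the Testing with Concept Activation Vectors work) can be illustrated with a toy sketch. Note the simplifications: a difference-of-means vector stands in for the linear classifier the real method trains, and all activations and gradients below are made-up numbers rather than a real network's.

```python
# Toy TCAV-style sketch (illustrative assumptions only): a human concept is
# represented as a direction in a model's activation space, learned from
# examples of the concept versus random examples.

def mean(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_activation_vector(concept_acts, random_acts):
    """Difference-of-means stand-in for the linear-classifier normal that
    separates concept activations from random activations."""
    mc, mr = mean(concept_acts), mean(random_acts)
    return [c - r for c, r in zip(mc, mr)]

def tcav_score(cav, gradients):
    """Fraction of inputs whose gradient aligns positively with the concept
    direction, i.e. how sensitive the model's output is to the concept."""
    dots = [sum(c * g for c, g in zip(cav, grad)) for grad in gradients]
    return sum(d > 0 for d in dots) / len(dots)

concept = [[1.0, 0.2], [0.9, 0.1]]   # activations on concept examples (toy)
random_ = [[0.1, 0.3], [0.2, 0.2]]   # activations on random examples (toy)
cav = concept_activation_vector(concept, random_)
print(tcav_score(cav, [[1.0, 0.0], [0.5, 0.1], [-1.0, 0.0], [0.2, -0.3]]))  # 0.75
```

The challenge raised above is exactly that this direction only exists for concepts we can name and supply examples for, which is why it cannot capture something like move 37.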
[01:08:23] But, like, um, I don't know, but, like, trying to locate specific strategies in [01:08:24] the embedding space, [01:08:27] what are the alternatives? I guess I [01:08:30] don't know the alternatives, just because [01:08:32] I feel like it's the wrong thing. [01:08:35] That's possible. So, like, maybe it's some [01:08:37] transformed space of the embedding space [01:08:40] in AlphaZero, maybe it's a function, [01:08:42] uh, applied to that embedding space, so [01:08:44] thinking about it as a raw vector [01:08:47] is a dead end. Could be, uh, we'll see [01:08:50] how this chess project goes in a couple of [01:08:52] months; I might rethink my [01:08:55] strategy. But interesting thought. [01:08:57] Yeah, so I'm a psychology major, and I do [01:09:00] realize that a lot of the stuff we're [01:09:02] trying to do here is, like, at least, figuring out how [01:09:04] our brains work, [01:09:07] and so I was wondering: would there be [01:09:11] um, stuff from this that's applicable [01:09:14] to neural networks, and, on the contrary, [01:09:16] could interpretability [01:09:18] and the study of neural networks [01:09:19] help us
understand, uh, stuff about [01:09:21] our own brain. Yeah, I talked to Geoffrey [01:09:24] Hinton; you know, he would really like [01:09:26] this. So I believe, and you probably [01:09:28] know about this history, I think that's [01:09:29] how it all started, right? The whole [01:09:32] neural network idea was to understand the human [01:09:34] brain. [01:09:35] Um, [01:09:37] so that's the answer [01:09:39] to your question: interesting. However, in [01:09:40] my view, there are some biases that we [01:09:44] have in neuroscience because of [01:09:47] the limitations of tools, like physical [01:09:48] tools, and the availability of humans that [01:09:51] you can poke into. I think that influences [01:09:53] interpretability research, and I'll try [01:09:55] to give you an example of what I mean. So, [01:09:57] you know the cat experiment, the [01:09:59] horizontal-line and vertical-line neurons [01:10:00] in the cat brain: they put the probe in and [01:10:03] figured out this one neuron detects [01:10:04] vertical lines, and you can, like, validate it. [01:10:06] It's really cool; if you look at the [01:10:08] video, the video is
still online. [01:10:10] Yeah, what is it? [01:10:12] Yes, yes, yes. Uh, so why did they do [01:10:15] that? Well, because you had one cat, a [01:10:18] poor cat, and, uh, we can only [01:10:22] probe a few neurons at a time, right? [01:10:25] So that meant that a lot of [01:10:27] interpretability research actually [01:10:28] looked at, or was very focused on, neuron-wise [01:10:31] representations, like, this one neuron [01:10:34] must be very special. I actually think [01:10:35] that's not true; that was limited by our [01:10:38] physical ability to [01:10:40] probe organisms. But in a neural network you [01:10:42] don't have to do that: like, you can apply [01:10:43] functions to embeddings, you can change [01:10:45] the whole embedding to something else, [01:10:47] override it. So that kind of, uh, thinking is actually [01:10:50] an, uh, obstacle for us rather [01:10:54] than helping. [01:10:58] Yeah. [01:11:00] Okay, maybe we should call it there. [01:11:03] Um, so for Thursday, you're not [01:11:05] having, uh, lecture on Thursday, [01:11:08] um, there'll be TAs and me here, so if you [01:11:11] have any, you know, last
minute panics on [01:11:14] your project. So I think we might have [01:11:16] some great insight to help you; we [01:11:19] probably won't, actually. [01:11:21] Um, [01:11:30] final lecture of CS224N today. [01:11:34] [Applause]
================================================================================ LECTURE 021 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Python Tutorial, Manasi Sharma
Source: https://www.youtube.com/watch?v=8j4wpU98Q74
---
Transcript
[00:00:05] All right, hi everyone. [00:00:07] Um, welcome to the 224N Python review [00:00:09] session. [00:00:11] Um, the goal of the session really will [00:00:12] be to sort of give you the basics [00:00:15] of Python, and NumPy in particular, [00:00:18] that you'll be using a lot in your [00:00:19] second homework, [00:00:20] um, and the homework will come after that [00:00:22] as well. [00:00:23] Um, we're sort of pitching this tutorial [00:00:25] at the background of anyone who hasn't [00:00:28] touched programming languages to some [00:00:30] extent, [00:00:31] um, but also for people who have, we'll be [00:00:33] sort of going through a lot of that [00:00:34] material very quickly, and we'll be [00:00:35] progressing to NumPy as [00:00:37]
well. Um, and as I mentioned, first and foremost, [00:00:38] the session is really meant for the [00:00:40] people who are here in person, so if [00:00:41] you'd like me to slow down or speed up at [00:00:43] any point, or need time for clarifications, [00:00:46] feel free to ask us; it's really [00:00:47] meant for you first, um, here, and [00:00:50] I really would like it to be sort of an [00:00:51] interactive session as well. [00:00:53] All right, so these are the topics [00:00:55] we'll be covering today: [00:00:57] um, going through, first of all, why Python [00:00:58] as a language, why we have chosen it for [00:01:00] sort of this course, and in general why [00:01:02] people prefer it, to some extent, [00:01:04] for machine learning and natural [00:01:05] language processing; [00:01:06] um, some basics of the language itself; [00:01:08] common data structures; and then getting [00:01:10] to sort of the meat of it through NumPy, [00:01:13] which, as I mentioned, you'll be [00:01:14] using extensively in your homeworks [00:01:15] going forward; and then some practical [00:01:17] tips about how to use [00:01:18] um, things in Python
[00:01:21] All right, so first thing: why Python? Um, [00:01:23] so a lot of you who might have, um, been [00:01:26] first introduced to programming might [00:01:28] have done Java before; a lot of people [00:01:30] use MATLAB in [00:01:31] other fields as well. [00:01:34] Um, so why Python? Python is generally [00:01:36] used, um, for one, because it's a very high-level [00:01:38] language; um, it can look very [00:01:40] English-like, and so it's really easy to [00:01:42] work with, especially for people when [00:01:43] they're getting started out. It has a lot of [00:01:45] scientific computational functionality [00:01:47] as well, similar to MATLAB, so when we [00:01:49] talk about NumPy you'll see that it has [00:01:50] a lot of frameworks for very quick [00:01:52] and efficient operations involving math [00:01:54] or matrices, and that's very useful [00:01:56] in applications such as deep learning. [00:01:59] And for deep learning in particular, a [00:02:01] lot of frameworks that people use, [00:02:02] for example PyTorch and [00:02:04] TensorFlow, interface directly with [00:02:06] Python,
and so, for those main [00:02:07] reasons, people generally tend to use [00:02:09] Python within deep learning. [00:02:12] Okay, so the setup information is in the [00:02:15] slides if you'd like to look at them [00:02:16] offline. [00:02:17] Um, I will be sort of jumping over that [00:02:18] for now, because I want to sort of get to [00:02:20] the introduction to the language itself, [00:02:22] and if we have time, come back to sort [00:02:23] of the setup information. A lot of it's [00:02:25] pretty direct; you can walk through it, um, [00:02:27] it gives you steps for sort of how to [00:02:29] install packages, [00:02:31] um, what a conda environment is, for [00:02:33] example, and gets you set up with your [00:02:35] first working Python environment, so you [00:02:36] can sort of run simple and basic [00:02:38] commands to get used to the language. But [00:02:40] for now I'm going to be skipping over [00:02:41] this and coming back to it if we have [00:02:42] time. [00:02:44] All right, language basics. So, [00:02:47] um, in Python you have variables, and [00:02:50] these variables can take on multiple [00:02:51] values. The assignment operation, there's
[00:02:53] an equal sign, will allow you to assign [00:02:56] a particular value to a variable. A [00:02:57] nice thing with Python is you don't have [00:02:59] to declare the type of the variable [00:03:01] to begin with and then only assign [00:03:03] values of that type. So, [00:03:05] for example, in certain languages we [00:03:07] first say that this variable x is only [00:03:10] going to be of type int, and any value [00:03:12] aside from that assigned to it will [00:03:13] throw an error. Python's pretty flexible, [00:03:15] so if I want to, I can reassign it: I can [00:03:17] start with x equal to 10, and then [00:03:19] later on, like five lines later, I can say [00:03:21] x is equal to "hi" as a string, and there [00:03:24] would be no issue. [00:03:25] Um, you can do simple mathematical [00:03:27] operations, such as with the plus and division [00:03:29] signs; you can do exponentiation, which is [00:03:33] raising one value to another value, so x [00:03:36] to the power of y, for example, using the [00:03:37] double asterisk. [00:03:39] Um, you can do type casting for [00:03:41]
float division: so if you want to ensure [00:03:43] your values are being divided resulting [00:03:45] in a float value, and not just dividing [00:03:46] two integers, you can cast to different [00:03:48] types, like float. If you want something [00:03:50] to be explicitly an int, you can also [00:03:51] just put int instead of float, [00:03:53] with brackets around the result, and [00:03:56] that'll give you an integer value. And [00:03:58] then you can also do type casting to, for [00:04:01] example, convert from integers to strings. [00:04:03] So in this case, if I wanted to, instead [00:04:05] of doing 10 plus 3 as a mathematical [00:04:08] operation, just write out "10 [00:04:10] plus 3", then I can convert the x and y [00:04:13] values, for example, to strings and then [00:04:15] add the plus sign as a character as [00:04:19] well, to create a string. And a lot of [00:04:20] these common operations you can look up [00:04:22] online as well; people have lists of [00:04:23] them, and you can just see how they're sort of [00:04:25] done in Python. [00:04:27] All right. [00:04:28] Um, some other quick [00:04:30]
[00:04:28] Some other quick things. So Boolean values, True and False: they're always used with capital letters, while in some other languages they might be lowercase, so that's just one thing to know. Python also doesn't have a null value; the equivalent of a null value is None. So sometimes, when you want to say that something doesn't have a value, or you want to return nothing, saying "I'm not really doing anything here", or you want to do checks, for example in if statements, you can assign it to None. So None sort of functions as a null equivalent: you're not really returning anything, it doesn't have a value, and it's not the same as zero.
[00:05:05] Another nice thing about Python is lists, which are mutable lists of objects; we'll come to that a little bit later. That means that you can change them, and they can be of any type, so you can have a mixture of integers, None values, strings, etc. And yeah, functions can return the None value as well.
[00:05:26] Another quick thing: instead of using the double ampersand, &&, as you might in some other languages, with Python, as I mentioned earlier, it's very English-like, so you can actually just write out: if x is equal to three "and", in English, y is equal to four, then return True, or something. It's quite nice that way: you can use and, or, and not. And then the comparison operators, equals-equals and not-equals, will check for equality and inequality. These are pretty standard, I feel, across many languages, and you can use them in Python as well. And remember, the equals-equals sign is different from the assignment operator: this one checks for equality, that one is just assigning a value. So, a single equal sign versus two of them.
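A minimal sketch of the Boolean, None, and comparison points above; the values are just for illustration:

```python
# Booleans are capitalized in Python.
flag = True

# None is Python's null equivalent; it is not the same as zero.
result = None
print(result is None)   # True
print(result == 0)      # False

# Logical operators are written out in English: and, or, not.
x, y = 3, 4
if x == 3 and y == 4:   # == checks equality; a single = would be assignment
    print("both match")
print(x != y)           # True: the not-equal comparison
```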
[00:06:10] All right. And then also, in Python you don't use braces around blocks; you basically use spaces or tabs, so indents of either two or four, to break up what is contained in a function, or contained within an if statement, a for statement, or any loops, for example. The main thing is that you can choose whether to do two or four; you just have to be consistent throughout your entire code base, otherwise it will throw an error.
[00:06:37] Now let's go to some common data structures, and for this we'll transition to the Colab. This will sort of show you in real time; this is, by the way, a Colab. A Colab is basically a Jupyter notebook, for those of you who are familiar with those, that you can use, hosted on Google's servers. The really nice thing about Jupyter notebooks is that you don't have to run an entire file all together; you can run it step by step, in what are called cells. So if you want to see an intermediate output, you can see that pretty easily that way, and you can also write, for example, a lot of descriptions pertaining to the cells, which is really nice to have as well. So a lot of people tend to use these when they're starting off a project and want to debug things, and Colab allows you to use these Jupyter-notebook-type applications, hosted on their servers, for free basically, so anyone can create one of these and run their code.
[00:07:32] All right, so lists are mutable arrays. Mutable means that you can change them: once you declare them, you can add to them, you can delete from them, and they're optimized for that purpose, so they expect to be changed very often. We'll come to what are called numpy arrays later, and those tend to be pretty much fixed when you create a
new one; when you change one, you'd basically have to create a new array, which will have the additional information. Lists, by contrast, are highly optimized for changing things. So if you know, for example, that you're in a loop and you're adding different elements to, let's say, some bigger entity, you'd want to use something like a list, because you're going to be changing it very often.
[00:08:06] So let's see how they work. We start off with a names list with Zach and Jay. You can index into the list, which means that you can pull out elements of the list depending on what's called the index, which is what place that value is at within the list. So zero refers to the first element: Python is what's called zero-indexed, which means it starts with zero and then goes to one. So here, index zero will be Zach.
[00:08:34] And then let's say I want to append something to the end: to add something to the end of the list, the term is append, not add. And so if I append, I now have the original list itself with the added last element. And what would currently be the length of this? It would be three, because you have three elements, and you can quickly get that by using the len function: not "length", just the three letters, len.
[00:09:01] It's also really nice because Python has overloaded the plus operation to be able to concatenate lists. So here I have a separate list, and all you need for a list definition is just brackets, so this is a separate list altogether, even though I haven't saved it in a variable: just Abby and Kevin. And I can just do a plus-equals, which means that names is equal to names plus the Abby-and-Kevin list.
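The list operations walked through above can be sketched like this; the names mirror the lecture's examples, though the exact Colab cells may differ:

```python
# Start with a list of two names.
names = ["Zach", "Jay"]
print(names[0])   # "Zach": Python is zero-indexed

# append adds a single element to the end, in place.
names.append("Richard")
print(len(names))   # 3: the function is len, not length

# + is overloaded to concatenate lists, so += extends names with another list.
names += ["Abby", "Kevin"]
print(names)   # ['Zach', 'Jay', 'Richard', 'Abby', 'Kevin']
```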
[00:09:24] And this should output the full list. You can create lists by just putting the plain brackets, or from an existing list, and then, as I mentioned earlier, your list can have a variety of types within it. So here, this list contains an integer value; a list value, so you can have a list of lists, with as many sub-lists as you like; a float value; and a None value. And this is completely valid within Python.
[00:09:48] Slicing refers to how you can access only parts of the list. So if, for example, in this numbers list I only want 0, 1, 2, slicing is a way that you can extract only those parts. The way slicing works is that the first index is included and the last index is excluded. So here I start with 0, 1, 2, 3: 3 is not included, and so 0, 1, 2 will be printed out.
[00:10:15] There are also shorthands. If you know that you're going to be starting with the first element of the list, if you know I want 0, 1, 2 and it starts with zero, then you don't even need to include the first index: you can just leave that out and include only the last index, which will be excluded. So that would be blank, colon, 3. And same deal with the end: if you know that you want to take everything from, let's say, five and six till the end of the list, you can put in whatever start you'd like, so 0, 1, 2, 3, 4, 5 till the end, and leave the end blank.
[00:10:49] Fun fact: when you take just the colon on its own, it'll take everything in the list, but it'll also create a duplicate in memory. That's a very useful thing to know.
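The slicing rules just described, sketched on a small numbers list (assumed here to be 0 through 6, matching the spoken examples):

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

# The start index is included and the end index is excluded.
print(numbers[0:3])   # [0, 1, 2]: index 3 itself is left out

# Shorthands: omit the start when it is 0, omit the end to run to the end.
print(numbers[:3])    # [0, 1, 2]
print(numbers[5:])    # [5, 6]

# A bare colon takes everything, but as a new copy in memory.
duplicate = numbers[:]
duplicate[0] = 99
print(numbers[0])     # still 0: the original list is unaffected
```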
[00:11:02] Because sometimes, when you pass lists in Python (which is out of scope of this tutorial), you only pass the reference to the list, so if you change the list, the original gets changed. The bare colon creates an entirely separate copy in memory of the exact same list, so if you make any changes to the copy, it won't affect your original list. So this is a pretty neat way to do that.
[00:11:20] And then another fun thing that Python has, which is pretty unique, is that you can index negatively. Negative indexing means you index from the back of the list: -1 refers to the last element of the list, and -3 will refer to the third-last element. So what [-1] will give you here is six, and what [-3:] will give you is everything from the third-last element: minus three, minus two, minus one, till the end. And then this one seems kind of confusing, right: 3 to -2. What this will do is start at index three (count 0, 1, 2, 3) and then leave off the last two elements, minus one and minus two, because the end is excluded with slicing, so you'd only get three and four. That's what this is.
[00:12:04] Okay, that's about lists. Tuples are immutable arrays: once you declare their values, they cannot be changed. So, remember we started with the list of Zach and Jay; with tuples you also start with Zach and Jay, and you can still access them: I can still print out names[0], the same as I did with lists. But if I try to change it, in this case it'll throw an error: tuples, once you've instantiated them, cannot be changed. And to create an empty tuple, you can either use the tuple constructor, or oftentimes you can just use the parentheses by themselves, so you can just write, for example, as was done here, empty parentheses to instantiate one.
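Negative indexing and tuple immutability, sketched with the same assumed numbers list and names:

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

# Negative indices count from the back of the list.
print(numbers[-1])    # 6: the last element
print(numbers[-3:])   # [4, 5, 6]: from the third-last element to the end
print(numbers[3:-2])  # [3, 4]: start at index 3, drop the last two

# Tuples look like lists but are immutable.
names = ("Zach", "Jay")
print(names[0])       # indexing works the same as with lists
try:
    names[0] = "Richard"   # any attempt to change a tuple raises an error
except TypeError:
    print("tuples cannot be changed")

empty = ()   # an empty tuple, from the bare parentheses
```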
[00:12:45] All right. And yeah, this one we'll come to a little bit later with shapes, but you can also have a tuple of a single value, and all you have to do there is just put the value and put a comma after it. That just shows that you have a tuple, which is like an immutable array, so you can't change it: it's a list, but of only one item. And that's here.
[00:13:04] Okay, I'll quickly move to dictionaries. For those of you who might be familiar with other languages, this is the equivalent of a hash map or a hash table. What this is useful for, essentially, is mapping one value to another in a really, really quick way. So if I want to map, for example, a string to an index, which you will happen to do a lot in your homeworks, this is a really useful way to do that. What it does is let you instantiate this dictionary, and it says that the key is going to correspond to the string value, whatever it is. And so any time I want to retrieve the string value, I just use this dictionary and index by the key, which is what I do here, and then it outputs the corresponding value, and it does that really, really quickly.
[00:13:47] And yeah, so it's really useful and very commonly used, especially when, for example, you have a list of strings or a list of items and you want to have a corresponding index for them. Because, as you'll see in NLP, oftentimes you're working with indices, and numbers in particular, so it's a really great way to move from string formats to just numerical index values. There are some other things you can do with dictionaries: you can check whether certain elements are in there.
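A dictionary sketch along the lines described; the phone-number strings are made up for illustration, and `in` and `del` are Python's standard membership check and deletion operations:

```python
# A dictionary maps keys to values, like a hash map or hash table.
# The phone-number strings here are made up for illustration.
phonebook = {"Zach": "12-37", "Jay": "34-23"}

# Retrieval by key is very fast.
print(phonebook["Jay"])   # "34-23"

# Membership checks test the keys; indexing a missing key raises a KeyError.
print("Monty" in phonebook)   # False
print("Zach" in phonebook)    # True

# del removes an entry.
del phonebook["Zach"]
print("Zach" in phonebook)    # False
```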
[00:14:16] So if you, for example, try to index the phone book with Monty, it'll throw an error, because there's no key that says Monty in that phone book dictionary. And so sometimes you might want to do checks before you extract a value. So this will just check membership: for example, if I print whether Monty is in the phone book, it should say False; or, for example, here, Kevin in the phone book should say False; while something that's actually in that dictionary, Zach, will be True.
[00:14:39] Okay, and then if you'd like to delete an entry from the dictionary, you can just do that using the del command.
[00:14:47] All right, let's move to loops, quickly. So loops are a really great way to optimize doing the same kind of operation repeatedly. They're also a great way to sequentially go over those list-type or array-type objects we were talking about earlier: you know, you have a list of names, right, so how do you access all
of them? Loops are a really great way to do that.
[00:15:12] In Python they've abstracted away a lot of the confusing parts that there might be in other languages. You can, for example, first index on numbers: what you do is you have a range function that you call. So here you say range, and the range of the last number you'd want; what this range function will return is 0, 1, 2, 3, 4, and that's what will be stored in this i value, and here it's just printing out that i value. So if I wanted, for example, to loop over a list of size 10, I just have to do "for i in range(10)" and then index the corresponding part of the list.
[00:15:46] You technically don't even have to do that, because in Python you can just directly get the elements of the list. So here I have a list of names, where I have Zach, Jay, and Richard; instead of first taking the length of the list and then doing this range operation, I can just directly say "for name in names" and then print out the name, and it will just directly get each element in the list.
[00:16:07] But sometimes you might want both: you might want both this element, Zach, as well as its position in the list, and for that you can actually use this really helpful function called enumerate. Enumerate will basically pair those two values, and it'll give you both the value, which is name here, for example, and its corresponding index within the list, both together. So that's really convenient, versus, for example, having to do the slightly more complicated range operation, where you first take the range and then index into the list.
[00:16:38] How do you iterate over a dictionary? If you want to iterate over what are called the keys, all of those first items that you put into the dictionary, you can just iterate the same way you would over a list: you just say "for name in phonebook", for example, and you can output the keys. If you want to iterate over what is stored under the keys, which are called the values, you'd have to do the dictionary's dot-values; and if you want both, you use the dot-items function, and that will print out both of these.
[00:17:11] All right. So this has covered the overarching, most commonly used structures, lists, dictionaries, and then loops, and how to efficiently use them within your code. We'll quickly be moving to the meat of what is really, really strong about Python, and what you'll be using a lot for your coming homework, essentially Homework 2, which is numpy.
[00:17:34] Okay, so for numpy also I'm going to be going to the Colab; we just quickly wanted to mention what numpy is.
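The dictionary-iteration patterns just described can be sketched as follows; the phone-book contents are made up for illustration:

```python
phonebook = {"Zach": "12-37", "Jay": "34-23"}   # made-up numbers

# Iterating over a dictionary directly walks its keys.
keys = [name for name in phonebook]

# .values() walks the stored values; .items() gives (key, value) pairs.
values = list(phonebook.values())
items = list(phonebook.items())
print(keys)     # ['Zach', 'Jay']
print(values)   # ['12-37', '34-23']
print(items)    # [('Zach', '12-37'), ('Jay', '34-23')]
```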
so numpy is basically an optimized Library [00:17:44] an optimized Library um for mathematical operations you know [00:17:46] um for mathematical operations you know people tend to like math lab because [00:17:47] people tend to like math lab because it's very very useful for these [00:17:48] it's very very useful for these mathematical operations which people use [00:17:50] mathematical operations which people use in their research [00:17:51] in their research um pythons sort of solution to that is [00:17:54] um pythons sort of solution to that is to have a separate Library entirely [00:17:55] to have a separate Library entirely where they make use of subroutines which [00:17:59] where they make use of subroutines which are sort of like sub languages sorry sub [00:18:01] are sort of like sub languages sorry sub scripts that are written in a different [00:18:03] scripts that are written in a different language called C or C plus plus that [00:18:05] language called C or C plus plus that are highly optimized for efficiency so [00:18:08] are highly optimized for efficiency so the reason C and C plus plus are much [00:18:10] the reason C and C plus plus are much faster than python is because they're [00:18:12] faster than python is because they're closer to what's called machine language [00:18:13] closer to what's called machine language which is what the computer will read I [00:18:15] which is what the computer will read I mentioned earlier one of the nice things [00:18:16] mentioned earlier one of the nice things about python is it's kind of high level [00:18:18] about python is it's kind of high level it looks like English right to some [00:18:19] it looks like English right to some extent you know we say literally like is [00:18:21] extent you know we say literally like is you know if x is equal to one or X is [00:18:23] you know if x is equal to one or X is equal to two right but um that also [00:18:26] equal to two right but um that also means that there's a 
[00:18:27] But that also means that there's a lot more translation required on the computer's part before it understands what you mean. That's useful when we're writing code that we want to understand, but it's a little less useful when you're running a lot of operations on a lot of data. So the real benefit of something like NumPy is that, if you have your memory and your data in a particular format, it'll call these subroutines in a different language and make them very, very fast. [00:18:55] Almost everyone in NLP is very familiar with this, because you'll be running a lot of operations on, for example, co-occurrence matrices, which are really big, and it's very useful to have them optimized for time. So that's really the benefit of using NumPy.
[00:19:08] And NumPy is basically built for all these math, matrix, and vector calculations. A NumPy array is different from a list, although you can easily translate between a list and a NumPy array. NumPy arrays are, as I mentioned, specifically designed to be used in these subroutines, so they have a specific format and are instantiated differently. You can translate between them and your standard lists easily, but note that you can only do NumPy operations on NumPy arrays; you can't do NumPy operations on lists directly. You'd first have to convert them, which is really simple, you just use the `numpy.array` function. [00:19:42] Okay, so for NumPy we're going to go back to the Colab. As I mentioned earlier, the real strength of NumPy is that it supports these large multi-dimensional arrays and matrices, with very optimized high-level mathematical functions.
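A minimal sketch of moving between lists and arrays as just described (`tolist` is the standard way back to a plain list):

```python
import numpy as np

nums = [1, 2, 3]          # a plain Python list
arr = np.array(nums)      # convert to a NumPy array
back = arr.tolist()       # and back to a list

# NumPy operations work on the array, not on the original list:
doubled = arr * 2         # element-wise: array([2, 4, 6])
# nums * 2 would instead repeat the list: [1, 2, 3, 1, 2, 3]
```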
[00:19:58] And just to step back for a quick second, what is a matrix? Matrices are basically rectangular structures of numbers that you can treat with specific rules for operations between different kinds of objects. So if you have a lot of data, instead of potentially multiplying things individually, you can store them in this rectangular format, and you have specific rules about how this matrix interacts with a different one. By doing that, which is matrix multiplication, or matrix math, you can do a wide variety of mathematical operations. [00:20:32] A vector, conventionally (none of these are hard and fast rules), is a matrix in one dimension, so it's usually a row vector or a column vector, which
usually just means that it's a list of values in only one dimension. [00:20:48] So, for example, here, when I come down to `x = numpy.array([1, 2, 3])`, that's a list in only one dimension, versus, for example, `z` down here, which is what's called a two-dimensional array, because you have rows, for example 6, 7 and then 8, 9, whereas in the first one you only have three values in one dimension. That's the conventional difference between the two. [00:21:16] Another convention is that matrices generally refer to two-dimensional objects; so `z`, as I mentioned, is two-dimensional. You might have heard the word tensor as well; tensors, by convention, are usually higher-dimensional objects, so instead of having two dimensions, you know, (2, 2), you can
have n dimensions: you can have (2, 2, 2, 2, 2) for five or six dimensions, and those are perfectly valid to do mathematical operations on; they're often colloquially called tensors. [00:21:46] In addition, and this will be covered in the next tutorial, on PyTorch, those larger tensors are also optimized for efficiency on GPUs, so there they're called tensors in a more concrete way, because you're using these tensors with PyTorch and other packages to directly do those quicker GPU operations for deep learning. So that's a quick terminology difference between the three. [00:22:12] Okay, so now let's start off with some quick representations of how these matrices and vectors are represented in NumPy. This goes back to your question about the difference between a shape of (3,) versus a shape of (1, 3).
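The vector/matrix/tensor terminology above can be illustrated like this; the specific values are arbitrary:

```python
import numpy as np

vector = np.array([1, 2, 3])             # 1-D: shape (3,)
matrix = np.array([[6, 7], [8, 9]])      # 2-D: shape (2, 2), like z in the lecture
tensor = np.zeros((2, 2, 2, 2, 2))       # 5-D: colloquially, a "tensor"

# Each is just an ndarray; only the number of dimensions differs.
print(vector.ndim, matrix.ndim, tensor.ndim)   # 1 2 5
```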
[00:22:25] So (3,) in NumPy arrays usually just means that you have one list, like [1, 2, 3], three values, versus, if you add another list on top of that, (1, 3) essentially refers to the fact that there's a list of lists. Any time you have two dimensions, it always means that there's a list of lists, with each inner list being, for example, a row. So here, (1, 3) means that there's one row and then three columns: it's saying there's one row of [3, 4, 5], essentially, and then each of those values is a separate column. [00:23:01] You can easily reshape between them, so these are basically the same values, but from NumPy's perspective, you'll see a little bit later that for operations such as broadcasting you need to have it, for
example, sometimes in this (1, 3) format or the (3, 1) format. [00:23:17] And, as I said, (3,) just represents three numbers; (1, 3) means one row of three elements; (3, 1) means you'll essentially have a separate array in each column, so you'll see boxes around each of them. There's an example that comes a little bit later in this Colab which will make it a little clearer. [00:23:36] So here, if you look at the difference between `x` and `y`: one of them has only one bracket, which just says it's one list, only one list of 1, 2, 3; the second one has two brackets, which says it's a list with only one list in it. Whether it's a list of a list is really the main difference between these two representations. [00:23:55] So I could have, let's say, a separate one; I'm going to call this `a`, and I just do this.
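As a sketch of the three shapes being contrasted (the values are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])        # one bracket: shape (3,), just three numbers
y = np.array([[1, 2, 3]])      # a list of a list: shape (1, 3), one row
z = np.array([[1], [2], [3]])  # shape (3, 1): one value per row, the "boxes"

print(x.shape, y.shape, z.shape)   # (3,) (1, 3) (3, 1)
```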
[00:24:03] So it's the same sort of elements, but this will be (1, 3), because it's showing that there's one outer list, which gives the rows, and then one inner list holding each of those values. [00:24:16] The benefit will come with what's coming a little bit later, which is broadcasting: it essentially helps you determine which dimensions you want to match against. Sometimes you'd want to have a (1, 3), like [1, 2, 3], applied only to rows in some other matrix (we'll come to that a little bit later), but sometimes you might want it applied only to columns. [00:24:38] So if I have a separate matrix of, for example, all zeros, and I want the resulting matrix to be, for example, 1, 2, 3 repeated along the rows, let me actually draw this out; it might be
easier. [00:24:50] So let's say I have a matrix of zeros, and I want to produce either a matrix where [1, 2, 3] repeats along each row, versus one where [1, 2, 3] repeats down each column. The difference in how to generate these two will be a difference in shape, in how you represent their shape: it's the same 1, 2, 3, but the resulting array you're generating by repeating the 1, 2, 3 values requires a different shape. We'll come to that a little bit later, because this process of generating these arrays is called broadcasting, but that's the real benefit of understanding the shapes: the 1, 2, 3 values are the same; it's just how they're used with regard to other arrays.
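The two repeated-1-2-3 matrices described above come out of broadcasting like this; the `reshape` calls are one way (among several) to get the two shapes:

```python
import numpy as np

zeros = np.zeros((3, 3))
row = np.array([1, 2, 3]).reshape(1, 3)   # shape (1, 3)
col = np.array([1, 2, 3]).reshape(3, 1)   # shape (3, 1)

# Same 1, 2, 3 values; the shape decides how they repeat:
by_rows = zeros + row   # each row becomes [1, 2, 3]
by_cols = zeros + col   # each column becomes [1, 2, 3]
```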
[00:25:41] All right, so, yeah, vectors, as I talked about earlier, can usually be represented with n-by-1 or 1-by-n dimensions, and they can result in this different behavior, kind of like what I talked about. Matrices are usually in two dimensions, represented as m by n. These are just two examples: if, for example, I generate, let's say, an `arange` and also reshape it... so I start with, for example, this array, which is a list of ten (oh, sorry, it's important to run the import quickly). [00:26:04] So I start off with this array `a`, which is basically a one-dimensional list of ten values. I can reshape it into a five-by-two matrix; you just have to make sure that your dimensions match, which means that you can multiply them together and get the original size. So if I start off with the array of ten, I can make a two-by-five matrix, I can make a five-by-two matrix, I can make a ten-by-one or a one-by-ten; I can't make it, for example, three by five, because
it wouldn't fit into the original size. For that, this operation called `reshape` is really useful. [00:26:34] You might be wondering why there are two parentheses: the way that `reshape` works is that it takes in a tuple. Remember what I was talking about earlier with tuples: they're immutable objects, and they're defined by parentheses. So the outer parentheses represent what you're inputting to the function, and what you're inputting is a tuple, so it uses a second set of parentheses.
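A sketch of the reshape behavior just described, using an `arange` of ten values:

```python
import numpy as np

a = np.arange(10)        # ten values: 0 through 9, shape (10,)
b = a.reshape((5, 2))    # valid: 5 * 2 == 10; note the tuple argument
c = a.reshape((2, 5))    # valid: 2 * 5 == 10
# a.reshape((3, 5)) would raise ValueError: 3 * 5 != 10,
# so it wouldn't fit into the original size.
```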
[00:26:52] So now let's go to some array operations. I started off with this array `x`. When you apply simple operations, for example a max operation, sometimes you might want the max of the entire array. So if I take the max of the entire array, what's the max value of the entire thing? Six, right. So if I just do `np.max` of `x`, it'll return one value, six. [00:27:16] But let's say I want the max of every row: in each of these rows, say I want the max of each, so I want 2, then 4, then 6. How do you do that? Well, NumPy usually has, in most of its functions, an `axis` variable, and what the axis variable does is tell it which of these dimensions you want to take the max over. [00:27:37] The way to think about it, and this is going to be a little bit tricky, but the way people describe it, is that the axis is what you want to apply your function over, what you want to reduce over. What that means is: if I print out the shape of the original array, it's three by two. I want to apply axis 1, or, remembering that NumPy is zero-indexed, the axes are 0 and 1; so I want to apply the max
over the second dimension. [00:28:04] The second dimension means that, for each of these... you know that the row dimension is the first dimension, so it's not along the rows; I'm going to be comparing columns: compare this entire column to this entire column. [00:28:19] So just remember, for axes, axis 0 usually refers to the row axis and axis 1 refers to the column axis. If you don't even want to remember that, you can just remember, from the original shape, which dimension it's referring to, and that's the dimension you want to compare over, or reduce over. [00:28:35] It can be a little bit hard to wrap your head around; usually the best way to get comfortable is just to play with a bunch of operations like min and max. But just remember: the axis is what you want to compare over, not the resulting thing. So axis 1 here means the column axis.
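The axis behavior described above, on the same three-by-two example:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

print(np.max(x))           # 6: max of the whole array
print(np.max(x, axis=1))   # [2 4 6]: reduce over columns, one max per row
print(np.max(x, axis=0))   # [5 6]: reduce over rows, one max per column
```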
column I [00:28:51] thing so axis one means here column I want to compare between the columns I [00:28:52] want to compare between the columns I want to get for example comparing one to [00:28:54] want to get for example comparing one to two three to four five to six [00:28:57] two three to four five to six does that make sense [00:29:00] does that make sense okay [00:29:01] okay and what this will do is if I just do [00:29:04] and what this will do is if I just do numpy.axis it'll just return basically [00:29:06] numpy.axis it'll just return basically since I'm comparing these columns it'll [00:29:08] since I'm comparing these columns it'll just return a resultant column and so as [00:29:10] just return a resultant column and so as I mentioned you know um for over the [00:29:12] I mentioned you know um for over the axis one you get three values because [00:29:14] axis one you get three values because you're comparing over these columns and [00:29:16] you're comparing over these columns and each column has three values I'm [00:29:18] each column has three values I'm comparing over rows as you mentioned I [00:29:19] comparing over rows as you mentioned I get two values right [00:29:21] get two values right um and so this will just be the Tuple [00:29:22] um and so this will just be the Tuple comma which is just indicating that it's [00:29:24] comma which is just indicating that it's just a list it's not a list of lists [00:29:26] just a list it's not a list of lists it's just a list but let's say I want a [00:29:28] it's just a list but let's say I want a list of lists you know maybe I want to [00:29:29] list of lists you know maybe I want to do those operations I talked about [00:29:30] do those operations I talked about earlier [00:29:31] earlier um instead of reshaping which is always [00:29:33] um instead of reshaping which is always there it's always an option you can also [00:29:35] there it's always an option you can also use this um feature called keep dimms 
[00:29:38] And what that'll do is take the original number of dimensions, which is two (because the shape is (3, 2), just two dimensions), and keep it consistent, so the result will be (3, 1). It just means that instead of returning just the extracted column, which is just a list, it'll keep the column in the context of the original `x`; it'll keep it as a two-dimensional value. [00:30:06] All right, now here are just some operations. In NumPy you can use an asterisk as an element-wise multiplication: an asterisk means that I'm going to be multiplying every single value by every single corresponding value in another matrix, and you need your matrices to be the same size for this one. It's element-wise, not a matrix multiplication, so you need them to be the exact same size.
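A sketch of `keepdims` and element-wise `*` on the same array; the all-threes matrix mirrors the lecture's example:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

flat = np.max(x, axis=1)                   # shape (3,): just a list
kept = np.max(x, axis=1, keepdims=True)    # shape (3, 1): still two-dimensional

threes = np.full((3, 2), 3)
elementwise = x * threes                   # * is element-wise; shapes must match
```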
[00:30:29] So this will multiply, for example, 1 by 3, 2 by 3, 3 by 3, and 4 by 3. All right. [00:30:37] You can also do matrix multiplication, which is a different operation entirely. For those of you unfamiliar with matrix multiplication, you would basically be multiplying a row of one matrix with a column of another matrix, and for that to work you need the second dimension of the first array to be equal to the first dimension of the second array. So for matrix multiplication, if I have matrices shaped (a, b) and (c, d), then b and c have to be equal; just something to keep in mind, because oftentimes when you're doing matrix multiplication you have to make sure these dimensions match, which means that, for example, this is a valid operation, but this can sometimes throw
an error sometimes. [00:31:34] So it's just important to make sure that these are exactly equal; you can actually just print out the shapes and check that they're equal before doing matrix multiplication. [00:31:43] And then for matrix multiplication there are a couple of functions you can use. The first one is just `np.matmul`, which is NumPy's matrix multiplication; you can also just use the `@` operator. Both of those are overloaded, so you can choose whichever one; they'll result in the exact same operation. [00:32:00] And just a quick example to show what this will do: it'll take the row 1, 2 of the first matrix against a column of threes in the second, so it'll do 1 times 3 plus 2 times 3 and add those two values; that's what matrix multiplication will do.
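A sketch of the two equivalent matrix-multiplication spellings; the all-threes second matrix follows the example above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.full((2, 2), 3)        # all threes

# np.matmul and the @ operator are the same overloaded operation:
m1 = np.matmul(a, b)
m2 = a @ b
# Top-left entry: row [1, 2] times column [3, 3], summed: 1*3 + 2*3 = 9.
```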
it operates on [00:32:21] vectors, and a vector, as I mentioned, is just like a one-dimensional matrix, so it's basically three by one or four by one, for example. [00:32:29] It'll element-wise multiply between two different vectors and sum up those values. So here, what a dot product would do would be one times one plus two times ten plus three times a hundred, and in NumPy you can just do np.dot with both of those vectors. [00:32:45] This one is just an aside on how you would want the structure of the dot product to be for arrays with more dimensions. [00:32:57] For single-dimensional vectors this operation works directly; any time it's a multi-dimensional matrix, then the np.dot function treats it as a matrix multiplication. So for a two by two matrix dotted with a two by two matrix, it's not going to return the sum; it's going to
return the matrix multiplication. [00:33:17] So that's just something to keep in mind. If you want to make sure that your dot product is happening in the correct way, you would want to check, similar to what I was talking about earlier. [00:33:32] Here, this is I think the best way to show it. [00:33:36] You would want, like I mentioned, the last dimension of the first one to match with the first dimension of the next one, because it's treating it as a matrix multiplication. [00:33:48] Here the error it's throwing is that it's three comma two combined with three, and the way to fix that would be, for example, [00:33:58] to switch the two, so you have two comma three and then three comma something. [00:34:04] It's really a dimension-matching thing at this point, so it can be a little bit confusing, but the main thing
to keep in mind is: [00:34:11] for single-dimensional vectors you can just do np.dot directly and it'll give you the dot product value; for higher-dimensional matrices it treats it as a matrix multiplication. [00:34:20] And so for those higher-dimensional values, to ensure that you're getting a dot product you'd have to make sure that the dimensions are aligned, similar to these. [00:34:30] So anything that's two by two and up on both sides, any matrix that doesn't have a singleton dimension, yes, it would treat it as a matrix multiplication, the same thing as matmul. [00:34:42] Okay. [00:34:45] All right, I'm going to move on to indexing. So similar to what I was talking about earlier, remember with lists I was saying if you just do the colon it'll create the same array? Same deal here: the colon just means that you take everything from the original array. In fact it
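The np.dot behavior just summarized can be sketched like this; the vectors 1, 2, 3 and 1, 10, 100 are the ones from the example, while the 2-by-2 matrices are illustrative:

```python
import numpy as np

# For 1-D vectors, np.dot returns the scalar dot product:
# 1*1 + 2*10 + 3*100 = 321
v = np.array([1, 2, 3])
w = np.array([1, 10, 100])
s = np.dot(v, w)
print(s)  # 321

# For 2-D arrays, np.dot does NOT sum down to a scalar;
# it performs matrix multiplication instead, same as @.
A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 0], [0, 1]])  # identity, so the product equals A
M = np.dot(A, B)
print(M.shape)  # (2, 2)
```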
returns a [00:34:59] copy for a Python list. One note here: for a NumPy array, a basic slice like x[:] is actually a view rather than a deep copy, so if you want a completely separate copy in memory you'd call .copy(). [00:35:04] Okay, now I'm going into more detail about how you want to index quickly. So if I, for example, have, let's say, this three by four matrix and I only want to select the zeroth and the second rows, how would I do that? [00:35:18] What's useful is that in NumPy you can treat different dimensions differently for indexing. A colon means you select everything in that dimension; for example, here there's a colon in the second dimension, which means I'm taking all of the column values. [00:35:33] Versus what's in the first dimension here: it's a NumPy array of zero and two, so it's saying only the zero index and only the two index, which means only the zeroth row and only the second row. So what this would look like would be
something like this: [00:35:50] I have a matrix, [00:35:55] and I only want to select the zeroth row and the second row, and everything in the columns. [00:36:09] And then similarly, for example, if I want to select in the column dimension, say only the first and second columns, I can do that the same way. [00:36:17] So you can basically treat the dimensions separately: you can think how many rows do I want, how many columns do I want, and then index them separately, and that goes for as many dimensions as you have in your entire tensor. [00:36:28] Some nice things also: let me print that X here, I'll just generate the X. [00:36:37] Okay, so this is X. If I want to take all the values of X that are above 0.5, for example, I can do that by using what's called Boolean indexing. So I just basically
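The row selection just described can be sketched as follows; the 3-by-4 matrix of values 0 through 11 is illustrative:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # a 3x4 matrix

# First dimension: an array of the row indices we want (0 and 2).
# Second dimension: a colon, meaning "all columns".
rows = X[np.array([0, 2]), :]
print(rows)
# [[ 0  1  2  3]
#  [ 8  9 10 11]]

# The same idea works per dimension: all rows, columns 1 and 2 only.
cols = X[:, np.array([1, 2])]
print(cols.shape)  # (3, 2)
```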
would say x indexed by [00:36:50] everything in X that's bigger than 0.5. So it's pretty direct, and it'll just output all the values in this entire array that are bigger than 0.5. [00:37:00] All right, this one is another way to do reshaping. I kind of mentioned earlier, sometimes you'll have this list of three elements and you want to reshape it to a three by one array, for example. You can also use what's called np.newaxis; this will essentially add another axis [00:37:18] in whatever dimension you want. So if I want to go from this three by four array to a three by four by one, then I can just add an np.newaxis there. [00:37:31] An even simpler way to think about it would be going from a shape of two to a shape of two comma one. So it's just another way to do what would essentially be the reshape operation. [00:37:45] Does that make sense? Also, what this would look like, for example, let me just
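The Boolean indexing described at the start of this segment can be sketched like this; the random 3-by-4 matrix and the fixed seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3, 4))  # random values in [0, 1)

# Boolean indexing: X > 0.5 is a 3x4 array of True/False,
# and indexing with it pulls out the matching values,
# flattened into a 1-D array.
big = X[X > 0.5]
print(big)
```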
[00:37:48] make it a little bit more concrete. [00:37:58] So as we see, I have this list, a singular list, and then inside that list I have a list of lists: a list with element one and a list with element two. So this is what that reshape operation will do, [00:38:10] and what np.newaxis will enable you to do as well. [00:38:15] All right. [00:38:17] I think we're at a good point in time. So the last main topic we'll be covering is broadcasting. [00:38:24] What's really great about broadcasting is that it allows you to operate with NumPy arrays that are of different shapes, where for many operations one of the arrays can be repeated; it allows for that in a very efficient manner. This is actually one of the most useful things about NumPy, and one of its defining features. [00:38:44] And what that means is: if, for example, in this case, we go back to
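The np.newaxis trick described above can be sketched as follows; the array shapes are illustrative:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # shape (3, 4)

# np.newaxis inserts a fresh axis of length 1 wherever it appears,
# here turning (3, 4) into (3, 4, 1).
X3 = X[:, :, np.newaxis]
print(X3.shape)  # (3, 4, 1)

# Equivalent to a reshape: a flat (3,) vector becomes a (3, 1) column.
v = np.array([1, 2, 3])
col = v[:, np.newaxis]
same = v.reshape(3, 1)
print(np.array_equal(col, same))  # True
```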
this example that I had, [00:38:51] where I start off with the zero zero zero array: how do I generate this array versus how do I generate this array? [00:38:56] Instead of me saying, okay, element zero zero plus one, element zero one plus two, and all that stuff, instead of doing it one by one, what broadcasting allows me to do is have only one vector of size one two three, [00:39:14] and depending on how I do the broadcasting, which I'll come to in a second, I can duplicate it along the row dimension or I can duplicate it along the column dimension, and NumPy allows for that; it'll do that on its own in the back end. [00:39:27] And so that's really what broadcasting means: I don't need to, for example, create a new array to begin with which is already duplicated like this and then add the two together; I can just duplicate this and get this. [00:39:41] All right, so now some rules for
broadcasting, and let me just quickly [00:39:44] visually show what broadcasting will do. Oh, sorry. [00:39:50] So broadcasting: this is a pretty good visual analogy. [00:39:54] I had this one comma two comma three vector, and I want to basically add it to, let's say, only the columns of this array. What broadcasting allows you to do is that you only pass these two values in, and on the back end it'll duplicate this along the column dimension, so let's say I have one two three, one two three, one two three, and then it'll do the addition. [00:40:18] Similarly, if I pass it a vector one comma two comma three comma four and I want it to be added to each of the rows instead of each of the columns, it'll be able to do that by duplicating it on the back end. So this is visually what's happening with broadcasting. [00:40:32] All right. [00:40:34] Now some rules: so
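The two duplication directions just described can be sketched like this; the 3-by-3 array of zeros is illustrative:

```python
import numpy as np

Z = np.zeros((3, 3))

row = np.array([1, 2, 3])        # shape (3,): broadcast down the rows
col = np.array([[1], [2], [3]])  # shape (3, 1): broadcast across the columns

A = Z + row  # every row becomes [1, 2, 3]
B = Z + col  # every column becomes [1, 2, 3] going down

print(A)
print(B)
```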
how does NumPy know [00:40:37] when and how to do broadcasting? The main rule to keep in mind for broadcasting is: [00:40:44] it can only happen if all of the dimensions, every single dimension between the two arrays, are compatible. And what does compatible mean? Either the dimension values are equal, or one of them is equal to one. That is the only rule required. [00:40:57] So for example, I start off with this x array, a three by four x array. [00:41:05] Will y of shape three comma one be compatible? [00:41:09] Yes, it will be. Why? Because you have three in the first dimension of both, which is the same, and in the second dimension you have four and you have one, so those are compatible values. [00:41:18] And so if I'm doing, for example, an addition operation x plus y, what this tells NumPy on the back end is: it knows that three and three are the same, but four and one
are not the same; one of [00:41:29] them has dimension one, so it needs to duplicate this y along the second dimension, which means duplicating it along the column dimension. [00:41:37] And once it duplicates it, it'll get a three comma four array, and then it can do the addition, and it does that really fast, so it's better to use broadcasting this way than for you to create a separate, already-duplicated array and then add them. [00:41:51] Similarly, I have this z array, which is one comma four. [00:41:55] What x times z will do is first check: okay, three versus one, is that compatible? Yes, because you have three in one dimension and one in the other; and four and four are compatible. So it knows that these two are compatible, it isn't going to change anything in the second dimension, and in the first dimension it'll know to duplicate z, basically, [00:42:13] in order to duplicate z and
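The compatibility check described here can be sketched directly; the shapes (3, 4), (3, 1), and (1, 4) are the ones from the example, and the values are illustrative:

```python
import numpy as np

x = np.ones((3, 4))
y = np.arange(3).reshape(3, 1)  # (3, 1): duplicated along the columns
z = np.arange(4).reshape(1, 4)  # (1, 4): duplicated along the rows

# Each dimension is either equal or 1, so both broadcasts succeed
# and the result has the full (3, 4) shape.
print((x + y).shape)  # (3, 4)
print((x * z).shape)  # (3, 4)

# Incompatible shapes raise an error instead of broadcasting:
w = np.ones((3, 2))
try:
    x + w
except ValueError:
    print("shapes (3, 4) and (3, 2) are not broadcastable")
```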
so add it [00:42:16] three times in the row dimension, create that array on the back end, and then multiply the two. [00:42:22] So this is giving you an example: I started off with x, I have y, and the final shape will be three comma four. [00:42:31] A lot of times in deep learning you will have the same situation, because you'll have batches of different images coming in, but you want to apply, let's say, the same weight matrix to all of them. Instead of duplicating that weight matrix a hundred times, or depending on your batch size potentially a thousand times, and then adding everything together, you use the same matrix, and NumPy will know: okay, I'm going to be duplicating over the batch dimension, and it'll do that for you on the back end. So broadcasting is used a lot in deep learning because of this. [00:43:00] And basically, in your second homework that's what you'll be doing: implementing a feed
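The batch pattern described above can be sketched like this; the batch size of 100 and the feature sizes 8 and 4 are made-up numbers, and this is not the actual homework code:

```python
import numpy as np

rng = np.random.default_rng(1)

batch = rng.random((100, 8))  # 100 examples, 8 features each
W = rng.random((8, 4))        # one weight matrix shared by the whole batch
b = np.zeros(4)               # one bias vector, shape (4,)

# batch @ W has shape (100, 4); adding b broadcasts it to every
# example without ever materializing 100 copies of the bias.
out = batch @ W + b
print(out.shape)  # (100, 4)
```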
[00:43:03] forward network in NumPy. It'll say you have this W matrix and this b matrix, which is a bias; we'll come to those in class, and it'll ask you to implement them in NumPy, because that's basically what you're doing: you have this input image, you have a weight matrix which will somehow scale it to an output, and that weight matrix will be applied to multiple images in your batch. Those images can be different, but their sizes will be the same, and broadcasting is optimized for that. [00:43:29] Okay. [00:43:30] So these are just more examples of the same thing: the final shape you'll be coming to is a size of three comma four. [00:43:37] Let's see, this one's the example that I showed right here, which is that I have this array of, let's say, zeros, and I have this b array. What size would this be? Yes, good, because you have one
outer list, and inside this you have [00:43:52] one inner list, so it's basically one row and then three values inside. [00:43:57] And so, would this be compatible? Yes. And so it'll know to duplicate over the row dimension, so you're going to get duplicates in the row dimension: you're going to get one two three, one two three, one two three, and that's what's happening here. [00:44:10] These next ones are what it sometimes calls more complex [00:44:14] behavior. What this basically means is that if I have this b vector, which is three comma one, [00:44:20] I can do b plus b dot transpose. The transpose is just switching the dimensions: if I have a two by three matrix, its transpose will be a three by two matrix. [00:44:30] What that means visually is that your row and column dimensions get switched: [00:44:38] one through six goes to, I believe,
something like one two [00:44:42] three four five six, so three rows versus three columns. [00:44:49] And what this is just saying is that a three by one and a one by three, [00:44:54] both of those vectors, will be compatible, because remember, in each dimension it's either the same or one, so it knows to duplicate over both of those dimensions, and that's what's happening here. [00:45:06] Okay, so I think we are right at time. [00:45:10] What I would recommend is basically playing with variations of this for broadcasting, and remember the two rules for broadcasting: it's compatible if each dimension is either the same value or one, and whichever one has the dimension of one is what's going to be duplicated over on the back end. [00:45:24] So yeah, it's not going to be compatible if the dimensions are merely divisible, for example; if you have, let's say, six and three, that's not compatible, [00:45:31] but
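The b plus b-transpose case just described can be sketched as follows; the column vector of 1, 2, 3 is illustrative:

```python
import numpy as np

b = np.array([[1], [2], [3]])  # shape (3, 1)

# (3, 1) + (1, 3): both dimensions are either equal or 1,
# so each side is duplicated and the result is (3, 3),
# with entry (i, j) equal to b[i] + b[j].
S = b + b.T
print(S)
# [[2 3 4]
#  [3 4 5]
#  [4 5 6]]
```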
you can reshape it. [00:45:35] If you'd like to have a one in there, there are tricks you can use, [00:45:37] where you're thinking about how you want this data to be multiplied on the back end: you can maybe reshape everything into, say, a one by eighteen matrix, multiply everything, and then reshape it back. That's what you can do, but you can never just directly make, for example, six by three compatible. [00:45:51] Okay. [00:45:52] So I think let's wrap up; this one's just a quick example of another use of efficient NumPy code. [00:45:58] A quick note: preferably don't use loops whenever you're dealing with large data matrices, mostly because loops are almost always about a hundred times slower; NumPy is usually very, very efficient. This is just an example of what you can accomplish with NumPy versus the same thing using loops. What this is saying is that I have an X matrix of size thousand by thousand,
and I want to [00:46:22] apply, let's say, an addition of five to everything from row 100 onwards. [00:46:27] So visually, what that will look like is: I have this full matrix, and I want everything here to be added with plus five. [00:46:40] In the loop format, I can basically loop over the first dimension from 100 onward and do that. Or in NumPy, I can do what's called np.arange, which will generate integers, like we see: one, two, three, four, five, six, and so on. In this case it's between a hundred and a thousand, so it'll be a hundred, a hundred and one, a hundred and two, all the way to a thousand in the first dimension, and then I just add that with five. [00:47:03] So this is just an example of how you would switch from using loops to using NumPy, and it's a lot, lot faster. ================================================================================ LECTURE 022
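A sketch of the loop-versus-NumPy comparison described here, using the same 1000 by 1000 matrix and the np.arange indexing from the transcript:

```python
import numpy as np

X = np.random.default_rng(0).random((1000, 1000))

# Loop version: add 5 to every element from row 100 onward, row by row.
X_loop = X.copy()
for i in range(100, 1000):
    X_loop[i] += 5

# NumPy version: index rows 100..999 in one shot and add 5.
# (A plain slice X_fast[100:] += 5 would work just as well.)
X_fast = X.copy()
X_fast[np.arange(100, 1000)] += 5

print(np.allclose(X_loop, X_fast))  # True: same result, far fewer Python steps
```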
================================================================================ Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul Source: https://www.youtube.com/watch?v=Uv0AIRr3ptg --- Transcript [00:00:05] And so today I kind of just want to cover the fundamentals of PyTorch. Really, just kind of see what the similarities are between PyTorch and numpy and Python, which you guys are used to at this point, and see how we can build up a lot of the building blocks that we'll need in order to define more complex models. So specifically, we're going to talk today about tensors: what are tensor objects, and how do we manipulate them; what is autograd, and how PyTorch helps us compute different gradients; and finally, how we actually do optimization and how we write the training loop for our neural networks. And if we have time at the end, then we'll try and go through a bit of a demo to kind of put everything together and see how everything comes together when you want to solve an actual NLP task.
[00:00:56] All right, so let's get started. So if you go to the course website, there's a notebook, and you can just make a copy of this Colab notebook and then just run the cells as we go. And so to start: today we're talking about PyTorch. Like I said, it's a deep learning framework that really does two main things. One is it makes it very easy to author and manipulate tensors and make use of your GPU, so that you can actually leverage a lot of that capability. And two is it makes the process of authoring neural networks much simpler: you can now use different building blocks, like linear layers and different loss functions, and compose them in different ways in order to author the types of models that you need for your specific use cases. And so PyTorch is one of the two main frameworks, along with TensorFlow. In this class we'll focus on PyTorch, but they're quite similar.
[00:01:52] And so we'll start by importing torch, and we'll import the neural network module, which is torch.nn. And for this first part of the tutorial, I want to talk a bit about tensors. One thing that you guys are all familiar with now is numpy arrays, and pretty much you can think about tensors as the equivalent in PyTorch to numpy arrays. They're essentially multi-dimensional arrays that you can manipulate in different ways, and you'll essentially use them to represent your data, to be able to actually manipulate it and perform all the different matrix operations that underlie your neural network. And so in this case, for example, if we're thinking of an image, one way you can think about it in terms of a tensor is that it's a 256 by 256 tensor, where it has a width of 256 pixels and a height of 256 pixels. And for instance, if we have a batch of images and those images contain three channels, like red, green, and blue, then we might have a four-dimensional tensor, which is the batch size by the number of channels by the width and the height.
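A small sketch of the shapes being described (the batch size of 32 here is made up; the 256 by 256 and three-channel shapes follow the lecture's example):

```python
import torch

img = torch.zeros(256, 256)           # a single image: width x height
batch = torch.zeros(32, 3, 256, 256)  # a batch: batch size x channels x width x height

print(img.shape)
print(batch.shape)
```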
[00:03:06] And so everything we're going to see today is all going to be represented as tensors, which you can just think of as multi-dimensional arrays. And so to kind of get some intuition about this, we're going to spend a little bit of time going through, essentially, lists of lists, and how we can convert them into tensors and how we can manipulate them with different operations. So to start off with, we just have a simple list of lists that you're all familiar with; in this case it's a two by three list. And now we want to create a tensor, and so here the way we'll create this tensor is by doing torch.tensor and then essentially writing the same syntax that we had before: just write out the list of lists that represents that particular tensor.
[00:03:56] And so in this case we get back a tensor object, which is the same shape and contains the same data. And so now, the second thing with the tensor is that it contains a data type. So there are different data types: for instance, there are floating point numbers at varying levels of precision that you can use, you can have integers, you can have different data types that actually populate your tensor. And so by default I believe this will be float32, but you can explicitly specify which data type your tensor is by passing in the dtype argument. And so we see here now, even though we, you know, wrote in a bunch of integers, they have a decimal point, which indicates that they're floating point numbers. And so, same thing here, we could create another tensor, in this case with data type float32. And in this third example, you see that we create another tensor; we don't actually specify the data type, but PyTorch essentially implicitly takes the data type to be floating point, since we actually passed a floating point number into this tensor.
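A sketch of the dtype behavior being described. One small correction to the hedge above: with all-integer input, torch.tensor actually infers an integer dtype (int64), not float32; you get floats either by passing dtype explicitly or by including a floating point number in the data:

```python
import torch

data = [[1, 2, 3], [4, 5, 6]]  # the 2x3 list of lists from the example

t = torch.tensor(data)                             # dtype inferred from the data: torch.int64
t_float = torch.tensor(data, dtype=torch.float32)  # dtype specified explicitly
t_implied = torch.tensor([[1, 2, 3.5]])            # a float in the data implies torch.float32

print(t.dtype, t_float.dtype, t_implied.dtype)
```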
[00:05:06] So pretty much, at a high level, tensors are like multi-dimensional arrays: we can specify the data type for them, and we can populate them just like numpy arrays. Okay, so now, great: we know how to create tensors, and we know that ultimately everything that we work with, all the data we have, is going to be expressed as tensors. Now the question is, what are the functions that we have to manipulate them? And so we have some basic utilities that can help us instantiate tensors easily, specifically torch.zeros and torch.ones. These are two ways to create tensors of a particular shape, in this case tensors of all zeros or tensors of all ones, and you'll see that this will be very helpful when you do your homeworks.
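These helpers can be sketched as follows (torch.arange, which is covered just below, is included for completeness; the shapes are made up):

```python
import torch

z = torch.zeros(2, 3)  # 2x3 tensor of all zeros
o = torch.ones(2, 3)   # 2x3 tensor of all ones

r = torch.arange(1, 11)  # tensor([1, 2, ..., 10]), like Python's range
r2 = r.reshape(2, 5)     # rows [1..5] and [6..10], as described below

print(z)
print(r2)
```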
[00:05:51] Typically, you'll just need to create a bunch of zero matrices, and it'll be very easy to just specify the shape here without having to write everything out super explicitly, and then you can update that tensor as needed. Another thing you can do: just like we have ranges in Python, where if you want to loop over a bunch of numbers you can specify a range, you can also use torch.arange to actually instantiate a tensor with a particular range. In this case we just looped over the numbers one through ten; you could reshape this and make it one through five and then six through ten. That's another way to instantiate tensors. And finally, something to note is that when we apply particular operations, such as simple Python operations like addition or multiplication, by default they're going to be element-wise, so they'll apply to all the elements in our tensor.
[00:06:52] So in this case we took our tensor, I think this one was probably from earlier above, and we added two everywhere; here we've multiplied everything by two. But pretty much the PyTorch semantics for broadcasting work the same as the numpy semantics. So if you have different matrix operations where you need to batch across a particular dimension, PyTorch will be smart about it, and it will actually make sure that you broadcast over the appropriate dimensions, although of course you have to make sure that the shapes are compatible based on the actual broadcasting rules. So we'll get to that in a little bit when we look at reshaping and how different operations have those semantics.
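A sketch of element-wise operations and numpy-style broadcasting (the tensor values here are made up):

```python
import torch

t = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

print(t + 2)  # element-wise: adds 2 everywhere
print(t * 2)  # element-wise: multiplies everything by 2

# Broadcasting: a (3,) tensor is stretched across each row of the (2, 3) tensor
row = torch.tensor([10., 20., 30.])
print(t + row)
```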
[00:07:45] I guess I'm not personally aware of how you would define kind of a jagged tensor that has unequal dimensions, but typically we don't want to do that, because it makes our computation a lot more complex. And so in cases where, for instance, we have different sentences that we turn into tokens, we might have different length sentences in our training set, and we'll actually pad all the dimensions to be the same, because ultimately we want to do everything with matrix operations, and in order to do that we need to have a matrix of a fixed shape. But yeah, that's a good point. I'm not sure if there is a way to do that, but typically we just get around this by padding. Okay, so now we know how to define tensors; we can do some interesting things with them. So here we've created two tensors: one of them is a three by two tensor, and the other one is a two by four tensor.
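The pad-to-fixed-shape workaround described above can be sketched like this (the token ids and the pad id 0 are made up; PyTorch also provides torch.nn.utils.rnn.pad_sequence for the same job):

```python
import torch

# Hypothetical tokenized sentences of unequal length
seqs = [[5, 3, 9], [7, 1], [2, 8, 4, 6]]

# Pad every sequence with 0 up to the longest length,
# so the whole batch fits in one fixed-shape matrix
max_len = max(len(s) for s in seqs)
padded = torch.tensor([s + [0] * (max_len - len(s)) for s in seqs])

print(padded.shape)
print(padded)
```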
[00:08:40] And I think the answer is written up here, but what do we expect the shape to be when we multiply these two tensors? So we have a three by two tensor and a two by four tensor... Yeah, three by four. And so, more generally, we can use matmul in order to do matrix multiplication. It also implements batched matrix multiplication, and I won't go over the entire review of broadcasting semantics, but the main gist is that the dimensions of two tensors are compatible if you can left-pad the tensors with ones so that the dimensions that line up either (a) have the same number in that dimension, or (b) one of them is a dummy dimension: one of them has a one. And in that case, in those dummy dimensions, PyTorch will actually make sure to copy over the tensor as many times as needed so that you can then actually perform the operation.
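The shape question above can be checked directly (using all-ones tensors as stand-ins for the lecture's values):

```python
import torch

a = torch.ones(3, 2)
b = torch.ones(2, 4)

c = torch.matmul(a, b)  # (3, 2) x (2, 4) -> (3, 4)
d = a @ b               # @ is shorthand for matmul

print(c.shape)
```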
[00:09:41] And that's useful when you want to do things like batched dot products or batched matrix multiplications. And I guess the final point here is that there's also a shorthand notation that you can use: instead of having to type out matmul every time, you can just use the @ operator, similar to numpy. Effectively, that's where we get into how batching works. So for example, say you had two tensors that have some batch dimension, and then one of them is m by one and the other one is one by n. If you do a batched matrix multiply of those two tensors, what you effectively do is preserve the batch dimension, and then you're doing a matrix multiplication between an m by one tensor and a one by n tensor, so you get something that's the batch dimension by m by n. I think the full semantics are written out on the PyTorch website for how the matrix multiplication works.
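The batched case in the answer above, sketched with a made-up batch size:

```python
import torch

B, m, n = 8, 5, 7  # made-up sizes
x = torch.randn(B, m, 1)
y = torch.randn(B, 1, n)

# The batch dimension is preserved; each (m, 1) @ (1, n) pair
# gives an (m, n) result, so the output is (B, m, n)
out = x @ y
print(out.shape)
```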
[00:10:45] But you're right: you don't just have these cases where you have two two-dimensional tensors; you can have an arbitrary number of dimensions, and as long as the dimensions match up based on those semantics, then you can multiply them. Alternatively, you can do what I do, which is just multiply them anyway, and then if it throws an error, print out the shapes and work from there; that tends to be faster, in my opinion, a lot of the time. But yeah, that's a good point. All right, so let's keep going through some of the other different functionalities here. So we can define another tensor, and one of the key things that we always want to look at is the shape. So in this case we just have a 1D tensor of length three, so the size, torch.Size, just gives us three. In general, this is one of the key debugging steps, and something that I'll try and emphasize a lot throughout this session: printing the shapes of all of your tensors is probably your best resource when it comes to debugging.
[00:11:48] It's kind of one of the hardest things to intuit exactly what's going on once you start stacking a lot of different operations together, so printing out the shapes at each point and seeing whether they match what you expect is important. And it's better to rely on that than just on the error message that PyTorch gives you, because under the hood PyTorch might implement certain optimizations and actually reshape the underlying tensor you have, so you may not see the numbers you expect. So it's always great to print out the shape. And so, yeah, again, we can always print out the shape, and we can have a more complex, in this case three-dimensional, tensor, which is three by two by four, and we can print out the shape and see all the dimensions here.
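The shape-printing habit being recommended, on the two tensors from the example (the values in the 3x2x4 tensor are placeholders):

```python
import torch

t1 = torch.tensor([1., 2., 3.])  # 1D tensor of length three
t3 = torch.zeros(3, 2, 4)        # the 3x2x4 tensor from the example

print(t1.shape)
print(t3.shape)
```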
[00:12:40] And so now you're like, okay, great, we have tensors and we can look at their shapes, but what do we actually do with them? So now let's get into what operations we can apply to these tensors. And one of them is that it's very easy to reshape tensors. So in this case we're creating this 15-element tensor that's the numbers 1 to 15, and now we're reshaping it, so it's a five by three tensor here. And so you might wonder, well, what's the point of that? It's because a lot of times, when we are doing machine learning, we actually want to learn in batches, and so we might take our data and reshape it, so that instead of being a long, flat list of things, we actually have a set of batches, or in some cases a set of batches of a set of sentences, or sequences, of a particular length, where each of the elements in that sequence has an embedding of a particular dimension.
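The reshape example above as code:

```python
import torch

t = torch.arange(1, 16)  # the numbers 1 to 15
m = t.reshape(5, 3)      # now a 5x3 tensor

print(m)
```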
[00:13:46] And so, based on the types of operations that you're trying to do, you'll sometimes need to reshape those tensors, and sometimes you'll want to transpose dimensions, if you want to, for instance, reorganize your data. So that's another operation to keep in mind. I believe the difference is that view will create a view of the underlying tensor, and so I think the underlying tensor will still have the same shape; reshape will actually modify the tensor. All right, and then finally, like I said at the beginning, your intuition about PyTorch tensors can simply be that they're kind of a nice, easy way to work with numpy arrays, but they have all these great properties: now we can essentially use them with GPUs, it's very optimized, and we can also compute gradients quickly.
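For reference, the documented distinction is slightly different from the hedge above: view returns a tensor that shares storage with the original (and requires contiguous memory), reshape returns a view when possible and otherwise a copy, and neither modifies the original in place. A small sketch, with made-up shapes:

```python
import torch

t = torch.arange(12)

v = t.view(3, 4)     # shares storage with t; t itself keeps shape (12,)
r = t.reshape(3, 4)  # same result here; would copy if t were non-contiguous

v[0, 0] = 99  # writing through the view changes the underlying storage
print(t[0])   # t sees the change

print(v.transpose(0, 1).shape)  # transpose swaps two dimensions: (4, 3)
```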
[00:14:46] And to kind of just emphasize this point: if you have some numpy code and you have a bunch of numpy arrays, you can directly convert them into PyTorch tensors by simply casting them, and you can also take those tensors and convert them back to numpy arrays. All right, and so one of the things you might be asking is: why do we care about tensors, and what makes them good? One of the great things about them is that they support vectorized operations very easily; essentially, we can parallelize a lot of different computations and do them, for instance, across a batch of data all at once. And one of those operations you might want to do, for instance, is a sum. So you can take, in this case, a tensor which is shape five by seven, and... it looks like that's not working.
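The numpy round trip mentioned above, sketched (the array values are made up):

```python
import numpy as np
import torch

arr = np.array([[1., 2.], [3., 4.]])

t = torch.from_numpy(arr)  # numpy -> tensor (shares memory with arr)
t2 = torch.tensor(arr)     # numpy -> tensor (makes a copy)
back = t.numpy()           # tensor -> numpy

print(type(back), back.shape)
```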
[00:15:42] You can take a tensor that's shaped five by seven, and now you can compute different operations on it that essentially collapse the dimensionality. So the first one is sum: you can take it and sum across both the rows as well as the columns. And one way I like to think about this, to keep them straight, is that the dimension that you specify in the sum is the dimension you're collapsing. So in this case, if you take the data and sum over dimension zero, because you know the shape of the underlying tensor is five by seven, you've collapsed the zeroth dimension, so you should be left with something that's just shape seven. And if you look at the actual tensor you got, 75, 80, 85, 90: you get this tensor, which is shape seven.
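The sums being described, sketched with torch.arange(35).reshape(5, 7) (values 0 to 34, which reproduces the 75, 80, 85, 90 seen in the column sums):

```python
import torch

data = torch.arange(35).reshape(5, 7)  # values 0..34, shape (5, 7)

print(data.sum(dim=0))  # collapses dim 0 -> shape (7,): tensor([70, 75, 80, 85, 90, 95, 100])
print(data.sum(dim=1))  # collapses dim 1 -> shape (5,)
print(data.sum())       # no dim given: sums the whole tensor
```

The same pattern applies to other reductions, e.g. `data.float().std(dim=0)` or `data.float().mean(dim=0)` (std and mean need a floating point dtype).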
is shape seven. [00:16:30] Alternatively, you can think about [00:16:32] whether or not you're kind of summing across the rows or across the [00:16:35] columns. [00:16:36] But it's not just sum, it applies to [00:16:39] other operations as well: you can compute standard deviations, you can normalize [00:16:43] your data, you can do other operations which essentially batch across the [00:16:47] entire set of data. [00:16:49] And not only do these apply over one dimension, but here you can see that if [00:16:54] you don't specify any dimensions, then by default the operation actually applies [00:16:58] to the entire tensor, so here we end up [00:17:01] just taking the sum of the entire thing. [00:17:03] So if you think about it, the zeroth dimension is the number of rows; there [00:17:07] are five rows and there are seven columns, so if we sum out the rows, then [00:17:14] we're actually summing across the [00:17:16] columns, [00:17:17] and so now we only have seven values. [00:17:20] But I like to think about it more just in [00:17:22] terms of the
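The sums just discussed, sketched in code (the five-by-seven tensor of 1 through 35 is inferred from the printed sums 75, 80, 85, 90):

```python
import torch

# The five-by-seven tensor implied by the printed sums (values 1 through 35)
data = torch.arange(1, 36).reshape(5, 7)

col_sums = data.sum(dim=0)  # collapse dim 0 -> one sum per column, shape (7,)
row_sums = data.sum(dim=1)  # collapse dim 1 -> one sum per row, shape (5,)
total = data.sum()          # no dim given -> sum over the whole tensor

print(col_sums)  # starts 75, 80, 85, 90, matching the values read out above
```

The rule of thumb from the lecture holds: the `dim` you pass is the dimension that disappears from the result's shape.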
dimensions, to keep it [00:17:23] straight, rather than rows or columns, [00:17:26] because it can get confusing. If you're summing out dimension zero, then [00:17:28] effectively you've taken something which has some shape that's dimension zero by [00:17:32] dimension one, to just whatever is the dimension-one shape, [00:17:36] and then from there you can kind of figure out, okay, which way did I actually [00:17:40] sum, to check if you were right. [00:17:43] numpy implements a lot of this vectorization, [00:17:45] and I believe in the homework that [00:17:48] you have right now, I think part of your job is to vectorize a lot of these things. [00:17:53] So the big advantage with PyTorch is that essentially it's optimized to be [00:17:58] able to take advantage of your GPU. When [00:18:00] we actually start building out neural networks that are bigger, that involve [00:18:04] more computation, we're going to be doing a lot of these matrix multiplication [00:18:07] operations, and it's going to be a lot [00:18:10] better for our processor
if we can make [00:18:12] use of the GPU, and so that's where [00:18:15] PyTorch really comes in handy, [00:18:17] in addition to also defining a lot of [00:18:20] those neural network modules for you, as we'll see later, so that now you don't [00:18:25] need to worry about, for instance, [00:18:26] implementing a basic linear layer and backpropagation from scratch, and also [00:18:31] your optimizer. All of those things will [00:18:34] be built in, and you can just call the respective APIs to make use of them, [00:18:37] whereas in Python and numpy you might [00:18:40] have to do a lot of that coding yourself. [00:18:47] All right, so [00:18:50] we'll keep going. [00:18:54] So this is a quiz, except I think it tells you the answer, so it's not much of [00:18:59] a quiz, [00:19:00] but [00:19:02] pretty much, you know, what would you do [00:19:03] if now I told you, instead of, you know, [00:19:06] summing over this tensor, I want you to [00:19:09] compute the average? [00:19:11] And so there's two different [00:19:12] ways you could compute the average: you [00:19:14] could compute the average across the [00:19:16] rows or across
the columns. [00:19:18] And so essentially [00:19:21] now we kind of get back to this question of, well, which dimension am I actually [00:19:25] going to reduce over? And so here, if we [00:19:27] want to preserve the rows, then we need to actually average over the second [00:19:31] dimension, [00:19:33] um, [00:19:34] really they're numbered zeroth and [00:19:37] first, so the first dimension is what we have to average over, because we want to [00:19:41] preserve the zeroth dimension, [00:19:44] and so that's why for row average you [00:19:46] see the dim equals one, [00:19:48] and for column average the same reasoning is [00:19:50] why you see the dim equals zero. [00:19:53] And so if we run this code, we'll see [00:19:57] kind of what are the shapes that we [00:19:59] expect: if we're taking the average over [00:20:01] rows, then an object that's two by three [00:20:04] should just become an object that's two, [00:20:06] it's just one-dimensional, [00:20:09] almost a vector, you can think of. [00:20:11] And if we are averaging across the [00:20:14] columns, there's three columns, so now our [00:20:16] average should
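The row and column averages being described might be sketched like this (the two-by-three values are made up here; the transcript doesn't show the notebook's actual numbers):

```python
import torch

# A 2x3 tensor; these particular values are assumed for illustration
data = torch.arange(1., 7.).reshape(2, 3)  # [[1., 2., 3.], [4., 5., 6.]]

row_avg = data.mean(dim=1)  # collapse dim 1, preserve rows -> shape (2,)
col_avg = data.mean(dim=0)  # collapse dim 0, preserve columns -> shape (3,)
```

Same collapsing rule as with `sum`: `dim=1` removes the second dimension, leaving one average per row.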
have three values, and so [00:20:19] now we're left with a three, a [00:20:21] one-dimensional tensor of length three. [00:20:25] So yeah, does that kind of make sense, I [00:20:27] guess, this general intuition about [00:20:28] how we deal with shapes and how some of [00:20:31] these operations manipulate shapes? [00:20:33] So now we'll get into indexing. [00:20:35] This can get a little bit tricky, [00:20:38] but [00:20:40] I think you'll find that the semantics [00:20:42] are very similar to numpy. [00:20:44] So one of the things that you can do in [00:20:48] numpy is that you can take these numpy arrays and you can slice across them in [00:20:52] many different ways, you can create [00:20:54] copies of them, [00:20:55] and you can index across particular [00:20:58] dimensions to select out different [00:21:00] elements, different rows, or different [00:21:02] columns. [00:21:03] And so in this case, let's take this [00:21:05] example tensor which is [00:21:07] three by two by two, [00:21:10] and [00:21:12] the first thing you'll always want to do [00:21:13] when you have a new tensor is print out its [00:21:15] shape, understand what
you're working with. [00:21:18] And so, [00:21:20] I guess, uh, [00:21:23] I may have shown this already, but what [00:21:25] will x[0] print out, what [00:21:28] happens if we index into just the first [00:21:29] element, [00:21:31] what's the shape of this? [00:21:35] Yeah, two by two, right, because if you [00:21:38] think about it, our tensor is really just [00:21:41] a list of three things, and each of those [00:21:43] things happens to also be a two by two [00:21:45] tensor, so we get a two by two object, in [00:21:48] this case the first thing, one two three [00:21:50] four. [00:21:52] And so, just like numpy, if you provide a [00:21:55] colon in a particular dimension, it means [00:21:57] essentially keep everything along that dimension, [00:22:00] so if we do x[0], implicitly [00:22:03] we're essentially putting a colon for [00:22:05] all the other dimensions, so it's [00:22:07] essentially saying grab the first thing [00:22:10] along the zeroth dimension and then grab [00:22:13] everything along the other two [00:22:14] dimensions. [00:22:15] If we now [00:22:18] take, uh, just the zeroth [00:22:21] element along the first dimension, [00:22:24] um, [00:22:24] what are
we going to get? Well, [00:22:28] ultimately we're going to get, now if you [00:22:31] look, uh, the kind of first dimension was [00:22:34] these three things, the second dimension [00:22:36] is now each of these two rows within [00:22:39] those things, so like one two and three [00:22:41] four, five six and seven eight, [00:22:43] nine ten and eleven twelve. So if we index into the [00:22:48] first dimension [00:22:50] and get the zeroth element, [00:22:52] then we're going to end up with one two, [00:22:54] five six, and nine ten. [00:22:58] And [00:22:59] even if that's a little bit tricky, you [00:23:01] can kind of go back to the trick I [00:23:03] mentioned before, where we're slicing [00:23:06] across the first dimension. So if we look [00:23:08] at the shape of our tensor, it's three by [00:23:10] two by two; [00:23:12] if we collapse the first dimension, that [00:23:15] two in the middle, we're left with [00:23:16] something that's three by two. [00:23:18] So it might seem a little bit trivial, [00:23:21] kind of going through this in a lot of [00:23:23] detail, but I think it's important, [00:23:25] because it can get tricky when your
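A quick sketch of the three-by-two-by-two indexing just walked through, with the same 1 through 12 values:

```python
import torch

# The 3x2x2 example tensor from the walkthrough: three 2x2 blocks
x = torch.arange(1, 13).reshape(3, 2, 2)

first_block = x[0]     # [[1, 2], [3, 4]], shape (2, 2)
first_rows = x[:, 0]   # first row of each block: [[1, 2], [5, 6], [9, 10]]
```

`x[0]` is shorthand for `x[0, :, :]`, and `x[:, 0]` collapses the middle dimension, leaving shape (3, 2).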
[00:23:26] tensor shapes get more complicated, how [00:23:29] to actually reason about this. [00:23:31] And so I won't go through every example [00:23:33] here, since a lot of them kind of [00:23:35] reinforce the same thing, but I'll just [00:23:38] highlight a few things. Just like numpy, [00:23:40] you can choose to get a range of [00:23:43] elements, [00:23:44] in this case [00:23:47] where we're taking this new tensor, which [00:23:50] is [00:23:51] one through fifteen rearranged, [00:23:54] that's a five by three tensor, and if we [00:23:57] take the zero through third row, [00:24:00] um, exclusive, we'll get the first three [00:24:02] rows, [00:24:04] and we can do the same thing but now [00:24:06] with slicing across multiple dimensions. [00:24:10] And I think the final point I want to [00:24:13] talk about here is list indexing. [00:24:16] List indexing is also present in numpy, [00:24:19] and it's a very clever shorthand for [00:24:21] being able to essentially select out [00:24:23] multiple elements at once. [00:24:26] So in this case, what you can do is, [00:24:29] if you want to get the zeroth, the second, [00:24:32] and the fourth element
[00:24:34] of our matrix, you can just, instead of [00:24:37] indexing with a particular number or set [00:24:39] of numbers, index with a list of indices. [00:24:43] So in this case, if we go up to our [00:24:45] tensor, [00:24:48] if we take out the zeroth, the second, and [00:24:51] the fourth, we should see those three [00:24:53] rows, [00:24:55] and that's what we end up getting. [00:25:00] Yeah, again, these are kind of a lot of [00:25:03] examples to just reiterate the same [00:25:05] point, which is that you can slice across [00:25:08] your data in multiple ways, and at [00:25:10] different points you're going to need to [00:25:11] do that, [00:25:12] so being familiar with the shapes, so that [00:25:15] you understand what's the underlying [00:25:17] output that you expect, is important. [00:25:19] In this case, for instance, we're slicing [00:25:22] across the first and the second [00:25:23] dimension, and we're keeping the, uh, [00:25:27] the zeroth, and so [00:25:29] we're going to end up getting [00:25:30] essentially kind of the top left [00:25:32] element of each of those three things in [00:25:35] our tensor. If we scroll all
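The range and list indexing being described can be sketched as follows, using the five-by-three tensor of 1 through 15 and the three-by-two-by-two tensor from before:

```python
import torch

t = torch.arange(1, 16).reshape(5, 3)  # 1..15 as a 5x3 tensor

first_three = t[0:3]   # rows 0 through 2 (the end index is exclusive)
picked = t[[0, 2, 4]]  # list indexing: rows 0, 2 and 4 selected at once

x = torch.arange(1, 13).reshape(3, 2, 2)
corners = x[:, 0, 0]   # top-left entry of each 2x2 block
```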
the way up here, we'll get this one, [00:25:40] this five, and this nine, [00:25:42] because we go across all of the [00:25:45] zeroth dimension, and then across the [00:25:47] first and the second, we only take the, [00:25:49] uh, [00:25:51] the zeroth element in both of those [00:25:53] positions, and so that's why we get 1, 5, [00:25:56] 9. [00:26:01] And also, of course, you can, you know, [00:26:03] apply all of the colons to get back the [00:26:05] original tensor. [00:26:11] Okay, and then I think the last thing [00:26:14] when it comes to indexing is conversions. [00:26:17] So typically, when we're writing code [00:26:20] with neural networks, ultimately we're [00:26:22] going to, [00:26:24] you know, process some data through a [00:26:26] network and we're going to get a loss, [00:26:27] and that loss needs to be a scalar, and [00:26:30] then we're going to compute gradients [00:26:31] with respect to that loss. So one thing [00:26:34] to keep in mind is that sometimes you [00:26:37] might have an operation and it fails [00:26:38] because it was actually expecting a [00:26:40] scalar value
rather than a tensor, and so [00:26:43] you can extract out the scalar from this [00:26:45] one-by-one tensor by just calling .item(). [00:26:51] So in this case, you know, if you have a [00:26:53] tensor which is just literally one, then [00:26:56] you can actually get the Python scalar [00:26:57] that corresponds to it by calling .item(). [00:27:00] So now we can get into the more [00:27:01] interesting stuff: [00:27:03] one of the really cool things with [00:27:04] pytorch is autograd. [00:27:07] And what autograd is, is PyTorch [00:27:10] essentially provides an automatic [00:27:13] differentiation package, where when you [00:27:17] define your neural network, you're [00:27:19] essentially defining many nodes that [00:27:22] compute some function, [00:27:24] and in the forward pass you're kind of [00:27:26] running your data through those nodes, [00:27:28] but what pytorch is doing on the back [00:27:30] end is that at each of those points it's [00:27:33] going to actually store the gradients [00:27:34] and accumulate them, so that every time [00:27:37] you do your backwards pass, [00:27:39] you apply the chain rule to be able
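The .item() extraction just mentioned, as a tiny sketch (the 12.0 here is an arbitrary stand-in for a loss value, not from the notebook):

```python
import torch

loss = torch.tensor([[12.0]])  # a one-by-one tensor, e.g. a loss value
value = loss.item()            # the plain Python float inside it
```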
to calculate all these different gradients, [00:27:43] and pytorch caches those gradients, [00:27:46] and then you will have access to all of [00:27:49] those gradients to be able to actually [00:27:51] then run your favorite optimizer and [00:27:54] optimize, you know, with SGD or with Adam, [00:27:57] or whichever optimizer you choose. [00:27:59] And so that's kind of one of the great [00:28:02] features: you don't have to worry about [00:28:04] actually writing the code that computes [00:28:06] all these gradients, and actually caches [00:28:09] all of them properly, applies the chain [00:28:10] rule, does all these steps. You have [00:28:13] abstracted all of that away with just one [00:28:15] call to .backward(). [00:28:17] And so in this case, we'll run through a [00:28:20] little bit of an example where we'll see [00:28:22] the gradients getting computed [00:28:24] automatically. [00:28:26] So [00:28:29] in this case we're going to initialize a [00:28:31] tensor, [00:28:32] and requires_grad is true; it [00:28:36] just means that for a given [00:28:38] tensor, [00:28:40] pytorch
will store the gradient [00:28:42] associated with it. And you might wonder, [00:28:45] well, you know, why do we have [00:28:49] this, do we always want to [00:28:51] store the gradient, and the answer is: at [00:28:54] train time you need the gradients in [00:28:55] order to actually train your network, but [00:28:58] at inference time you'd actually want to [00:28:59] disable your gradients, and you can [00:29:01] actually do that, because it's a lot of [00:29:03] extra computation that's not needed, [00:29:05] since you're not making any updates to [00:29:07] your network anymore. [00:29:09] And so let's create this right now. [00:29:13] Uh, we don't have any gradients being [00:29:16] computed, because we haven't actually [00:29:18] called backward to actually compute, [00:29:22] um, [00:29:22] some quantity with respect to this [00:29:25] particular tensor, we haven't actually [00:29:27] computed [00:29:28] those gradients yet, so right now the [00:29:31] .grad attribute, which will actually [00:29:33] store the gradient associated with that [00:29:35] tensor, is None. [00:29:37] And so now let's just define a really [00:29:39]
simple function: we have x, we're going to [00:29:42] define the function y equals 3x squared, [00:29:46] and so now we're going to call y.backward(). [00:29:50] And so now what happens is, when we [00:29:52] actually print out x.grad, what we [00:29:55] should expect to see is the number 12, and [00:29:59] the reason is that [00:30:01] our function y is 3x squared; if we [00:30:04] compute the gradient of that function, [00:30:05] we're going to get 6x, [00:30:08] and our actual value was 2, so the [00:30:12] actual gradient is going to be 12, [00:30:16] and we see that when we print out x.grad, [00:30:17] that's what we get. [00:30:21] And now we'll just run it again: let's [00:30:24] set z equal to 3x squared, we call z.backward(), [00:30:27] and we print out x.grad [00:30:29] again, and now we see that... [00:30:32] I may not have run this in the right order. [00:30:36] Okay, so [00:30:39] here in the second one that I re-ran, we [00:30:41] see that it says 24.
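The example just run can be reconstructed like this, mirroring the y = 3x squared walkthrough with x = 2:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
print(x.grad)   # None: nothing has called backward yet

y = 3 * x ** 2
y.backward()    # d(3x^2)/dx = 6x, which is 12 at x = 2
print(x.grad)   # tensor(12.)

z = 3 * x ** 2
z.backward()    # gradients accumulate rather than overwrite
print(x.grad)   # tensor(24.)
```

Calling `x.grad.zero_()` (or `optimizer.zero_grad()` in a training loop) resets the accumulated value, which is the point the lecture makes next.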
And so you might be [00:30:44] wondering, well, I just did the same thing [00:30:45] twice, shouldn't I see 12 again? [00:30:48] And the answer is that by default [00:30:50] pytorch will accumulate the gradients, so [00:30:54] it won't actually rewrite the gradient [00:30:56] each time you compute it, it will sum it, [00:30:59] and the reason is because when you [00:31:01] actually have backpropagation for your [00:31:03] network, you want to accumulate the [00:31:05] gradients, you know, across all of your [00:31:07] examples, and then actually apply your [00:31:09] update, you don't want to overwrite the [00:31:10] gradient. But this also means that every [00:31:13] time you have a training iteration for [00:31:16] your network, you need to zero out the [00:31:17] gradient, because you don't want the [00:31:19] previous gradients from the last epoch, [00:31:22] where you iterated through all of your [00:31:23] training data, to mess with the current [00:31:26] update that you're doing. [00:31:28] So that's kind of one thing to note, [00:31:30] which is [00:31:32] essentially why we will see, when [00:31:35] we
actually write the training loop, you [00:31:37] have to run zero_grad in order to zero [00:31:39] out the gradient. [00:31:40] Yes, so I accidentally ran the cells in [00:31:44] the wrong order; maybe to make it more [00:31:46] clear, let me put this one first. [00:31:52] So this is actually what it should look [00:31:54] like, which is that we ran it once, and I [00:31:56] ran this cell first, and it has 12, and [00:32:00] then we ran it a second time and we get [00:32:02] 24. [00:32:03] Yes, so if you have all of your tensors [00:32:06] defined, [00:32:07] then when you actually call [00:32:09] backward, if it's a function of multiple [00:32:11] variables, it's going to compute all of [00:32:13] those partials, all of those gradients. [00:32:15] Yeah, so what's happening here is that [00:32:17] the way pytorch works is that it's [00:32:20] storing the accumulated [00:32:23] gradient at x, and so we've essentially [00:32:26] made two different [00:32:28] backwards passes: we've called it once on [00:32:31] this function y, which is a [00:32:34] function of x, and we've called it once [00:32:35] on z, which is also a function of x, and [00:32:38] so you're right, we
can't actually [00:32:39] disambiguate which came from what, we [00:32:41] just see the accumulated gradient, but [00:32:44] typically that's actually exactly what [00:32:46] we want, because what we want is to be [00:32:48] able to run our network and accumulate [00:32:51] the gradient across all of the training [00:32:54] examples that define our loss, [00:32:55] and then perform our optimizer step. So, [00:32:58] yeah, even with respect to one thing it [00:33:00] doesn't matter, because in practice each [00:33:02] of those things is really a different [00:33:04] example in our set of training examples, [00:33:06] and so we're not interested in, you know, [00:33:08] the gradient from one example, we're [00:33:10] actually interested in the overall [00:33:11] gradient. [00:33:12] So going back to this example, [00:33:15] what's happening here is that in the [00:33:17] backwards pass, what it's doing is, [00:33:21] you can imagine there's the x tensor, and [00:33:23] then there's the .grad attribute, [00:33:25] which is another separate tensor, it's [00:33:26] going to be the same shape as x, [00:33:28] and what that is storing is it's
storing the accumulated gradient from every single time that you've called .backward on a quantity that has some dependency on x that will have a non-zero gradient. [00:33:44] And so the first time we call it, the gradient will be 12, because 6x: 6 times 2 is 12. The second time we do it, with z, it's also still 12. But the point is that .grad doesn't actually overwrite the gradient each time you call .backward; it simply adds them, it accumulates them. [00:34:02] And the intuition there is that ultimately you're going to want to compute the gradient with respect to the loss, and that loss is going to be made up of many different examples, and so you need to accumulate the gradient from all of those in order to make a single update. [00:34:18] And then of course you'll have to zero that out.
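The 12-then-24 behavior just described takes only a few lines to reproduce. This is a minimal sketch, not the notebook's exact code: the function 3x² is chosen so the gradient 6x at x = 2 is 12, matching the numbers in the lecture.

```python
import torch

# A leaf tensor tracked by autograd.
x = torch.tensor(2.0, requires_grad=True)

# y = 3x^2, so dy/dx = 6x, which is 12 at x = 2.
y = 3 * x ** 2
y.backward()
print(x.grad)  # tensor(12.)

# z is also 3x^2; calling backward again ADDS to x.grad.
z = 3 * x ** 2
z.backward()
print(x.grad)  # tensor(24.) -- accumulated, not overwritten

# Zero it before the next pass (an optimizer's zero_grad()
# does this for every parameter it manages).
x.grad.zero_()
print(x.grad)  # tensor(0.)
```

The key point is the second print: `.grad` holds 24, the sum of the two backward passes, until it is explicitly zeroed.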
Every time you make one pass through all of your data, you don't want that next batch of data to also be double counting the previous batch's update; you want to keep those separate, and we'll see that in a second. [00:34:38] All right, so now we're going to move on to one of the final pieces of the puzzle, which is neural networks: how do we actually use them in PyTorch? And once we have that, and we have our optimization, we'll finally be able to figure out how we actually train a neural network, what that looks like, and why it's so clean and efficient when you do it in PyTorch. [00:35:02] So the first thing we're going to do is define neural networks in terms of existing building blocks, in terms of existing APIs, which implement for instance the linear layers or the different activation functions that we need. So we're going to import torch.nn, because that is the neural network package that we're going to make use of, and so let's
start with the linear layer. [00:35:27] The way the linear layer works in PyTorch is that it takes in two arguments: the input dimension and then the output dimension. Pretty much what it does is take in some input which has some arbitrary number of dimensions, with the input dimension last, and output that same set of dimensions except with the output dimension in the very last place. [00:35:57] And you can think of the linear layer as essentially just performing a simple Ax + b; by default it's going to apply a bias, but you can also disable that if you don't want a bias term. [00:36:13] And so let's look at a small example. Here we have our input, and we're going to create a linear layer, in this case with an input size of four and an output size of two. And all we're going to do, once we define it by instantiating it with nn.Linear,
whatever the name of our layer is (in this case we called it linear), is just apply it with parentheses, as if it were a function, to whatever input, and that actually does the forward pass through this linear layer to get our output. [00:37:01] And so you can see that the original shape was two by three by four; then we pass it through this linear layer, which has an output dimension of size two, and so ultimately our output is two by three by two, which is good: that's what we expect, there's no shape error. [00:37:18] But something common that you'll see is that maybe you get a little confused and you do, let's say, two by two; you match the wrong dimension, and so here we're going to get a shape error. And you'll see that the error message isn't as helpful, because it's actually changed the shape of what we were working with: we said this was two by three by
four; [00:37:45] under the hood, PyTorch has changed it to a six by four. In this case it's obvious, because we instantiated the input with that shape, but if we didn't know the shape, then one simple thing we could do is actually just print out the shape, and we'd see that this last dimension is size four, so I actually need to change the input dimension in my linear layer to be size four. [00:38:14] And you'll also notice on this output we have this grad_fn; that's because PyTorch is tracking the operations on this tensor so that it can compute and store gradients for it. [00:38:32] Yeah, so typically we think of the first dimension as the batch dimension. In this case it said N; you can think of it as, if you had a batch of images, it would be the number of images; if you had a training corpus of text, it would be essentially the number of sentences or sequences. Pretty much, that is usually
considered the batch dimension. The star indicates that there can be an arbitrary number of dimensions; so for instance, if we had images, this could be a four-dimensional tensor: the batch size by the number of channels by the height by the width. [00:39:07] But in general there's no fixed number of dimensions; your input tensor can be any number of dimensions. The key is just that that last dimension needs to match up with the input dimension of your linear layer. [00:39:20] The two is the output size, so essentially we're saying that we're going to map this last dimension, which is four-dimensional, to now two-dimensional. So in general, if we're stacking a neural network, you can think of this as the input dimension size and this as the hidden dimension size.
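The shape behavior just described can be sketched in a few lines. The 2 x 3 x 4 input and the 4-to-2 layer follow the lecture's example; the variable names and the commented error line are illustrative, not the notebook's exact code.

```python
import torch
import torch.nn as nn

# Input with arbitrary leading (batch/extra) dimensions; only the
# last dimension (4) must match the layer's input size.
x = torch.randn(2, 3, 4)

# Map the last dimension from size 4 to size 2 (computes x @ W.T + b).
linear = nn.Linear(4, 2)

output = linear(x)
print(output.shape)  # torch.Size([2, 3, 2])

# Mismatching the last dimension raises a shape error, e.g.:
# nn.Linear(2, 2)(x)  # RuntimeError (shapes cannot be multiplied)

# The layer's parameters: a (2, 4) weight matrix and a (2,) bias vector.
for name, p in linear.named_parameters():
    print(name, tuple(p.shape))
```

Only the last dimension changes; every leading dimension passes through untouched.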
[00:39:44] And so one thing we can do is actually print out the parameters, and we can see what the values of our linear layer are, or in general, for any layer that we define in our neural network, what the actual parameters are. [00:39:58] In this case we see that there are two sets of parameters, because we have a bias as well as the weight of the linear layer itself, and both of them store gradients. These are what the current values of those parameters are, and they'll change as we train the network. [00:40:24] Okay, so now let's go through some of the other module layers. So in general, nn.Linear is one of the layers you have access to; you have a couple of other different layers that are pretty common: you have 2D convolutions, you have transpose convolutions, you have batch norm layers for when you need to do normalization in your network; you can do upsampling, you can do max
pooling, you can do lots of different operators. But the main key here is that all of them are built-in building blocks that you can just call, just like we did with nn.Linear. [00:41:01] And so, I guess I'm running out of time, but let's try to go through these last few layers, and then I'll wrap up by showing an example that puts it all together. [00:41:11] So in this case we can define an activation function, which is typical with our networks: we need to introduce non-linearities, and in this case we use the sigmoid function. And so now we can define our network as this very simple thing, which has one linear layer and then an activation. [00:41:29] And in general, when we compose these layers together, we don't need to actually write every single line, applying the next layer line by line; we can actually stack all of them together, in this case using nn.Sequential, and list all of the layers.
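A minimal sketch of that stacking pattern; the sizes here are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn

# Compose a linear layer and a sigmoid activation into one callable block.
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid(),
)

x = torch.randn(5, 4)
out = block(x)  # one call runs the input through every layer in order
print(out.shape)  # torch.Size([5, 2])
```

Because sigmoid is the last layer, every value of `out` lies strictly between 0 and 1.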
[00:41:44] So here we have our linear layer followed by our sigmoid, and now we're just essentially passing the input through this whole set of layers all at once: we take our input, we call block on the input, and we get the output. [00:42:02] And so let's just see, putting it all together, what it looks like to define a network and what it looks like when we train one. [00:42:08] So here we're going to actually define a multi-layer perceptron, and the way it works is: to define a neural network, you extend the nn.Module class. The key here is that there are really two main things you have to define when you create your own network. One is the initialization: in the __init__ function you actually initialize all the parameters you need; in this case we initialize an input size and a hidden size, and we actually define the model itself, in this case a simple
model which consists of a linear layer, followed by an activation, followed by another linear layer, followed by a final activation. [00:42:45] And the second function we have to define is forward, which actually does the forward pass of the network. And so here our forward function takes in our input x; in general it could take in some arbitrary number of inputs. But essentially it needs to specify how you actually compute the output, and in this case it's very simple: we just pass x into the network that we just defined and return the output. [00:43:14] And again, you could do this more explicitly, by doing what we did earlier: actually writing out all of the layers individually, instead of wrapping them into one object, and doing a line-by-line operation for each one of these layers.
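Putting the two methods together, the class just described might look roughly like this. The layer sizes, the ReLU activation, and the class name are illustrative assumptions, not the notebook's exact code.

```python
import torch
import torch.nn as nn

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Initialization: store the sizes and define the model itself:
        # linear -> activation -> linear -> final activation.
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, input_size),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # The forward pass: just run the input through the stack.
        return self.model(x)

# Instantiate and do a forward pass.
model = MultilayerPerceptron(input_size=5, hidden_size=3)
x = torch.randn(2, 5)
print(model(x).shape)  # torch.Size([2, 5])
```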
[00:43:34] And so finally, once we define our class, it's very simple to use: we can now just instantiate some input, instantiate our model by calling MultilayerPerceptron with our parameters, and then just pass the input through our model. [00:43:48] So that's great, but this is all just a forward pass; how do we actually train the network, how do we actually make it better? [00:43:55] And so this is the final step, which is that we have optimization built into PyTorch. We have this backward function, which goes and computes all these gradients in the backward pass, and now the only step left is to actually update the parameters using those gradients. And so here we'll import the torch.optim package, which contains all the optimizers that you need. [00:44:18] Essentially this part is just creating some random data so that we have something to fit. But this is really the key here, which is: we'll instantiate our model that we defined, we'll define the Adam optimizer
[00:44:35] and we'll define it with a particular learning rate. We'll define a loss function, which is again another built-in module; in this case we're using the cross-entropy loss. [00:44:44] And finally, to calculate our predictions, all we do is simply call the model on our actual input, and to calculate our loss, we just call our loss function on our predictions and our true labels, and we extract the scalar here. [00:45:00] And now, when we put it all together, this is what the training loop looks like. We have some number of epochs that we want to train our network for. For each of these epochs, the first thing we do is take our optimizer and zero out the gradient, and the reason we do that is because, as many of you noted, we actually are accumulating the gradient; we're not resetting it every time we call .backward. So we zero out the gradient, we get our model predictions
by doing a forward pass. [00:45:28] We then compute the loss between the predictions and the true values. Finally we call loss.backward(); this is what actually computes all the gradients in the backward pass from our loss. And the final step is that we call .step() on our optimizer, in this case Adam, and this will take a step on our loss function. [00:45:52] And so if we run this code, we end up seeing that we start with some training loss which is relatively high, and in 10 epochs we're able to essentially completely fit our data. [00:46:02] And if we print out our model parameters, having printed them out from the start as well, we'd see that they've changed as we've actually done this optimization.
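The loop just described can be sketched end to end. This is an illustrative version rather than the notebook's code: the random data, the sizes, and the MSE loss (standing in for the lecture's cross-entropy, to keep this toy regression self-contained) are my assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy random data to fit (stand-in for the notebook's generated data).
torch.manual_seed(0)
x = torch.randn(10, 5)
y = torch.rand(10, 1)

# Model, optimizer with a particular learning rate, and loss function.
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
optimizer = optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # the lecture used cross-entropy; MSE keeps this toy self-contained

n_epochs = 10
losses = []
for epoch in range(n_epochs):
    optimizer.zero_grad()      # zero out the accumulated gradients
    preds = model(x)           # forward pass
    loss = loss_fn(preds, y)   # loss between predictions and true values
    loss.backward()            # backward pass: compute all the gradients
    optimizer.step()           # update the parameters with those gradients
    losses.append(loss.item())
    print(epoch, loss.item())
```

Running this, the printed loss starts relatively high and drops over the epochs, mirroring the behavior described in the lecture.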
[00:46:12] So I'll wrap it up here, but I think the key takeaway is that a lot of the things that you're doing at the beginning of this class are really about understanding the basics of how neural networks work: how you actually implement them, how you implement the backward pass. The great thing about PyTorch is that once you get to the very next assignment, you'll see that, now that you have a good underlying understanding of those things, you can abstract away a lot of the complexity (how you do backprop, how you store all these gradients, how you compute them, how you actually run the optimizer) and let PyTorch handle all of that for you, and you can use all of these building blocks, all these different neural network layers, to define your own networks that you can use to solve whatever problems you need.

================================================================================ LECTURE 023 ================================================================================

Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel
Source: https://www.youtube.com/watch?v=b80by3Xk_A8

--- Transcript

[00:00:04] Hi everyone, welcome to the 224N Hugging Face Transformers tutorial. So this tutorial is just going
to be about using the Hugging Face library. [00:00:17] It's a really useful and super effective way of being able to use some off-the-shelf NLP models, specifically models that are Transformer-based, and being able to use those for either your final project, your custom final project, or just using it in the future. So it's a really helpful package to learn, and it interfaces really well with PyTorch in particular, too. [00:00:45] Okay, so first things first: in case there's anything else that you're missing from this tutorial, the Hugging Face documentation is really good; they also have lots of tutorials and walkthroughs, as well as other notebooks that you can play around with. So if you're ever wondering about something else, that's a really good place to look.
[00:01:10] Okay, so in the Colab, the first thing we're going to do (which I already did, but can maybe run again) is just installing the transformers Python package and then the datasets Python package. So these correspond to Hugging Face Transformers and Datasets, and those are really helpful: transformers is where we'll get a lot of these pre-trained models from, and datasets will give us some helpful datasets that we can potentially use for various tasks, in this case sentiment analysis. [00:01:41] Okay, and so we'll use a bit of a helper function to help us understand what encodings are actually happening, so we'll run this just to kick things off and import a few more things. [00:01:59] Okay, so first: this is generally the step-by-step for how to use something off of Hugging Face. So first, what we'll do is we'll
find some model from the Hugging Face Hub here. [00:02:12] And note that there are a ton of different models that you're able to use: there's BERT, there's GPT-2, there's T5-small, which is another language model from Google. So there are a bunch of these different models that are pre-trained, and all of their weights are up here on Hugging Face, freely available for you to download. So if there's a particular model you're interested in, you can probably find a version of it here. [00:02:40] You can also see different types of models on the side for a specific task; so if we wanted to do something like zero-shot classification, there are a couple of models that are specifically good at doing that particular task. [00:02:55] Okay, so based off of what task you're looking for, there's probably a Hugging Face model for it that's available online for
[00:03:02] okay, so that's what we'll do first: we'll go ahead and find a model on the Hugging Face Hub for whatever you want to do, in this case sentiment analysis. And then there are two things that we need next: the first is a tokenizer, for actually splitting your input text into tokens that your model can use, and the second is the actual model itself. The tokenizer converts the text to vocabulary IDs, these discrete IDs that your model can actually take in, and the model will produce some prediction based off of that. [00:03:41] okay, so first what we can do is import this AutoTokenizer and this AutoModelForSequenceClassification. What this will do initially is download some of the key things that we need so that we can actually initialize these.
[00:04:00] So what does each of these do? First the tokenizer: this AutoTokenizer is from some pre-trained tokenizer that has already been used, and in general there's a corresponding tokenizer for every model that you want to try and use; in this case it's SiEBERT, so something around sentiment and RoBERTa. And second, you can import this model for sequence classification as well, from something pre-trained on the Model Hub again, so this corresponds to sentiment-roberta-large-english, and if we want we can even find this over here; I think, yeah, large English. So again this is something we can easily find: you just copy this string up here and then you can import that. [00:04:52] okay, we've downloaded all of the things that we need, some binary files as well, and now we can go ahead and actually use it.
[00:05:02] So this gives you some input, right, this input string "I'm excited to learn about Hugging Face Transformers"; we'll get some tokenized inputs here after we pass it through the tokenizer, and then lastly we'll get some notion of the model output, so this is some logits here over whatever classification we have, in this case good or bad, and then some corresponding prediction. [00:05:35] okay, and we'll walk through what this looks like in just a second in a little more depth, but this is broadly how we can actually use these together: we'll tokenize some input and then we'll pass those inputs to the model. [00:05:47] So we'll talk about tokenizers first. Tokenizers are used for basically just pre-processing the inputs that you get for any model.
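The overall flow just described, tokenize the input, run the model, read off logits and a prediction, can be sketched with toy stand-ins. Everything below (the vocabulary, the IDs, the "model") is invented for illustration; in the actual Colab these roles are played by the Hugging Face tokenizer and the pre-trained sentiment model.

```python
# Toy sketch of the tokenize -> model -> prediction flow described above.
# The vocabulary and "model" are made up; a real Hugging Face tokenizer and
# model do the same jobs with learned vocabularies and weights.

TOY_VOCAB = {"[CLS]": 101, "[SEP]": 102, "good": 2204, "movie": 3185, "bad": 2919}

def toy_tokenize(text):
    """Split on whitespace and map each token to its vocabulary ID."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [TOY_VOCAB[t] for t in tokens]

def toy_model(input_ids):
    """Return fake logits over two classes: [negative, positive]."""
    score = sum(1 for i in input_ids if i == 2204) - sum(1 for i in input_ids if i == 2919)
    return [-float(score), float(score)]

def predict(text):
    logits = toy_model(toy_tokenize(text))
    return "positive" if logits[1] > logits[0] else "negative"

print(predict("good movie"))  # -> positive
```

The shape is the same as in the notebook: a string goes in, discrete IDs come out of the tokenizer, the model turns those into logits, and the prediction is whichever class has the larger logit.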
[00:05:58] A tokenizer takes some raw string and essentially maps it to some number or ID that the model can take in and actually understand. Tokenizers are either specific to the model that you want to use, or you can use the AutoTokenizer, which will conveniently import whatever corresponding tokenizer you need for that model type. That's the helpfulness of the AutoTokenizer: it'll make that selection for you and make sure that you get the correct tokenizer for whatever model you're using. [00:06:34] So the question is, does it make sure that everything is mapped to the correct index that the model was trained on? The answer is yes, and that's why the AutoTokenizer is helpful. [00:06:44] So there are two types of tokenizers: there's a Python tokenizer, and there's also a tokenizer fast, which is written in Rust.
[00:06:58] In general, if you use the AutoTokenizer it'll just default to the fast one. There's not really a huge difference here; it's just about the time it takes to get the outputs. [00:07:08] Yeah, so the question is whether the tokenizer creates dictionaries of the model inputs. I think the way to think about a tokenizer is like that dictionary, almost, right: you want to translate, or have this mapping from, the tokens that you get from the string into some inputs that the model will actually use. We'll see an example of that in just a second. [00:07:39] So for example, we can call the tokenizer the way we would call a typical PyTorch model, but we're just going to call it on a string.
[00:07:48] So here our input string is "Hugging Face Transformers is great"; we pass that into the tokenizer almost like it's a function, and then we'll get out some tokenization. This gives us a set of input IDs, so to answer the earlier question, these are basically the numbers that each of these tokens represents, so that the model can actually use them, and then a corresponding attention mask for the particular Transformer. [00:08:21] okay, so there are a couple of ways of accessing the actual tokenized input IDs: you can treat the output like a dictionary, hence thinking about it almost in that dictionary form, or it's also just a property of the output that you get, so there are two ways of accessing this in a pretty Pythonic way. [00:08:43] So what we can see as well is that we can look at the actual tokenization process, and this can maybe give some insight into what happens at each step.
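Those per-step internals can be sketched with a toy vocabulary. The vocabulary and the special-token IDs below are invented; a real tokenizer does the same three things (split into tokens, convert tokens to IDs, add special tokens) with its own learned vocabulary.

```python
# Illustrative sketch of the three tokenization steps discussed in the
# lecture. VOCAB and the [CLS]/[SEP] IDs are made up for the example.

VOCAB = {"hugging": 5, "face": 6, "transformers": 7, "is": 8, "great": 9}
CLS_ID, SEP_ID = 101, 102  # invented special-token IDs

def tokenize(text):                    # step 1: split the string into tokens
    return text.lower().split()

def convert_tokens_to_ids(tokens):     # step 2: look each token up in the vocab
    return [VOCAB[t] for t in tokens]

def add_special_tokens(ids):           # step 3: wrap with [CLS] ... [SEP]
    return [CLS_ID] + ids + [SEP_ID]

tokens = tokenize("Hugging Face Transformers is great")
ids = convert_tokens_to_ids(tokens)
print(add_special_tokens(ids))  # -> [101, 5, 6, 7, 8, 9, 102]
```

A real fast tokenizer fuses these steps when you call it on a string, but conceptually this is the pipeline being walked through.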
[00:08:57] So our initial input string is going to be "Hugging Face Transformers is great". The next step is that we actually want to tokenize these individual words that are passed in, so here this is the output of this tokenization step: we get these individual split tokens, we'll convert them to IDs here, and then we'll add any special tokens that our model might need for actually performing inference on this. [00:09:33] So there are a couple of steps that happen underneath when you use a tokenizer, a few things at a time. [00:09:44] One thing to note is that for fast tokenizers there are other options that you're able to get to as well: you have this input string, you have the number of tokens that you get, and you might have some notion of the special token mask as well.
[00:10:03] So using char_to_word is going to give you the word a particular character in the input belongs to; this is just giving you additional options that you can use with the fast tokenizer for understanding how the tokens are derived from the input string. [00:10:25] okay, so there are different ways of using the outputs of these tokenizers too. One is that if you indicate that you want it to return a tensor, it can also return a PyTorch tensor, which is great in case you need a PyTorch tensor, which you probably generally want. [00:10:49] You can also pass multiple strings into the tokenizer and then pad them however you need; so here, for example, we can see the pad token being this [PAD] bracket, and its token ID is going to correspond to zero.
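The padding being introduced here can be sketched in plain Python. The pad ID of 0 matches the lecture's example; the sequences are invented, and a real tokenizer does this internally when you pass padding=True.

```python
# Minimal sketch of what padding does: pad every sequence in a batch to the
# length of the longest one, and mark real tokens vs. padding in the
# attention mask (1 = attend, 0 = ignore). Sequences here are invented.

PAD_ID = 0  # pad token ID, as in the lecture's example

def pad_batch(batch):
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)            # add pad tokens
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # zeros over padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(out["input_ids"])       # -> [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(out["attention_mask"])  # -> [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

This is exactly the pairing the lecturer points out: padding tokens in the input IDs, matched by zeros in the attention mask.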
[00:11:09] So it's just going to add padding to whatever input you give: if you need your outputs to be the same length for a particular type of model, this will add those padding tokens and then correspondingly gives you the zeros in the attention mask where you actually need them. [00:11:28] okay, and the way to do that here is you basically set padding to be true; you can also set truncation to be true as well. And if there are any other features of the tokenizer that you're interested in, again you can check the Hugging Face documentation, which is pretty thorough about what each of these things does. [00:11:52] Yeah, so the question is about the '##' pieces, and whether that means that we should have a space before or not.
[00:12:09] So in this case we probably don't want the space before, right, just because "hugging" is all one word. Generally, for the tokenizers, the output that they give is still pretty consistent in terms of how the tokenization process works; there might be instances where it's contrary to what you might expect for how something is tokenized, but in general the tokenization works fine, so in most cases the direct output that you get from the Hugging Face tokenizer is sufficient. [00:12:57] okay, awesome. So one last thing, past adding additional padding, is that you can also decode an entire batch at one given time.
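A batch decode like the one about to be shown can be sketched with a toy ID-to-token table. The table and IDs below are invented; the real tokenizer.batch_decode, with skip_special_tokens=True, behaves analogously over its own vocabulary.

```python
# Toy sketch of batch decoding with the option to skip special tokens.
# ID_TO_TOKEN and SPECIAL_IDS are made up for illustration.

ID_TO_TOKEN = {101: "[CLS]", 102: "[SEP]", 0: "[PAD]", 7: "transformers", 9: "great"}
SPECIAL_IDS = {101, 102, 0}  # [CLS], [SEP], [PAD]

def batch_decode(batch_ids, skip_special_tokens=False):
    decoded = []
    for ids in batch_ids:
        if skip_special_tokens:
            ids = [i for i in ids if i not in SPECIAL_IDS]  # drop specials
        decoded.append(" ".join(ID_TO_TOKEN[i] for i in ids))
    return decoded

batch = [[101, 7, 102, 0], [101, 9, 102, 0]]
print(batch_decode(batch))                            # specials kept in the text
print(batch_decode(batch, skip_special_tokens=True))  # -> ['transformers', 'great']
```

The first call shows the padding and special tokens leaking into the decoded strings, which is why the skip option exists.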
[00:13:14] So if we look again, our tokenizer will have this method called batch_decode. If we have the model inputs that we get up here, the output of passing these sentences or strings into the tokenizer, we can go ahead and just pass the input IDs that correspond to that into batch_decode, and it'll give us the decoding that corresponds to all the padding we added and each of the particular words and strings. And if you want to ignore the presence of these padding tokens or anything like that, you can also pass in skip_special_tokens. [00:13:59] Gotcha. So this is a pretty high-level overview of how you would want to use tokenizers when using Hugging Face. [00:14:10] So now we can talk about how to use the Hugging Face models themselves.
[00:14:16] So again, this is pretty similar to what we saw for initially using a tokenizer: you just choose the specific model type for your model and use that, or the specific AutoModel class, where again this AutoModel takes care of the initialization process for you in a pretty easy way without too much overhead. [00:14:46] Additionally, the pre-trained Transformers that we have generally share the same underlying architecture, but you'll have different heads associated with each Transformer, heads you might have to train if you're doing some sequence classification or just some other task; Hugging Face will do this for you, and I will walk through an example of how to do this for sentiment analysis.
[00:15:16] So if there's a specific context like sequence classification we want to use, we can use the very specific class Hugging Face provides, so DistilBertForSequenceClassification. Alternatively, if we were using DistilBERT in a masked language model setting, we'd use DistilBertForMaskedLM, and lastly, if we're using it purely for the representations that we get out of DistilBERT, we just use the baseline model. So the key takeaway here is that there are task-specific classes that we can use from Hugging Face to initialize. [00:15:56] So AutoModel again is similar to the AutoTokenizer: it's just going to load that specific model by default, and in this case that's going to be just the basic weights that you need.
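The "same backbone, different heads" idea behind those classes can be sketched in a few lines of toy code. Everything here is invented (a fake one-number-per-token "encoder" and threshold "head"); in Hugging Face the analogous pair would be the bare DistilBertModel versus DistilBertForSequenceClassification.

```python
# Sketch of one shared encoder backbone with two different "heads":
# returning raw representations vs. a classification decision.
# All numbers and logic are invented for illustration.

def encoder(input_ids):
    """Shared backbone: pretend each token maps to a 1-D 'hidden state'."""
    return [float(i) / 10.0 for i in input_ids]

def base_model(input_ids):
    """Bare model: just return the representations (like DistilBertModel)."""
    return encoder(input_ids)

def sequence_classifier(input_ids):
    """Classification head on top: pool the hidden states, then threshold."""
    hidden = encoder(input_ids)
    pooled = sum(hidden) / len(hidden)
    return 1 if pooled > 0.5 else 0  # two-class prediction

print(base_model([3, 9]))           # raw per-token representations
print(sequence_classifier([3, 9]))  # a class label from the head
```

The point is that both functions share `encoder`; only the small layer on top differs, which is why the task-specific classes can all reuse the same pre-trained weights.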
[00:16:19] So here we'll have basically three different types of models that we can look at: one is an encoder-type model, which is BERT; a decoder-type model like GPT-2, which is doing things like generating some text, potentially; and encoder-decoder models, so BART or T5 in this case. So again, if you go back to the Hugging Face Hub, there's a whole sort of different types of models that you could potentially use, and if we look in the documentation as well, we can understand some notion of the different types of classes that we might want to use, right: there's the AutoTokenizer, and different AutoModels for different types of tasks. So here again, if you have any specific use cases that you're looking for, you can check the documentation.
[00:17:18] Again, if you use an AutoModel from_pretrained, you'll just create a model that's an instance of the underlying model class, in this case a BertModel for the BERT-based case. [00:17:31] okay, before we go ahead and start, one last thing to note is that the particular choice of your model matches up with the type of architecture that you have to use, right: these different types of models can perform specific tasks, so you're not going to be able to load or use BERT, for instance, or DistilBERT, as a sequence-to-sequence model, which requires the encoder and decoder, because DistilBERT only consists of an encoder. So there's a bit of a limitation on how exactly you can use these, but it's basically based on the model architecture itself. [00:18:16] okay, awesome, so let's go ahead and get started here.
[00:18:21] So similarly, here we can import AutoModelForSequenceClassification; again, we're going to perform some classification task, and we'll import this AutoModel here so that we don't have to reference something like DistilBertForSequenceClassification: we'll be able to load it automatically and it'll be all set. Alternatively, we can do DistilBertForSequenceClassification here, and that specifically will require DistilBERT to be the input there. okay, so these are two different ways of basically getting the same model, one using the AutoModel and one using explicitly DistilBERT. [00:19:02] Cool. And here, because it's classification, we need to specify the number of labels, or the number of classes that we're actually going to classify for each of the input sentences. [00:19:13] okay, so here we'll get a warning if you are following along and you print this out.
[00:19:22] That's because some of the sequence classification parameters aren't trained yet, and so we'll go ahead and take care of that. [00:19:30] So here, similarly, we'll walk through how to actually train some of these models. The first question is: how do you actually pass any of the inputs that you get from a tokenizer into the model? Well, we get some model inputs from the tokenizer up here, and we pass these into the model by specifying that the input IDs are the input IDs from the model inputs, and likewise we can specifically pass in that the attention mask is going to correspond to the attention mask that we got from these outputs of the tokenizer. [00:20:14] okay, so this is option one, where you specifically identify which property goes to what.
Pythonic hack, almost, where you can directly pass in the model inputs. This will basically unpack the keys of the model inputs: the input_ids key corresponds to the input_ids argument, and the attention_mask key corresponds to the attention_mask argument. So when we use this star-star syntax, it unpacks our dictionary and maps each entry to the argument with the same name. This is an alternative way of passing it into the model; both are going to be the same. [00:21:05] Okay, so now what we can do is actually print out what the model outputs look like. Again, these are the inputs, the token IDs and the attention mask, and then second we'll get the actual model outputs. Notice that the outputs are given by these logits here; there's two of them. We passed in one example, and
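The two calling styles can be demonstrated without any model at all; `fake_model` below is a hypothetical stand-in whose keyword parameters mirror the names a Hugging Face model expects:

```python
def fake_model(input_ids=None, attention_mask=None):
    # Stand-in for model(...); real models take these same keyword names.
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# What a tokenizer hands back: a dict keyed by parameter name.
model_inputs = {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]}

# Option 1: name each argument explicitly.
out1 = fake_model(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
)

# Option 2: ** unpacks the dict, mapping each key to the parameter
# of the same name.
out2 = fake_model(**model_inputs)

assert out1 == out2  # both calls are equivalent
```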
there's kind of two potential classes that we're trying to classify. Okay, and then lastly we have, of course, the corresponding distribution over the labels here, since this is going to be binary classification. Yes, it's a little bit weird that you have two classes for a binary classification task, and you could basically just choose to classify one class or not, but we do this just because of how Hugging Face models are set up. [00:21:57] Additionally, these models that we load in from Hugging Face are basically just PyTorch modules. These are the actual models, and we can use them in the same way that we've been using models before. That means things like loss.backward() will actually do the backpropagation step corresponding to the loss of the inputs that you
pass in. So it's really easy to train these guys: as long as you have a label for your data, you can calculate your loss using the PyTorch cross-entropy function, you get some loss back, and then you can go ahead and backpropagate it. You can even get the parameters in the model that would get updated from this; this is just some big tensor of the actual embedding weights that you have. [00:22:57] Okay, we also have a pretty easy way for Hugging Face itself to calculate the loss that we get. So again, if we tokenize some input string, we get our model inputs; we have two labels, positive and negative; we give some corresponding label that we assign to the model inputs, and we pass this in. We can see here that the actual model outputs given by Hugging
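Because these are plain PyTorch modules, the usual training idiom applies. A sketch with a tiny `nn.Linear` standing in for the Hugging Face model (the stand-in is my simplification; in practice you'd use the real model's logits):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)     # stand-in producing 2 logits, like binary classification
inputs = torch.randn(1, 4)  # one encoded example
labels = torch.tensor([1])  # gold label for that example

logits = model(inputs)
loss = nn.functional.cross_entropy(logits, labels)  # PyTorch cross-entropy
loss.backward()             # ordinary backpropagation

# gradients are now populated on the model's parameters
assert model.weight.grad is not None
```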
Face include this loss here, right? It'll include the loss corresponding to that input. Anyway, it's a really easy way of calculating the loss natively in Hugging Face, without having to call anything additional from the PyTorch library. [00:23:44] And lastly, if we have these two labels here, again positive or negative, what we can do is just take the model outputs, look at the logits, and see which one is the biggest. We pass that to the argmax, which gives the index that's largest, and that's the output label the model is actually predicting. So again, it's a really easy way of doing this sort of classification, getting the loss, getting what the actual labels are, just from within Hugging Face. [00:24:22] Okay, awesome. So the last thing as well is that we can
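The argmax step over the two logits looks like this (the logit values here are made up for illustration):

```python
import torch

id2label = {0: "negative", 1: "positive"}

logits = torch.tensor([[-0.3, 1.2]])        # one example, two classes
probs = torch.softmax(logits, dim=-1)       # distribution over the labels
pred = torch.argmax(logits, dim=-1).item()  # index of the largest logit
print(id2label[pred])                       # -> positive
```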
also look inside the model in a pretty cool way, and see what attention weights the model actually has. This is helpful if you're trying to understand what's going on inside some NLP model. So here we can again import our model from some pretrained model weights in the Hugging Face Hub; we want to set output_attentions to true and output_hidden_states to true. These are going to be the key arguments we use when we're actually investigating what's going on inside the model at each point in time. Again, we'll set the model to be in eval mode, and lastly we'll go ahead and tokenize our input string again. We don't really care about any of the gradients here;
again, we don't actually want to backpropagate anything here, and finally we pass in the model inputs. So now, when we print out the model hidden states (this is a new property in the output dictionary that we get), we can look at what these actually look like. And sorry, this is a massive output. [00:25:54] So you can actually look at the hidden-state size per layer, which gives a notion of what we're going to be looking at, what the shape is at each given layer in our model, as well as the attention-head size per layer, so this gives you the shape of what you're looking at. And then if we look at the model output itself, we'll get all of these different hidden states; we have tons and tons of these different hidden states, and we'll have the last hidden state here.
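Put together, the inspection setup might look like the following sketch; the `distilbert-base-uncased` checkpoint and the input string are my stand-ins for whatever the notebook uses. For DistilBERT you get 6 attention tensors (one per layer) of shape (batch, num_heads, seq_len, seq_len) and 7 hidden-state tensors (the embeddings plus 6 layers) of shape (batch, seq_len, 768):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,     # return per-layer attention weights
    output_hidden_states=True,  # return per-layer hidden states
)
model.eval()                    # inference mode: disables dropout etc.

inputs = tokenizer("Hugging Face Transformers is great", return_tensors="pt")
with torch.no_grad():           # no gradients needed for inspection
    out = model(**inputs)

print(len(out.hidden_states), out.hidden_states[0].shape)
print(len(out.attentions), out.attentions[0].shape)
```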
So the model output is pretty robust for showing you what the hidden state looks like, as well as what the attention weights actually look like. In case you're trying to analyze a particular model, this is a really helpful way of doing that. [00:26:45] So, the question is: what does model.eval() do? What it does (and this is true for any PyTorch module or model) is set it into quote-unquote eval mode. Again, here we're not trying to calculate any gradients or anything like that corresponding to data that we pass in, or trying to update our model in any way; we just care about evaluating it on that particular data point. So for that it's helpful to set the model into eval mode, essentially to make sure that it disables some of
that stuff that you'd use during training time, so it just makes it a little more efficient. [00:27:37] Yeah, the question was: it's already pretrained, so can you go ahead and evaluate it? Yeah, you can; this is just the raw pretrained model with no fine-tuning. [00:27:47] So the question is how to interpret these shapes, for the attention-head size and the hidden-state size. The key thing here is that you'll want to look at the shape given on the side; it corresponds to the layer that you're actually looking at. So here, when we looked at the shape, we were specifically looking at the first one in this list, so this gives us the first hidden layer. The second gives us a notion of the batch that we're looking at, and
then the last is some tensor, a 768-dimensional representation that corresponds there. And for the attention-head size, it corresponds to the actual query word and the key word for these last two here. [00:28:48] For this, we would expect the initial index here, the one, to be bigger if we printed out all of the layers, but we're just looking at the first one here. [00:29:01] So we can also get some notion of how this actually looks and plot out these axes as well. So again, if we take this same kind of model input (again, this Hugging Face Transformers library is great), we're actually trying to see what these representations look like on a per-layer basis. So what we can do here
[00:29:30] is basically, for each layer that we have in our model (and again, this is purely from the model output attentions, the actual outputs of the model), and then for each head, we can analyze essentially what these representations look like, and in particular what the attention weights are across each of the tokens that we have. This is a good way of understanding what your model is actually attending to within each layer. On the side, if we look here (maybe zoom in a bit), we can see that this corresponds to the different layers, and the top corresponds to the different attention heads. Okay, this will just give you some notion of what the weights are. [00:30:21] So again, just to clarify, if we maybe look at the labels, sorry,
[00:30:25] it's a little cut off and zoomed out, but this y-axis here, these different rows, corresponds to the different layers within the model. On the x-axis we have the different attention heads that are present in the model. And so for each head, at each layer, we can basically get a sense of how the attention distribution is being distributed: what's being attended to, corresponding to each of the tokens that you actually get here. So if we look up again here as well, we're just trying to look at basically the model attentions that we get for each corresponding layer. [00:31:17] The question is: what's the color key? Yellow is higher magnitude, a higher value, and darker is closer to zero, so probably very
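The layer-by-head grid can be sketched with matplotlib; the random tensors below are stand-ins with DistilBERT's shapes (6 layers, 12 heads), in place of the real `attentions` output. Matplotlib's default colormap matches the description here: yellow for large weights, dark navy near zero:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import torch

num_layers, num_heads, seq_len = 6, 12, 7  # DistilBERT-sized stand-in
attentions = [torch.rand(1, num_heads, seq_len, seq_len) for _ in range(num_layers)]

fig, axes = plt.subplots(num_layers, num_heads, figsize=(12, 6))
for layer, att in enumerate(attentions):  # rows: layers
    for head in range(num_heads):         # columns: attention heads
        ax = axes[layer, head]
        ax.imshow(att[0, head])           # default viridis: yellow high, navy low
        ax.set_xticks([])
        ax.set_yticks([])
fig.savefig("attention_grid.png")
```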
navy is zero. [00:31:33] So what we can do now is maybe walk through what a fine-tuning task looks like. First, in a project you're probably going to want to fine-tune a model, and that's fine; we'll go ahead and walk through an example of what that looks like here. [00:31:53] Okay, so what we can do as well is use some of the datasets that we can get from Hugging Face. It doesn't just have models; it has really nice datasets, and you're able to load those in as well. So here what we're going to be looking at is the IMDB dataset, again for sentiment analysis. We'll just look at only the first 50 tokens or so; generally, this is a helper function that we'll use for truncating the output that we get. And
then lastly, for actually making this dataset, we can use the DatasetDict class from Hugging Face. Again, that will basically give us this smaller dataset, specifying what we want for the train dataset as well as for validation. So here, what we're going to do for our mini dataset, for the purpose of this demonstration, is make train and val both from the IMDB train dataset. We'll shuffle it a bit, and then we're just going to select 128 examples and then 32 for validation. So it'll shuffle it around, take the first 128, and then take the next 32.
Um, and then we'll truncate those particular inputs that we get, again just to make sure we're efficient and can actually run this on a CPU. [00:33:38] Okay, so next what we can do is just see what this looks like. Again, this is kind of just like a dictionary (it's a wrapper class, almost) giving you your train dataset and then your validation dataset. And in particular, you can even look at what the first 10 of these look like. So first, the output: we specify train, we want to look at the first 10 entries in our train dataset, and the output of this is going to be a dictionary as well, which is pretty cool. So we have the first 10 text examples, which give the actual movie reviews here (this is given in a list), and then the second key that you get is the labels corresponding to each of
these, so whether it's positive or negative. Here, one is going to be a positive review and zero is negative, so it makes it really easy to use this for something like sentiment analysis. [00:34:43] Okay, so what we can do is go ahead and prepare the dataset and put it into batches of 16. What does this look like? We can call the map function that this small dataset dictionary has. So we call map and pass in a lambda function of what we want to actually do. Here, the lambda function is: for each example that we have, we want to tokenize the text. So this is basically saying how we want to preprocess this. And so here we're extracting the tokens' input IDs that we'll pass to the model, we're adding padding and truncation as well, we're going to do this in a batch, and the batch size will be 16.
Hopefully this makes sense. [00:35:36] Okay, so next we're basically just going to do a little more modification on what the dataset actually looks like. We're going to remove the column that corresponds to text, and then we're going to rename the column 'label' to 'labels'. So again, if we see this, it was called 'label'; we're just going to call it 'labels'. And we're going to remove the text column because we don't really need it anymore; we've already preprocessed our data into the input IDs that we need. Okay, and lastly, we're going to set the format to torch, so we can go ahead and just pass this into our model, our PyTorch model. [00:36:19] The question is: what is 'labels'? So 'label' here corresponds to, again in the context of sentiment analysis, just positive or negative, and here we're just renaming the
column. [00:36:33] Okay, so now we'll just go ahead and see what this looks like. Again, we're going to look at the train set and only these first two things. So here now we have the two labels that correspond to each of the reviews, and the input IDs that we get corresponding to each of the reviews as well. Lastly, we also get the attention mask. So it's basically just taking what you get out of the tokenizer and adding it back into the dataset, so it's really easy to pass in. [00:37:03] The question is: we truncated, which makes things easy, but how do you apply padding evenly? So first, you could manually set some high truncation limit, like we did; the second is that you can just go ahead and set padding to true, and then the padding is basically added based on
[00:37:39] Yeah, so the question is, I guess, doing it for all of them, all the text lists, evenly. So again, it just depends on the size of the data set you're loading in, right? If you're looking at particular batches at a time, you can just pad within that particular batch; you don't need to load the whole data set into memory and pad the entire data set the same way. So it's fine to do it within just batches.
[00:38:08] Yeah, the question was how were the input IDs added, and the answer is yes, it's basically done automatically. We had to manually remove the text column here, and this first line here does it. If you recall, the outputs of the tokenizer are basically just the input IDs and the attention mask.
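The two padding options from that answer can be sketched in plain Python; with a Hugging Face tokenizer they correspond roughly to `tokenizer(texts, truncation=True, max_length=...)` versus `tokenizer(texts, padding=True)`, which pads to the longest sequence in the batch. The pad ID and token IDs below are made up:

```python
PAD_ID = 0  # hypothetical pad token id

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest sequence in this batch only,
    mirroring per-batch (dynamic) padding."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)
print([len(seq) for seq in padded])  # [5, 5]: both padded to the batch maximum
```

Padding per batch, as the answer suggests, avoids padding the whole corpus to one global length.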
So it's smart enough to basically aggregate those together.
[00:38:39] Okay, the last thing we're going to do is basically just put these together. So we have this data set now that looks great; we're just going to import a PyTorch DataLoader, a typical, normal data loader, and then go ahead and load each of these data sets that we just had, specifying the batch size to be 16.
[00:39:02] Okay, so that's fine and great, and now, for training the model, it's basically exactly the same as what we would do in typical PyTorch. So again, you still want to compute the loss, you can backpropagate the loss, and everything. Yeah, so it's really up to your own design how you do the training. There are only a few asterisks here. One is that you can import specific optimizer types from the Transformers package.
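The DataLoader step can be sketched as follows; random tensors stand in for the tokenized reviews so the snippet runs on its own, and the batch size of 16 matches the walkthrough:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the tokenized train split: 32 fake examples of length 8.
input_ids = torch.randint(0, 1000, (32, 8))
labels = torch.randint(0, 2, (32,))
train_dataset = TensorDataset(input_ids, labels)

# A plain PyTorch DataLoader, as in the walkthrough.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

batch_ids, batch_labels = next(iter(train_loader))
print(batch_ids.shape)  # torch.Size([16, 8])
```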
So you can do Adam with weight decay, and you can get a linear schedule for the learning rate, which will decrease the learning rate over time for each training step. So again, it's basically up to your choice, but if you look at the structure of this code: we load the model for classification, we set a number of epochs and however many training steps we actually want to do, and we initialize our optimizer and get some learning rate schedule. And then from there it's basically the same thing as what we would do for a typical PyTorch model: we set the model to train mode, go ahead and pass in all these batches from the data loader, and then backpropagate, step the optimizer, and everything like that. So it's pretty similar to what we're used to seeing, essentially.
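A minimal sketch of that loop, assuming a toy linear model in place of the downloaded classifier so it runs offline; `get_linear_schedule_with_warmup` is the Transformers helper mentioned above, while AdamW here comes from `torch.optim` (Transformers' own AdamW re-export has been deprecated in recent versions):

```python
import torch
from torch.nn import functional as F
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Toy stand-in for the sequence classifier (no weights to download).
model = torch.nn.Linear(8, 2)

features = torch.randn(32, 8)
labels = torch.randint(0, 2, (32,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16)

num_epochs = 2
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

model.train()  # train mode, as in the walkthrough
for epoch in range(num_epochs):
    for batch_features, batch_labels in train_loader:
        logits = model(batch_features)
        loss = F.cross_entropy(logits, batch_labels)
        loss.backward()       # backpropagate the loss
        optimizer.step()      # step the optimizer
        scheduler.step()      # linearly decay the learning rate
        optimizer.zero_grad()
```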
[00:40:42] Awesome, so that'll go do its thing at some point.
[00:40:47] Okay, so that's one potential option: if you really like PyTorch, you can just go ahead and do that, and it's really nice and easy. The second thing is that Hugging Face actually has a trainer class that you're able to use that can handle most of these things. So again, if we do the same thing here, this will actually run once our model is done training. We can create our data set in the same way as before. Now what we need to use is this import of a training-arguments class: this is going to be basically a dictionary of all the things that we want to use when we actually train our model, and then this additional trainer class, which will handle the training kind of magically for us and wrap around it in that way.
[00:41:47] Okay, anyway. Okay, I think we're missing a directory, but I think, yeah, it's pretty straightforward how you want to train.
[00:41:56] So for here, at least, again there are the two key arguments. The first is training arguments. This has a number of specifications that you can actually pass through to it: where you want to log things; the batch size during training or during evaluation time for each device (in this case we're just using one GPU, but potentially you're using multiple GPUs); how long you want to train it for; how you want to evaluate it, which here is evaluating on an epoch level; what the learning rate is; and so on. So again, if you check the documentation, you can see that here there's a bunch of different arguments that you can give.
There's warm-up steps, warm-up ratio, weight decay; there are so many things. So again, it's basically like a dictionary; feel free to look at the different arguments you can pass in, but there are a couple of key ones here, and this basically mimics the same arguments that we used before in our explicit PyTorch method, here for Hugging Face.
[00:43:09] Similarly, what we do is just pass this into the trainer, and that will take care of basically everything for us, so that whole training loop that we did before is condensed into this one class for actually doing the training. We pass the model, the arguments, the train data set, the eval data set, what tokenizer we want to use, and then some function for computing metrics.
So here we pass in this function, eval, and it takes eval predictions as input. Basically, these predictions are given from the trainer and passed into this function, and we can split them into the actual logits and the labels that are predicted (or sorry, the ground-truth labels that we have), and then from here we can just calculate any sort of additional metrics we want, like accuracy, F1 score, recall, or whatever you want.
[00:44:07] Okay, so this is an alternative way of formulating that training loop.
[00:44:13] Okay, the last thing here as well is that we can have some sort of callback if you want to do things during the training process. So after every epoch or something like that, you might want to evaluate your model on the validation set, or just go ahead and dump some sort of output; that's what you can use a callback for. And so here, this is just a logging callback.
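The metrics function can be sketched like this; the name `compute_metrics` and the fake logits/labels are illustrative, but the shape of the callback (a logits/labels pair in, a dict of metrics out) is what the trainer expects:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Split the evaluation predictions into logits and ground-truth labels,
    then compute whatever metrics we want (here, just accuracy)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Tiny fake evaluation output: 3 examples, 2 classes, 2 of 3 predicted correctly.
fake_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]])
fake_labels = np.array([1, 0, 0])
print(compute_metrics((fake_logits, fake_labels)))
```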
It's just going to log that information about the process itself. Again, not super important, but in case you're looking to do any sort of callback during training, it's an easy way to add it in. The second is if you want to do early stopping as well. Early stopping will basically stop your model early, as it sounds, if it's not learning anything and a bunch of epochs are going by, and you can set that so that you don't waste compute time, or so you can see the results more easily. The question is whether there's a good choice for the patience value. I think it just depends on the model architecture; not really, I guess. It's pretty much up to your discretion.
[00:45:31] Okay, awesome. And so the last thing that we do is just call trainer.train. If you recall, this is just the instantiation of this trainer class.
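Both kinds of callback can be sketched as follows; the `LoggingCallback` class is a hypothetical minimal example, while `EarlyStoppingCallback` is the real Transformers class, with a patience of 3 chosen arbitrarily (as noted, the right value is up to your discretion):

```python
from transformers import EarlyStoppingCallback, TrainerCallback

class LoggingCallback(TrainerCallback):
    """Hypothetical callback that reports progress at the end of each epoch."""
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"finished epoch {state.epoch}")

# Stop training if the monitored metric fails to improve for 3 evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

# Both would be handed to the trainer via:
#   Trainer(..., callbacks=[LoggingCallback(), early_stopping])
```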
We call trainer.train and it'll just kind of go. So now it's training, which is great; it gives us a nice estimate of how long things are taking, what's going on, and what arguments we actually passed in. So that's just going to run, and likewise hopefully it'll train relatively quickly; okay, it'll take two minutes. We can also evaluate the model pretty easily as well: we just call trainer.predict on whatever data set we're interested in, so here it's the tokenized data set corresponding to the validation data set.
[00:46:22] Okay, hopefully we can pop that out soon. And lastly, if we saved anything to our model checkpoints (so hopefully this is saving stuff right now; yeah, this is going to continue saving stuff to the folder that we specified), then in case we ever want to load our model again from the weights that we've actually saved, we just pass in the name of the checkpoint, the relative path here to our checkpoint.
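Reloading from a checkpoint folder can be sketched like this; a tiny randomly initialised BERT stands in for the fine-tuned model so the snippet runs offline, and a temporary directory stands in for the `checkpoint-8` folder the trainer wrote (same on-disk layout, same `from_pretrained` call):

```python
import tempfile

import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random model in place of the fine-tuned checkpoint (no download).
config = BertConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                    intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)

ckpt_dir = tempfile.mkdtemp()    # stand-in for e.g. "results/checkpoint-8"
model.save_pretrained(ckpt_dir)  # writes config + weights, like a trainer checkpoint
reloaded = BertForSequenceClassification.from_pretrained(ckpt_dir)

# The reloaded weights match the saved ones exactly.
same = torch.equal(model.classifier.weight, reloaded.classifier.weight)
print(same)  # True
```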
So notice how we have some checkpoint-8 here, right? We just pass in the path to that folder, we load it back in along with the tokenizer, and it's the same thing as we did before.
[00:47:11] There are a few additional appendices for how to do different tasks as well: there's an appendix on generation, how to define a custom data set, and how to pipeline different tasks together. So this is using a pre-trained model that you can just use through the pipeline interface really easily, on different types of tasks like masked language modeling. But feel free to look through those on your own time, and, yeah, thanks a bunch.

================================================================================ LECTURE INDEX.md
================================================================================

CS224N – NLP with Deep Learning

Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPDoLWjDaRAdP9D
Total Videos: 23
Transcripts Downloaded: 23
Failed/No Captions: 0

---

Lectures

1. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors
   - Video: [https://www.youtube.com/watch?v=DzpHeXVSC5I](https://www.youtube.com/watch?v=DzpHeXVSC5I)
   - Transcript: [001_DzpHeXVSC5I.md](001_DzpHeXVSC5I.md)
2. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 2 - Word Vectors and Language Models
   - Video: [https://www.youtube.com/watch?v=nBor4jfWetQ](https://www.youtube.com/watch?v=nBor4jfWetQ)
   - Transcript: [002_nBor4jfWetQ.md](002_nBor4jfWetQ.md)
3. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 3 - Backpropagation, Neural Network
   - Video: [https://www.youtube.com/watch?v=HnliVHU2g9U](https://www.youtube.com/watch?v=HnliVHU2g9U)
   - Transcript: [003_HnliVHU2g9U.md](003_HnliVHU2g9U.md)
4. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 4 - Dependency Parsing
   - Video: [https://www.youtube.com/watch?v=KVKvde-_MYc](https://www.youtube.com/watch?v=KVKvde-_MYc)
   - Transcript: [004_KVKvde-_MYc.md](004_KVKvde-_MYc.md)
5. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
   - Video: [https://www.youtube.com/watch?v=fyc0Jzr74y4](https://www.youtube.com/watch?v=fyc0Jzr74y4)
   - Transcript: [005_fyc0Jzr74y4.md](005_fyc0Jzr74y4.md)
6. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 6 - Sequence to Sequence Models
   - Video: [https://www.youtube.com/watch?v=Ba6Fn1-Jsfw](https://www.youtube.com/watch?v=Ba6Fn1-Jsfw)
   - Transcript: [006_Ba6Fn1-Jsfw.md](006_Ba6Fn1-Jsfw.md)
7. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 7 - Attention, Final Projects and LLM Intro
   - Video: [https://www.youtube.com/watch?v=J7ruSOIzhrE](https://www.youtube.com/watch?v=J7ruSOIzhrE)
   - Transcript: [007_J7ruSOIzhrE.md](007_J7ruSOIzhrE.md)
8. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
   - Video: [https://www.youtube.com/watch?v=LWMzyfvuehA](https://www.youtube.com/watch?v=LWMzyfvuehA)
   - Transcript: [008_LWMzyfvuehA.md](008_LWMzyfvuehA.md)
9. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining
   - Video: [https://www.youtube.com/watch?v=DGfCRXuNA2w](https://www.youtube.com/watch?v=DGfCRXuNA2w)
   - Transcript: [009_DGfCRXuNA2w.md](009_DGfCRXuNA2w.md)
10. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation
    - Video: [https://www.youtube.com/watch?v=N9L32bFieEY](https://www.youtube.com/watch?v=N9L32bFieEY)
    - Transcript: [010_N9L32bFieEY.md](010_N9L32bFieEY.md)
11. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 10 - Post-training by Archit Sharma
    - Video: [https://www.youtube.com/watch?v=35X6zlhoCy4](https://www.youtube.com/watch?v=35X6zlhoCy4)
    - Transcript: [011_35X6zlhoCy4.md](011_35X6zlhoCy4.md)
12. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 11 - Benchmarking by Yann Dubois
    - Video: [https://www.youtube.com/watch?v=TO0CqzqiArM](https://www.youtube.com/watch?v=TO0CqzqiArM)
    - Transcript: [012_TO0CqzqiArM.md](012_TO0CqzqiArM.md)
13. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 12 - Efficient Training, Shikhar Murty
    - Video: [https://www.youtube.com/watch?v=UVX7SYGCKkA](https://www.youtube.com/watch?v=UVX7SYGCKkA)
    - Transcript: [013_UVX7SYGCKkA.md](013_UVX7SYGCKkA.md)
14. Stanford CS224N: NLP w/ DL| Spring 2024 | Lecture 13 - Brain-Computer Interfaces, Chaofei Fan
    - Video: [https://www.youtube.com/watch?v=tfVgHsKpRC8](https://www.youtube.com/watch?v=tfVgHsKpRC8)
    - Transcript: [014_tfVgHsKpRC8.md](014_tfVgHsKpRC8.md)
15. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty
    - Video: [https://www.youtube.com/watch?v=I0tj4Y7xaOQ](https://www.youtube.com/watch?v=I0tj4Y7xaOQ)
    - Transcript: [015_I0tj4Y7xaOQ.md](015_I0tj4Y7xaOQ.md)
16. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert
    - Video: [https://www.youtube.com/watch?v=dnF463_Ar9I](https://www.youtube.com/watch?v=dnF463_Ar9I)
    - Transcript: [016_dnF463_Ar9I.md](016_dnF463_Ar9I.md)
17. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 16 - ConvNets and TreeRNNs
    - Video: [https://www.youtube.com/watch?v=S8d-7v3f5MQ](https://www.youtube.com/watch?v=S8d-7v3f5MQ)
    - Transcript: [017_S8d-7v3f5MQ.md](017_S8d-7v3f5MQ.md)
18. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 18 - NLP, Linguistics, Philosophy
    - Video: [https://www.youtube.com/watch?v=NxH0Y78xcF4](https://www.youtube.com/watch?v=NxH0Y78xcF4)
    - Transcript: [018_NxH0Y78xcF4.md](018_NxH0Y78xcF4.md)
19. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
    - Video: [https://www.youtube.com/watch?v=5vfIT5LOkR0](https://www.youtube.com/watch?v=5vfIT5LOkR0)
    - Transcript: [019_5vfIT5LOkR0.md](019_5vfIT5LOkR0.md)
20. Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
    - Video: [https://www.youtube.com/watch?v=cd3pRpEtjLs](https://www.youtube.com/watch?v=cd3pRpEtjLs)
    - Transcript: [020_cd3pRpEtjLs.md](020_cd3pRpEtjLs.md)
21. Stanford CS224N NLP with Deep Learning | 2023 | Python Tutorial, Manasi Sharma
    - Video: [https://www.youtube.com/watch?v=8j4wpU98Q74](https://www.youtube.com/watch?v=8j4wpU98Q74)
    - Transcript: [021_8j4wpU98Q74.md](021_8j4wpU98Q74.md)
Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul - Video: [https://www.youtube.com/watch?v=Uv0AIRr3ptg](https://www.youtube.com/watch?v=Uv0AIRr3ptg) - Transcript: [022_Uv0AIRr3ptg.md](022_Uv0AIRr3ptg.md) 23. Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel - Video: [https://www.youtube.com/watch?v=b80by3Xk_A8](https://www.youtube.com/watch?v=b80by3Xk_A8) - Transcript: [023_b80by3Xk_A8.md](023_b80by3Xk_A8.md)