================================================================================
LECTURE 001
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors
Source: https://www.youtube.com/watch?v=DzpHeXVSC5I

--- Transcript

[00:00:05] So the thing that seems kind of amazing to me, and to us, is the fact that this course was taught just last quarter, and here we are with an enormous number of people again taking this class. I guess that says something; approximately what it says is "ChatGPT." But anyway, it's great to have you all, there's lots of exciting content ahead, and I hope you'll all enjoy it. So let me get started and tell you a bit about the course before diving straight into today's content. For people still coming in, there are oodles of seats still on either side, especially down near the front, so do feel empowered to go out and seek those seats.
[00:01:04] If people on the corridors are really nice, they could even move towards the edges to make it easier for people, but one way or another, feel free to find a seat. Okay, so this is the plan for what I want to get through today. First, I'm going to tell you about the course for a few minutes, then have a few remarks about human language and word meaning. Then the main technical thing we want to get into today is to start learning about the word2vec algorithm. The word2vec algorithm is slightly over a decade old now; it was introduced in 2013, but it was a wildly successful, simple way of learning vector representations of words. So I want to show you that as a sort of first easy baby system for the kind of neural representations that we're going to talk about in this class. We're then going to get more concrete with that, looking
at its objective function, gradients, and optimization, and then hopefully, if all goes well and I stick to schedule, spend a few minutes playing around in an IPython notebook (huh, I'm going to have to change computers for that) and seeing some of the things you can do with this. [00:02:24] Okay, so this is the course logistics in brief. I'm Christopher Manning, hi again everyone. The head TA unfortunately has a bit of a health problem, so he's not actually here today. We've got a course manager for the course, who is up the back there, and then we've got a whole lot of TAs. If you're a TA who's here, you could stand up and wave or something like that so people can see a few of the TAs and see some friendly faces. Okay, we've got some TAs, and some other ones, and you can look at them on the website. If you're here, you know what time the
class is. There's an email list, but preferably don't use it; use the Ed site that you can find on the course website. So the main place to go and look for information is the course website, which we've got up here, and that then links to Ed, which is what we're going to use as the main discussion board. Please use that rather than sending emails. [00:03:28] The first assignment for this class is a sort of easy one, the warm-up assignment, but we want to get people busy and doing stuff straight away, so the first assignment is already live on the web page, and it's due next Tuesday before class. You have slightly less than seven days left to do it, so do get started on that. To help with that, we're starting office hours immediately, tomorrow; they're also described on the website. We also do a few tutorials
on Friday; the first of these tutorials is a tutorial on Python and NumPy. Many people don't need that because they've done other classes and covered this, but we try to make this class accessible to everybody, so if you'd like to brush up a bit on Python or how to use NumPy, it's a great thing to go along to, and the TA right over there is going to be teaching it on Friday. [00:04:30] Okay, what do we hope to teach? At the end of the quarter, when you get the course evaluation, you'll be asked to rate whether this class met its learning goals. These are my learning goals; what are they? The first one is to teach you about the foundations and current methods for using deep learning applied to natural language processing. This class tries to build up from the bottom, so we start off doing simple things like word vectors and feed
forward neural networks, recurrent networks, and attention. We then fairly quickly move into the kind of key methods used for NLP in 2024. I wrote down here Transformers and encoder-decoder models; I probably should have written large language models somewhere in this list as well, but then pre-training and post-training of large language models, adaptation, model interpretability, agents, etc. But that's not the only thing we want to do; there are a couple of other things that we crucially want to achieve. The second is to give you some understanding of human languages and the difficulties in understanding and producing them on computers. Now, there are a few of you in this class who are linguistics majors, or perhaps symbolic systems majors (yay to the symbolic systems majors), but for quite a few of the rest of you, you'll never see any linguistics, in the sense of
understanding how language works, apart from this class. So we do want to try and convey a little bit of a sense of what some of the issues are in language structure, and why it has proven quite difficult to get computers to understand human languages, even though humans seem very good at learning to understand each other. And then the final thing we want to make it on to is actually, concretely, building systems, so that this isn't just a theory class. We actually want you to leave this class thinking: in my first job, wherever I go, whether it's a startup or big tech or some nonprofit, there's something they want to do; it would be useful if we had a text classification system, or if we did information extraction to get some kind of facts out of documents; I know how to build that, I can build that system, because I did CS224N. [00:07:05] Okay, here's how
you get graded. We have four assignments, mostly one and a half weeks long apart from the first one; they make up almost half the grade. The other half of the grade is made up of a final project, which has two variants, a custom or a default final project, which we'll get to in a minute, and then there are a few percent that go for participation. You have six late days. Collaboration policy: like all other CS classes, we've had issues with people not doing their own work. We really do want you to learn things in this class, and the way you do that is by doing your own work, so make sure you understand that. For the assignments, everyone is expected to do their own assignment; you can talk to your friends, but you're expected to do your own work. For the final project, you can work as a group. Then we have the issue of AI tools.
Now, of course, in this class we love large language models, but nevertheless we don't want you to do your assignments by saying, "Hey ChatGPT, could you answer question three for me?" That is not the way to learn things. If you want to make use of AI as a tool to assist you, such as for coding assistance, go for it, but we want you to work out how to answer the assignment questions by yourself. [00:08:40] Okay, so this is what the assignments look like. Assignment one is meant to be an easy on-ramp, and it's done as a Jupyter notebook. Assignment two then has people, well, what can I say, here we are at this fine liberal arts and engineering institution, we're not at a coding boot camp, so we hope that people have some deep understanding of how things work. So in assignment two, we actually want you to do some math and understand how things work
in neural networks. For some people, assignment two is the scariest assignment in the whole class, but it's also the place where we introduce PyTorch, which is the software package we use for building neural networks, and we build a dependency parser, which we'll get to later, as something more linguistic. Then for assignments three and four, we move on to larger projects using PyTorch with GPUs, and we'll be making use of Google Cloud. For those two assignments, we look at doing machine translation and getting information out with Transformers. And then there are the two final project options. Essentially, we have a default final project where we give you a lot of scaffolding and an outline of what to do, but it's still an open-ended project; there are lots of different things you can try to make the system work better, and we
encourage you to explore, but nevertheless you're given a leg up from quite a lot of scaffolding. We'll talk about this more later, but you can either do that option or come up with a project entirely your own and do that. [00:10:32] Okay, that's the course. Any questions on the course? Question: "For the final project, how are mentors assigned?" So, if you can find your own mentor, if you're interested in something and there's someone who's happy to mentor you, that person can be your mentor; otherwise one of the course TAs will be your mentor. As for how that person is assigned: the TA in charge of final projects assigns people, and they do the best they can in terms of matching students with mentors who have some relevant expertise, while dividing all the students across the mentors roughly equally. Any other questions? [00:11:19] Okay, I'll power ahead. Human language and word meaning. So let me just say
a little bit about the big picture here. We're in the area of artificial intelligence, and we've got this idea that humans are intelligent, and then there's the question of how language fits into that. This is something there is some argument about, and if you want, you can run off onto social media, read some of the arguments about these things, and contribute to them if you wish. But here is my, perhaps biased, take as a linguist. Well, you can compare human beings to some of our nearest neighbors, like chimpanzees and bonobos, and one big distinguishing thing is that we have language and they don't. But in most other respects, chimps are very similar to human beings, right? They can use tools, they can plan how to solve things, they've got really good memory;
chimps have better short-term memory than human beings do. So in most respects it's hard to show an intelligence difference between chimps and people, except for the fact that we have language. But our having language has been this enormous differentiator, right? If you look around at what happened on the planet, there are creatures that are stronger than us, faster than us, more venomous than us, with every possible advantage, but human beings took over the whole place. And how did that happen? We had language, so we could communicate, and that communication allowed human ascendancy. So one big role of language is the fact that it allows communication, but I'd like to suggest it's actually not the only role of language: language, I would argue, has also allowed humans to achieve a higher level
of thought. There are various kinds of thoughts that you can have without any language involved. You know, you can think about a scene, you can move some bits of furniture around in your mind, and there's no language; and obviously emotional responses, feeling scared or excited, happen with no language involved. But I think most of the time when we're doing higher-level cognition, if you're thinking to yourself, "Oh gee, my friend seemed upset about what I said last night; I should probably work out how to fix that, or maybe I could blah blah blah," we think in language and plan things out, and so it has given us a scaffolding to do much more detailed thought and planning. [00:14:29] Most recently of all, of course, human beings invented ways to write. Writing is really, really recent. I mean, no one really knows how old human
languages are; most people think a few hundred thousand years, which is not very long by evolutionary timescales. But writing, we do know, is really, really recent: writing is about 5,000 years old. And writing proved to be, again, this amazing cognitive tool that gave humanity an enormous leg up, because suddenly it's not only that you could share information and learn from the people standing within 50 feet of you; you could then share knowledge across time and space. So really, having writing was enough to take us from the Bronze Age, very simple metalworking, to the kind of mobile phones and all the other technology that we walk around with today, in just a very short amount of time. So language is pretty cool. [00:15:41] But one shouldn't only fixate on the sort of knowledge side of language and how that's made
human beings great. There's this other side of language, where language is a very flexible system used as a social tool by human beings, so that we can speak with a lot of imprecision and nuance and emotion and still get people to understand; we can set up new ways of thinking about things by using words for them. And languages aren't static: languages change as human beings use them. Languages aren't something that was delivered down on tablets from God; languages are things that humans constructed, and humans change them with each successive generation. And indeed, most of the innovation in language happens among young people, people a few years younger than most of you are now, in their early teens going into their twenties. That's a big period of linguistic innovation, where people
think up cool new phrases and ways of saying things, and some of those get embedded and extended, and that then becomes the future of the language. Herb Clark, who used to be a psychologist at Stanford and is now retired, had this rather nice quote: "The common misconception is that language use has primarily to do with words and what they mean. It doesn't. It has primarily to do with people and what they mean." [00:17:24] Okay, so that's language in two slides. So now we'll skip ahead to deep learning. In the last decade or so, we've been able to make fantastic progress in doing more with computers understanding human languages, using deep learning. We'll say a bit more about the history later on, but work on trying to do things with human language started in the 1950s, so it had been going for 60 years or so, and, you know, there
[00:17:58] And you know, there was some stuff; it's not that nobody could do anything. But the ability to understand and produce language had always been kind of questionable, and it's really in the last decade, with neural networks, that enormous strides of progress have been made, and that's led into the world that we have today. So one of the first big breakthroughs came in the area of using neural NLP systems for machine translation. This started about 2014 and was already deployed live on services like Google by 2016; it was so good that it saw really, really rapid commercial deployment. And overall, this kind of facility with machine translation just means that you're growing up in such a different world to people a few generations back.
[00:19:06] People a few generations back, unless you actually knew the different languages of different people, sort of had no chance to communicate with them, whereas now we're very close to having something like the Babel fish from The Hitchhiker's Guide to the Galaxy for understanding all languages. It's not a Babel fish, it's a cell phone, but you can have it out between two people and have it do simultaneous translation. And you know, it's not perfect, people keep on doing research on this, but by and large it means you can pick anything up from different areas of the world. As you can see, this example is from a couple of years ago, since it's still from the COVID pandemic era, but I can see this Swahili from Kenya and say, oh gee, I wonder what that means, stick it into Google Translate, and learn that Malawi lost two ministers due to COVID infections, and they died.
[00:20:06] Right, so we're just in this different era of being able to understand stuff. And then there are lots of other things that we can do with modern NLP. Until a few years ago, we had web search engines: you put in some text (you could write it as a sentence if you wanted to, but it didn't really matter whether you wrote a sentence or not), because what you got was some keywords that were then matched against an index, and you were shown some pages that might have the answers to your questions. But these days, you can put an actual question into a modern search engine, like "when did Kendrick Lamar's first album come out?". It can go and find documents that have relevant information, it can read those documents, and it can give you an answer, so that it actually becomes an answer engine, rather than just something that finds documents that might be relevant to what you're interested in.
[00:21:02] And the way that's done is with big neural networks. So you might commonly have, for your query, a retrieval neural network which can find passages that are similar to the query; those might then be reranked by a second neural network; and then there'll be a third, reading neural network that'll read those passages and synthesize information from them, which it then returns as the answer. Okay, that gets to about 2018. But then things got more advanced again. It was really around 2019 that people started to see the power of large language models, and back in 2019 those of us in NLP were really excited about GPT-2. It didn't make much of an impact on the nightly news, but it was really exciting in NLP land, because GPT-2, for the first time, meant here was a large language model that could just generate fluent text.
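The three-stage search pipeline described a moment ago (retrieve, rerank, read) can be sketched in miniature. In a real system each stage is a neural network; here each stage is a crude keyword heuristic standing in for one, and the function names, passages, and heuristics are all invented for illustration, just to show how data flows between the stages.

```python
# Toy sketch of a retrieve / rerank / read search pipeline. Real systems
# use three neural networks; each stage here is a simple stand-in.

def retrieve(query, passages, k=3):
    # Stage 1: score every passage by word overlap with the query.
    q = set(query.lower().split())
    return sorted(passages,
                  key=lambda p: -len(q & set(p.lower().split())))[:k]

def rerank(candidates):
    # Stage 2: re-order candidates; as a crude proxy for "answer-bearing",
    # bump passages containing a number to the front (stable sort).
    return sorted(candidates,
                  key=lambda p: not any(ch.isdigit() for ch in p))

def read(candidates):
    # Stage 3: "read" the top passage and return an answer (here, verbatim;
    # a real reader network would synthesize text across passages).
    return candidates[0] if candidates else ""

passages = [
    "Kendrick Lamar's first album Section.80 came out in 2011.",
    "The Golden Gate Bridge opened in 1937.",
    "Kendrick Lamar is a rapper from Compton, California.",
]
query = "when did Kendrick Lamar's first album come out"
print(read(rerank(retrieve(query, passages))))
```

The point is only the shape of the pipeline: each stage narrows or reorders what the previous one produced, and only the last stage produces the answer text.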
[00:22:08] Really, until then, NLP systems had done sort of a decent job at understanding certain facts out of text, but we'd just never been able to generate fluent text that was at all good. Whereas here, what you could do with GPT-2 is write something like the start of a story: "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown." And then GPT-2 would just write a continuation: "The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the US Department of Energy said it is working with the Federal Railroad Administration to find the thief..." And so the way this is working is it's conditioning on all the past material and, as I show in the very bottom line down here, it's generating one word at a time: whatever word it thinks would be likely to come next.
[00:23:07] And so from that simple method of generating words one after another, it's able to produce excellent text. And the thing to notice is that this text is not only formally correct, you know, the spelling is correct and the sentences are real sentences, not disconnected garbage, but it actually understands a lot. The prompt that was written said there were stolen nuclear materials in Cincinnati, but GPT-2 knows a lot of stuff: it knows that Cincinnati is in Ohio; it knows that in the United States it's the Department of Energy that regulates nuclear materials; it knows that if something is stolen, it's a theft, and that it would make sense that people are getting involved with that; it talks about, you know, there's a train carriage, so it's talking about the train line where it goes.
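The word-at-a-time generation loop just described can be sketched in a few lines. GPT-2's next-word distribution comes from a large neural network; as a stand-in, this sketch uses a tiny hand-made bigram table (all entries invented), but the loop itself (condition on the text so far, pick a likely next word, append it, repeat) has the same shape.

```python
import random

# Minimal sketch of autoregressive generation: repeatedly pick a likely
# next word given the text so far and append it. The "model" here is a
# toy bigram table standing in for a neural next-word distribution.
NEXT = {
    "the": ["incident", "train"],
    "incident": ["occurred"],
    "occurred": ["on"],
    "on": ["the"],
    "train": ["line"],
}

def generate(prompt, max_steps=8, seed=0):
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_steps):
        options = NEXT.get(words[-1])
        if not options:                        # no known continuation: stop
            break
        words.append(rng.choice(options))      # sample one next word
    return " ".join(words)

print(generate("the"))
```

A real model conditions on the whole history (not just the last word) and scores every word in the vocabulary, but the outer loop is exactly this.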
[00:24:10] It really knows a lot, and can write, you know, coherent discourse, like a real story. So that's kind of amazing. But things moved on from there, and now we're in the world of ChatGPT and GPT-4, and one of the things that we will talk about later is that this was a huge, huge user success, because now you could ask questions or give it commands and it would do what you wanted, and that was further amazing. So here I'm saying: "Hey, please draft a polite email to my boss Jeremy that I would not be able to come into the office for the next two days because my 9-year-old song" (that's a misspelling of "son", but the system works fine despite it) "Peter is angry with me that I'm not giving him much time." And it writes a nice email. It corrects the spelling mistake, because it knows people make spelling mistakes; it doesn't talk about songs; and everything works out beautifully.
[00:25:15] You can get it to do other things. So you can ask it, "What is unusual about this image?" In thinking about meaning, one of the things that's interesting with these recent models is that they're multimodal and can operate across modes. And so a favorite term that we coined at Stanford is the term "foundation models", which we use as a generalization of large language models: having the same kind of technology used across different modalities, images, sound, various kinds of bioinformatic things (DNA, RNA, things like that), seismic waves, any kind of signal, building these same kinds of large models. Another place that you can see that is going from text to images. So if I ask for a picture of a train going over the Golden Gate Bridge (this is now DALL-E 2), it gives me a picture of a train going over the Golden Gate Bridge.
[00:26:23] This is a perfect time to welcome anyone who's watching this on Stanford Online. If you're on Stanford Online and are not in the Bay Area, the important thing to know is that no trains go over the Golden Gate Bridge. But you might not be completely happy with this picture, because, you know, it shows the Golden Gate Bridge and a train going over it, but it doesn't show the bay. So maybe I'd like to get it with the bay in the background, and if I ask for that, well, look, now I've got a train going over the Golden Gate Bridge with the bay in the background. But this still might not be exactly what you want; maybe you'd prefer something that's a pencil drawing. So I can say "a train going over the Golden Gate Bridge, detailed pencil drawing", and I can get a pencil drawing. Or maybe it's unrealistic that the Golden Gate Bridge only has trains going over it.
[00:27:17] So maybe it'd be good to have some cars as well. So I could ask for a train and cars, and we get a train and cars going over it. Now, I actually made these ones all by myself, so you should be impressed with my generative AI artwork. But these examples are actually a bit old now, because they're done with DALL-E 2, and if you keep up with these things, that's a few years ago; there's now DALL-E 3 and so on, so we can now get much fancier things again. Right: "An illustration from a graphic novel. A bustling city street under the shine of a full moon, the sidewalks bustling with pedestrians enjoying the nightlife. At the corner stall, a young woman with fiery red hair, dressed in a signature velvet cloak, is haggling with the grumpy old vendor. The grumpy vendor, a tall, sophisticated man wearing a sharp suit, sporting a noteworthy mustache, is animatedly conversing on his steampunk telephone."
[00:28:12] And pretty much, we're getting all of that. Okay, so let's now get on to starting to think more about meaning. So, what can we do for meaning? If you think of words and their meaning: if you look up a dictionary and ask what "meaning" means, meaning is defined as the idea that is represented by a word or phrase; the idea that a person wants to express by using words; the idea that is expressed. And in linguistics, you know, if you go and do a semantics class or something, the commonest way of thinking of meaning is somewhat like what's presented up above there: meaning is thought of as a pairing between what's sometimes called a signifier and a signified, but it's perhaps easier to think of it as a symbol (a word) and then an idea or thing. And so this notion is referred to as denotational semantics.
[00:29:20] So the idea or thing is the denotation of the symbol, and this same idea of denotational semantics has also been used for programming languages, because in programming languages you have symbols, like while and if and variables, and they have a meaning, and that could be their denotation. So we would sort of say that the meaning of "tree" is all the trees you can find around the world. That's a sort of okay notion of meaning, and a popular one, but it's never been very obvious, or at least traditionally it wasn't very obvious, what we could do with that to get it into computers. So if you look at the pre-neural world, when people tried to handle meanings inside computers, they sort of had to do something much more primitive: looking at words and their relationships. So a very common traditional solution was to make use of WordNet.
[00:30:20] WordNet was kind of a fancy thesaurus that showed word relations, so it'd tell you about synonyms and "is-a-kind-of" things. So a panda is a kind of carnivore, which is a placental, which is a mammal, and things like that. "Good" has various meanings (it's a trade good, or the sense of goodness), and you could explore with that. But systems like WordNet were never very good for computational meaning. They missed a lot of nuance: WordNet would tell you that "proficient" is a synonym for "good", but if you think about all the things that you would say were good (you know, "that was a good shot"), would you say "that was a proficient shot"? Sounds kind of weird to me. You know, there's a lot of color and nuance in how words are used. WordNet is also very incomplete: it's missing anything that's kind of cooler, more modern slang (this maybe isn't very modern slang now, but you won't find more modern slang in it either).
[00:31:20] It's sort of very human-made, etc.; it's got a lot of issues. So this led into the idea of: can we represent meaning differently? And this leads us into word vectors. So when we have words, "wicked", "badass", "nifty", "wizard", what do they turn into when we have computers? Well, effectively, words are these discrete symbols; they're just some kind of atom or symbol. And if we then turn those into something that's closer to math, how symbols are normally represented is that you have a vocabulary, and your word is some item in that vocabulary. So "motel" is this word in the vocabulary and "hotel" is that word in the vocabulary, and commonly this is what computational systems do: you take all your strings and you index them to numbers, and that's the sort of position in a vector that they belong in.
[00:32:30] And, well, we have huge numbers of words, so we might have a huge vocabulary, so we'll have very big, long vectors, and these get referred to as one-hot vectors for representing the meaning of words. But representing words by one-hot vectors turns out not to be a very good way of computing with them. It was used for decades, but it turns out to be kind of problematic, and part of why it's problematic is that it doesn't have any natural, inherent sense of the meanings of words. You just have different words: you have hotel and motel and house and chair. And if you think about it in terms of these vector representations, if you have "motel" and "hotel", there's no indication that they're kind of similar; they're just two different symbols which have ones in different positions in the vector. Or, formally, in math terms, if you think about taking the dot product of these two vectors: zero.
[00:33:35] The two vectors are orthogonal; they have nothing to do with each other. Now, there are things that you can do with that. You can start saying, oh, let me start building up some other resource of word similarity, and I'll consult that resource of word similarity, and it'll tell me that motels and hotels are similar to each other. And people did things like that; in web search it was referred to as query expansion techniques. But still, the point is that there's no natural notion of similarity in one-hot vectors. And so the idea was that maybe we could do better than that: we could learn to include similarity in the vectors themselves. And that leads into the idea of word vectors, but it also leads into a different way of thinking about semantics. I just realized I forgot to say one thing, back two slides: these kinds of representations are referred to as localist representations.
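The one-hot picture is easy to make concrete. A minimal sketch (the four-word vocabulary is made up for illustration): each word owns one position in a vector as long as the vocabulary, and any two distinct words come out orthogonal.

```python
# One-hot ("localist") vectors: each word owns one position in a vector
# as long as the vocabulary. Any two distinct words then have dot product
# 0, i.e. they are orthogonal, so "motel" looks no closer to "hotel" than
# it does to "chair".
vocab = ["motel", "hotel", "house", "chair"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(one_hot("motel"))                         # [1.0, 0.0, 0.0, 0.0]
print(dot(one_hot("motel"), one_hot("hotel")))  # 0.0
```

With a real half-million-word vocabulary the vectors would simply have half a million entries, still with a single 1.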
[00:34:38] Meaning that there's one point at which something is represented: you've got, here is the representation of "motel", and here is the representation of "hotel"; it's in one place in the vector that each word is represented. And that'll be different to what we do next. So there's an alternative idea of semantics, which goes back quite a long way. People commonly quote this quote of J.R. Firth, who was a British linguist, who said in 1957, "You shall know a word by the company it keeps." But it also goes back to philosophical work by Wittgenstein and others: that what you should do is represent a word's meaning by the contexts in which it appears. So the words that appear around the word give information about its meaning, and that's the idea of what's called distributional semantics, in contrast to denotational semantics.
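The distributional idea can be sketched directly: gather the words that occur within a small window around each occurrence of a target word. These co-occurrence statistics are the raw material that word vectors are learned from. (The window size and the second example sentence below are made up for illustration.)

```python
from collections import Counter

# Sketch of the distributional idea: represent a word by the words that
# appear around it. Here we just count context words in a fixed-size
# window around each occurrence of the target word.
def context_counts(target, sentences, window=2):
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                left = words[max(0, i - window):i]
                right = words[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

sentences = [
    "government debt problems turning into banking crises as happened in 2009",
    "unified regulation of banking reform is debated across Europe",
]
print(context_counts("banking", sentences))
```

With a large corpus, words like "crises", "regulation", and "monetary" would dominate the counts for "banking", and that profile of company kept is what stands in for its meaning.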
[00:35:43] sentences that use the word banking. Here are some sentences using the word banking: "government debt problems turning into banking crises as happened in 2009," etc., etc. And knowing about that context, the words that occur around banking, those will become the meaning of banking. And so we're going to use those statistics about words and what other words appear around them in order to learn a new kind of representation of a word. So our new representation of words is: we're going to represent them now as a dense, sort of shorter, dense vector that gives the meaning of the words. Now, my vectors are very short here; these are only eight-dimensional, if I counted right, so I could fit them on my slide. They're not that short in practice; they might be 200 to 2,000, but reasonably short; they're not going to be like the half a million different words in our
[00:36:50] vocabulary. And the idea is, if words have stuff to do with each other, they'll have sort of similar vectors, which corresponds to their dot product being large. So for banking and monetary in my example here, both of them are positive in the first dimension, positive in the second dimension, negative on the third; on the fourth they've got opposite signs. So if we want to work out the dot product, we're taking the product of the corresponding terms, and it'll get bigger to the extent that both of the corresponding components have the same sign, and bigger if they have large magnitude. Okay, so these are what we call word vectors, which are also known as embeddings or neural word representations, or phrases like that. And so the first thing we want to do is learn good word vectors for different words, and our word vectors will be good word vectors if they give us a good sense of the meanings of words.
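A quick sketch of that dot-product intuition (these 8-dimensional values are invented for illustration, not the actual slide numbers): components with matching signs push the dot product up, mismatched signs push it down.

```python
import numpy as np

# Invented short dense vectors, in the spirit of the slide example
banking  = np.array([ 0.29,  0.28, -0.41,  0.03,  0.11, -0.27, -0.55,  0.19])
monetary = np.array([ 0.41,  0.08, -0.18, -0.30,  0.01, -0.12, -0.44,  0.21])
skillet  = np.array([-0.38,  0.07,  0.32,  0.22, -0.29,  0.41,  0.09, -0.33])

print(banking @ monetary)  # relatively large: signs mostly agree
print(banking @ skillet)   # small (negative here): signs mostly disagree
```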
[00:37:52] They know which words are similar to other words in meaning. We refer to them as embeddings because we can think of this as a vector in a high-dimensional space, so that we're embedding each word as a position in that high-dimensional space, and the dimensionality of the space will be the length of the vector, so it might be something like a 300-dimensional space. Now, that kind of gets problematic, because human beings can't look at 300-dimensional spaces and aren't very good at understanding or visualizing what goes on in them, so the only thing that I can show you is two-dimensional spaces. But a thing that is good to have somewhat in your head is that really high-dimensional spaces behave extremely differently from two-dimensional spaces. In a two-dimensional space, you're
[00:39:02] only near to something else if you've got similar x and y coordinates; in a high-dimensional space, things can be very near to all sorts of things on different dimensions in the space, and so we can capture different senses of words and ways that words are similar to each other. But here's the kind of picture we end up with. So what we're going to do is learn a way to represent all words as vectors based on the other words that they appear with in context, and we can embed them into this vector space. And of course you can't read anything there, but you know, we can zoom into this space further, and if we zoom into this space and just show a bit of it, well, here's a part of the space where it's showing country words and some other location words. So we've got countries up the top; there we've got some nationality terms: British, Australian, American, European. And further
[00:40:02] down, or we can go to another piece of the space, and here's a bit of the space where we have verbs. And not only have we got verbs, but you know, there's actually quite a lot of fine structure here of what's similar that represents things about verbs. So you've got sort of verbs of communication: statements, saying, thinking, expecting, grouping together; come and go group together; down the bottom you've got forms of the verb have; then you've got forms of the verb to be above them; you've got become and remain, which are actually sort of similar to the verb to be, because they take these sorts of complements of state. So just as you can say "I am angry," you can say "he remained angry" or "he became angry," right? So those verbs are, more so than most verbs, sort of similar to the verb to be. So we get these kinds of
[00:41:00] interesting semantic spaces where things that have similar meaning are close by to each other. And so the question is how do we get to those things, and there are various ways of doing it, but the one I want to get through today is showing you word2vec. Okay, I'll pause for 30 seconds for breath. Anyone have a question or anything they want to know? Yes? [Student] But it doesn't solve the problem where similar meanings might depend on context, right? So, to take your example: those two words have their own vectors, and we understand similarity from those vectors, but it's context-dependent, because in different contexts those two are similar or not, and this algorithm does not capture that. [Manning] Yes, correct. So that's a good thought; you can keep it for a few weeks, to some extent. Yeah, so for the first thing we're going to do, we're just
[00:42:16] going to learn one word vector for a string. So we're going to have a word, let's say it's star, and we're going to learn one word vector for it, so that absolutely doesn't capture the meaning of a word in context: it won't be saying whether it's meaning a Hollywood star or an astronomical star or something like that. And so later on we're going to get on to contextual meaning representations, so wait for that. But, going along with what I said about high-dimensional spaces being weird, the cool thing that we will already find is that our representation for star will be very close to the representations for astronomical words like nebula and every other astronomical word you know, and simultaneously it'll be very close to words that mean something like a Hollywood star. Help me out, any words that mean something similar? Celebrity,
[00:43:22] that's a good one, okay. Yeah? [Student] How are you reducing the embedding to a lower-dimensional space to be visualized? [Manning] So the pictures I was showing you used a particular method called t-SNE, which is a nonlinear dimensionality reduction that tends to work better for high-dimensional neural representations than PCA, which you might know, but I'm not going to go into that now. Yes? [Student] How do you know the dimension [inaudible] but not too [inaudible]? [Manning] I mean, that's something that people have worked on. It depends on how much data you've got to make your representations over, you know. So normally it's worked out either empirically for what works best, or practically based on how big vectors you want to work with. I mean, to give you some idea, things start to work well when you get to a 100-dimensional space. For a long time people used 300 dimensions because that seemed to work pretty well.
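A minimal sketch of the kind of t-SNE projection being described, using scikit-learn; random vectors stand in for real word embeddings here, and the sizes and perplexity are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 300))  # 50 stand-in "word vectors", 300-d

# Nonlinear reduction from 300-d down to 2-d for plotting
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)
print(points.shape)  # (50, 2): one 2-d point per word, ready to scatter-plot
```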
[00:44:34] But as people have started building huger and huger models with way, way more data, it's now become increasingly common to use numbers like 1,000- or even 2,000-dimensional vectors. Yeah? [Student] Okay, so you mentioned that there's sort of hidden structure in small areas as well as large areas of the embedding, and in different pieces, different structures will come up. But generally we seem to use distance as the single metric for closeness, which doesn't seem sufficient to me, like the distance between this and that in the space will be the same, right? So how would that work? [Manning] We don't only use distance; we also use directions in the space as having semantic meanings, and I'll show an example of that soon. Yeah? [Student] The entries, they seem to be between -1 and 1. Is there a reason for that, or do we have bounds that we impose on them? [Manning] So, good question. I mean, you know, they don't have to be,
[00:45:40] and the way we're going to learn them, they're not bounded. But you know, you can bound things; sometimes people length-normalize so that the vectors are of length one. But at any rate, normally in this work we use some method called regularization that tries to kind of keep coefficients small, so they're generally not getting huge. Yeah? [Student] For a specific word, for example like bank that we used before in the previous slides, for the word representation is there a single embedding for each word, or do we have multiple embeddings for each word? [Manning] For what we're doing at the moment, each word, each string of letters, has a single embedding, and what you can think of that embedding as is kind of an average over all its senses. So for example, bank can mean the financial institution, or it can also mean the river bank.
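The length normalization mentioned in passing is just dividing a vector by its L2 norm, for example:

```python
import numpy as np

v = np.array([3.0, 4.0])           # example vector, length 5
v_unit = v / np.linalg.norm(v)     # length-normalize to a unit vector
print(v_unit)                      # [0.6 0.8]
print(np.linalg.norm(v_unit))      # 1.0 (up to floating point)
```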
[00:46:46] And then what I said before about star applies. The interesting thing is you'll find that we're able to come up with a representation where our learned representation, because it's kind of an average of those senses, will end up similar to words that are semantically evoked by both senses. I think I should probably go on at this point. Okay, word2vec. So word2vec was this method of learning word vectors that was thought up by Tomas Mikolov and colleagues at Google in 2013. You know, it wasn't the first method; there are other people that did methods of learning word vectors that go back to about the turn of the millennium. It wasn't the last; there are ones that come after it as well. But it was a particularly simple one and a particularly, you know, fast-running one, and so it really caught people's attention. So the idea of it is that we start off with a large
[00:47:52] amount of text, so that can just be thought of as a long list of words, and in NLP we refer to that as a corpus. Corpus is just Latin for body, so you know, it's exactly the same as if you have a dead person on the floor, right? That's a corpus. So it's just a body, but we mean a body of text, not a live person, oh sorry, a dead person. Yeah. If you want to know more about Latin, since there isn't very good classical education these days: corpus, despite the -us ending, is a third-declension neuter noun, and that means the plural of corpus is not "corpi"; the plural of corpus is corpora. So I'm sure sometime later in this class I will read a project or assignment that refers to "corpi," and I will know that that person was not paying attention in the first lecture, or else they would have said corpora. C-o-r-p-o-r-a is the correct form
[00:49:05] for that. Okay, I should move on. Okay, so we have our text. Then we know that we're going to represent each word, and this is each word type, so you know, star or bank, etc., wherever it occurs, by a single vector. And so what we're going to do in this algorithm is we're going to go through each position in the text, and at each position in the text, which is a list of words, we're going to have a center word and words outside it. And then what we're going to do is use the similarity of the word vectors for the center word and the outside words to calculate the probability that they should have occurred or not, and then we just keep fiddling and we learn the word vectors. Maybe I'll just show this more concretely first. So here's the idea: we're going to have a vector for each word type. So a word type means, you know, the
[00:50:08] word problems wherever it occurs, which is differentiated from a word token, which is this instance of the word problems. So we're going to have a vector for each word type, and so I'm going to want to know: look, in this text, the word turning occurred before the word into; how likely should that have been to happen? And what I'm going to do is calculate a probability of the word turning occurring close to the word into, and I'm going to do that for each word in a narrow context. In the example here, I'm using two words to the left and two words to the right, and what I want to do is make those probability estimates as good as possible. So in particular, I want the probability of co-occurrence to be high for words that actually do occur within the nearby context of each other. And so then the question is how am I going to, oh, and once I've done it for that word, I'm
[00:51:10] going to go along and do exactly the same thing for the next word, and so I can continue through the text in that way. And so what we want to do is come up with vector representations of words that will let us predict these probabilities, quote-unquote, well. Now, you know, there's a huge limit to how well we can do it, because we've got a simple model; obviously when you see the word banking, I can't tell you that the word into is going to occur before banking, but I want to do it as well as possible. So what I want my model to say is: after the word banking, crisis is pretty likely, but the word skillet is not very likely. And if I can do that, I'm doing a good job. And so we turn that into a piece of math; here's how we do it. So we're going to go through our corpus, every position in the corpus.
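The pass being described, taking each position in turn as the center word with up to m words on each side, can be sketched like this (window size m = 2 and the text fragment are the lecture's running example):

```python
# Slide a center word and its context window through a short text,
# as word2vec does during training (window size m = 2 here).
text = "government debt problems turning into banking crises".split()
m = 2

windows = []
for t, center in enumerate(text):
    # Context words: up to m on each side, skipping the center itself
    context = [text[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(text)]
    windows.append((center, context))
    print(f"{center:>10} -> {context}")
```

At the position of "into," for instance, the context is ['problems', 'turning', 'banking', 'crises'], and those are the co-occurrences whose probability the model is asked to make high.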
[00:52:23] We're going to have a fixed window size m, which was two in my example, and then what you're going to want to do is have the probability of words in the context being as high as possible. So we want to maximize this likelihood, where we're going through every position in the text, and then we're going through every word in the context, and sort of wanting to make this big. Okay, so conceptually that's what we're doing, but in practice we never quite do that; we use two little tricks here. The first one is, you know, for completely arbitrary reasons, it really makes no difference, everyone got into minimizing things rather than maximizing things, and so the algorithms that we use get referred to as gradient descent, as you'll see in a moment. So the first thing we do is put a minus sign in front so that we can minimize it rather than maximize it.
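Written out, the likelihood being maximized is the following (notation assumed from the standard course slides, not shown in this transcript: $T$ positions in the corpus, window size $m$, center word $w_t$, parameters $\theta$):

```latex
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)
```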
[00:53:27] That part's pretty trivial. But the second part is: here we have this enormous product, and working with enormous products is more difficult for the math. So the second thing that we do is introduce a logarithm, and once we take the log of the likelihood, then, when we take logs of products, they turn into sums. And so now we can sum over each word position in the text, sum over each word in the context window, and then sum these log probabilities, and then we've still got the minus sign in front, so we want to minimize the sum of log probabilities. So what we're doing is then wanting to look at the negative log likelihood. And then the final thing that we do is, since this will get bigger depending on the number of words in the corpus, we divide through by the number of words in the corpus, and so our objective function is the average negative log likelihood.
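With the minus sign, the logarithm, and the division by the corpus length $T$, the average negative log likelihood being described is (same assumed notation: window size $m$, center word $w_t$, parameters $\theta$):

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)
```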
So by minimizing this objective function, [00:54:36] we're maximizing the probability of words in the context. Okay, we're almost there; that's what we want to do, but we've got a couple more tricks to get through. The next one is: well, I've said we want to maximize this probability, but how do we maximize it? What is this probability? We haven't defined how we're going to calculate it, and this is where the word vectors come in. We're going to define this probability in terms of the word vectors. So we're going to say each word type is represented by a vector of real numbers (here, 100 real numbers), and we're going to have a formula that works out the probability simply in terms of the vectors for each word; there are no other parameters in this model. So over here I've shown this theta, which
are the parameters of our model, [00:55:45] and all and only the parameters of our model are these word vectors, one for each word in the vocabulary. That's a lot of parameters, because we have a lot of words and we've got fairly big word vectors, but they are the only parameters. Okay, and how we do that is by using this little trick here: we're going to say that the probability of an outside word given a center word is defined in terms of the dot product of the two word vectors. So if things have a high dot product, they'll be similar, and therefore they'll have a high probability of co-occurrence; where I mean similar in a kind of a weird sense, right? It is the case that we're going to want to say hotel and motel are similar, but it's also the case that we're going to want the word "the" to be able to appear easily before the word "student", so in some weird sense "the" also has to be similar to "student".
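The "high dot product means similar" idea can be sketched with toy numbers. These 4-dimensional vectors are made up purely for illustration (real word2vec vectors are learned, and the lecture's are 100-dimensional):

```python
import numpy as np

# Made-up toy vectors, just to show the dot product acting as a similarity score.
vec = {
    "hotel": np.array([0.9, 0.1, 0.3, -0.2]),
    "motel": np.array([0.8, 0.2, 0.25, -0.1]),
    "zebra": np.array([-0.5, 0.7, -0.6, 0.4]),
}

def dot(a, b):
    # Multiply the two vectors componentwise, then sum: the dot product.
    return float(np.sum(vec[a] * vec[b]))

print(dot("hotel", "motel"))  # larger -> treated as similar
print(dot("hotel", "zebra"))  # smaller (negative here) -> dissimilar
```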
That is, it has to be similar to basically any noun, right? [00:56:44] Okay, so we're going to work with dot products, and then we do this funky little bit of math here, and that will give us our probabilities. So let's just go through the funky bit of math. Here's our formula for the probabilities. What we're doing is starting off with this dot product. The dot product is: you take the two vectors, multiply each component together, and sum them up. So if two components have the same sign, that increases your dot product, and if they're both big, it increases it a lot. Okay, so that gives us a similarity between two vectors, and that's unbounded; it's just a real number, and it can be either negative or positive. But what we'd like to get out is a probability. So for our next trick, we first of all exponentiate, because if we take e to the x for any x, we
now have to get something positive out, right? That's what exponentiation does. [00:57:50] And then, since it's meant to be a probability, we'd like it to be between 0 and 1, so we turn it into numbers between 0 and 1 in the dumbest way possible, which is that we just normalize: we work out the quantity in the numerator for every possible context word, we get the total of all of those numbers, and we divide through by it. And then we're getting a probability distribution of how likely different words are in this context. Okay, so this little trick that we're doing here is referred to as the softmax function. For the softmax function, you can take unbounded real numbers, put them through this little softmax trick that we just went through the steps of, and what you'll get out is a probability distribution.
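The exponentiate-then-normalize recipe just described can be sketched in a few lines of NumPy; the tiny vocabulary size, dimensionality, and random vectors below are made-up stand-ins for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                    # made-up tiny vocabulary size and vector dimension
U = rng.normal(size=(V, d))    # an "outside" vector u_w for every word w
v_c = rng.normal(size=d)       # the center word's vector

# P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
scores = U @ v_c               # dot product of v_c with every outside vector
probs = np.exp(scores) / np.exp(scores).sum()

print(probs)        # V positive numbers...
print(probs.sum())  # ...that sum to 1: a probability distribution
```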
So I'm now getting, in this example, a probability distribution over context words: [00:58:57] my probability estimates over all the context words in my vocabulary will sum up to one, by definition, by the way that I've constructed this. It's called the softmax function because it amplifies the probabilities of the largest things (that's because of the exp), but it's "soft" because it still assigns some probability to smaller items. It's sort of a funny name, because max normally picks out just one thing, whereas the softmax is turning a bunch of real numbers into a probability distribution. This softmax is used everywhere in deep learning: any time we want to turn things that are just vectors in R^n into probabilities, we shove them through a softmax function. Okay.
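As a standalone function, a softmax might look like this; the max-subtraction is a standard numerical-stability trick that the lecture doesn't mention, and it doesn't change the answer because the shift cancels in the ratio:

```python
import numpy as np

def softmax(x):
    # Turns any vector in R^n into a probability distribution.
    e = np.exp(x - np.max(x))  # shift for numerical stability; result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1, -3.0]))
print(p)        # the largest input gets the largest share, smaller ones keep some mass
print(p.sum())  # the whole thing sums to 1
```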
So in some sense this part, I think, still seems very abstract, [01:00:10] and the reason it seems very abstract is that I've said we have vectors for each word, and using these vectors we can calculate probabilities; but where do the vectors come from? And the answer is that we're going to turn this into an optimization problem. We have a large amount of text, and so we can hope to find word vectors that make the probabilities of the contexts of the words in our observed text as big as possible. So literally what we're going to do is start off with random vectors for every word, and then we want to fiddle those vectors so that the calculated probabilities of words in a context go up, and we're going to keep fiddling until they stop going up anymore and we're getting the highest probability estimates that
we can. [01:01:23] And the way that we do that fiddling is that we use calculus. What we're going to do is conceptually exactly what you'd do in something like a two-dimensional space, like the picture on the right: if you want to find the minimum in this two-dimensional space and you start off at the top left, you can say, let me work out the derivatives of the function at the top left, and they point sort of down and a bit to the right, so you can walk down and a bit to the right. Then you can say, given where I am now, let me work out the derivatives again; what direction do they point? They're still pointing down, but a bit more to the right, so you can walk a bit further that way, and you can keep on walking, and eventually you'll make it to the minimum of the space. In our case, we've got a lot more than two dimensions.
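The walking-downhill picture can be sketched as plain gradient descent on a made-up two-dimensional bowl (not the word2vec objective itself, just the 2-D analogy):

```python
# f(x, y) = x**2 + 2*y**2 has its minimum at (0, 0).
def grad(x, y):
    # Analytic partial derivatives of f.
    return 2.0 * x, 4.0 * y

x, y = -4.0, 3.0    # start somewhere up the side of the bowl
lr = 0.1            # step size
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx    # step against the gradient: downhill
    y -= lr * gy    # then repeat from the new position

print(x, y)         # both end up very close to (0, 0)
```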
So our parameters for our model are the concatenation of all the word vectors. [01:02:21] But it's even slightly worse than I've explained up until now, because actually, for each word, we assume two vectors: one vector when it's the center word and one vector when it's the outside word. Doing that just makes the math a bit simpler, which I can explain later. So if we had, say, 100-dimensional vectors, we'd have 100 parameters for aardvark as an outside word, 100 parameters for a as an outside word, all the way through to 100 parameters for zebra as an outside word; then we'd have 100 parameters for aardvark as a center word, continuing down. So if we had a vocabulary of 400,000 words and 100-dimensional word vectors, that means we'd have 400,000 × 2 = 800,000 vectors, each with 100 numbers: 80 million parameters. So that's a lot of parameters in our
space to try and fiddle with to optimize things, but luckily we have big computers, [01:03:30] and that's the kind of thing that we do. So we simply say: this is our optimization problem, we're going to compute the gradients with respect to all of these parameters, and that will give us the answer. And you know, this feels like magic. It doesn't really seem like we could start with nothing, just random word vectors and a pile of text, and say, do some math, and we will get something useful out. But the miracle of what happens in these deep learning spaces is that we do get something useful out: we can just minimize over all of the parameters, and then we'll get something useful out. So, I guess I'm not going to quite get to the end of what I'd hoped to today, but what I wanted to do is get through some of what we do here. But
you know, [01:04:45] I wanted to take a few minutes to go through concretely how we do the math of the minimization. Now, lots of different people take CS224N, and some of you know way more math than I do, so this next 10 minutes might be extremely boring; if that's the case, you can either catch up on Discord or Instagram or something, or else you can leave. But it turns out there are other people who take CS224N who can't quite remember when they last did a math course, and we'd like everybody to be able to learn something about this, so I do actually like, in the first two weeks, to go through it a bit concretely. So let's try to do this. This was our likelihood, and we'd already covered the fact that what we were going to do is have an objective function, in terms of our parameters, that was the negative,
average negative log likelihood [01:05:51] the average negative log likelihood across all the [01:05:52] across all the words [01:05:54] words um if I remember the notation for this [01:05:58] um if I remember the notation for this the [01:05:59] the sum um in this [01:06:03] sum um in this oops um I'll probably have a hard time [01:06:05] oops um I'll probably have a hard time writing this um the sum of position M [01:06:11] writing this um the sum of position M I've got a more neatly written out [01:06:13] I've got a more neatly written out version of it that appears on the [01:06:14] version of it that appears on the version of the slides it's on the [01:06:17] version of the slides it's on the webiz um and then we're going to be [01:06:19] webiz um and then we're going to be taking this [01:06:21] taking this log of the probability of the word [01:06:26] log of the probability of the word at [01:06:27] at position um t [01:06:32] plus sorry position J um t + [01:06:38] plus sorry position J um t + [Music] [01:06:39] [Music] J [01:06:41] J okay trying to write this on my iPad is [01:06:44] okay trying to write this on my iPad is not working super well I'll confess [01:06:47] not working super well I'll confess we'll see how I get on um [01:06:50] we'll see how I get on um WT okay [01:06:53] WT okay um okay and so then we had the form of [01:06:57] um okay and so then we had the form of what we um wanted to use for the [01:07:01] what we um wanted to use for the probability and the probability of an [01:07:04] probability and the probability of an outside word given a context word is was [01:07:08] outside word given a context word is was then this soft maxed equation where [01:07:10] then this soft maxed equation where we're taking the x of the outside [01:07:16] we're taking the x of the outside vector and the center [01:07:20] vector and the center Vector over the normalization term where [01:07:24] Vector over the normalization term where we sum over the 
vocabulary. [01:07:38] Okay, so to work out how to change our parameters (our parameters are all of these word vectors, which we summarize inside theta), what we're going to want to do is work out the partial derivative of this objective function with respect to all the parameters theta. In particular, I'm just going to start here with the partial derivatives with respect to the center word, and we can work through the outside words separately. Well, this partial derivative is a big sum of terms like this, and when I have a partial derivative of a big sum of terms, I can work out the partial derivative of each term independently and then sum them. So what I want to be doing is working out the partial derivative of the log of this probability, which equals the log of that, with respect to the center vector.
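In symbols, the quantity about to be differentiated, and the split into numerator and denominator terms that the next step uses, is:

```latex
\frac{\partial}{\partial v_c} \log P(o \mid c)
  \;=\; \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
  \;=\; \frac{\partial}{\partial v_c} \, u_o^{\top} v_c
  \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c)
```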
And so at this point, [01:09:02] I have a log of two things being divided, and that means I can separate it out into the log of the numerator minus the log of the denominator. So what I'll be doing is working out the partial derivative, with respect to the center vector, of the log of the numerator, log exp(u_o^T v_c), minus the partial derivative, with respect to the center vector, of the log of the denominator, which is the log of the sum over w = 1 to V of exp(u_w^T v_c). Okay, I'm having real trouble writing here; look at the slides, where I wrote it neatly at home. So I want to work with these two terms now. At this point, part of it is easy, because here I just have a log of an exponential, and those two functions just cancel out and go away. And so then I want to get the partial derivative of u-outside transpose v-center with
respect to v-center. [01:10:44] And what you get for the answer to that is that it just comes out as u. Maybe you remember that, but if you don't, the thing to think about is: this is a whole vector, right? We've got a vector here and a vector here, so what this is going to look like is u1·v1 + u2·v2 + u3·v3, and so on. And what we're going to want to do is work out the partial derivative with respect to each element v_i. If you just think of the single-element derivative with respect to v1, well, it's going to be just u1, because every other term goes to zero; and if you worked it out with respect to v2, it would be just u2, and every other term goes to zero. And since you keep on doing that along the whole vector, what you're going to get out is the vector u1, u2, u3,
and so on, down the whole vector. [01:12:02] Okay, so that part is easy. But then we also want to work out the partial derivative of the other term, and at that point I maybe have to go to another slide. So we then want the partial derivative, with respect to v_c, of the log of the sum over w = 1 to V of exp(u_w^T v_c). At this point things aren't quite so easy, and we have to remember a little bit more calculus; in particular, what we have to remember is the chain rule. Here we have an inside function: we've got a function g of v_c, whose output we might call z, and then outside that we put an extra function f. And when we have something like that, what we get is that the derivative of f with respect to v_c is the derivative of f with respect to z times the
derivative of z with respect to v_c. [01:13:36] Right, that's the chain rule. So we're going to apply that here. First of all, we're going to take the derivative of the log, and the derivative of log is 1/x; you have to remember that, or look it up, or get Mathematica to do it for you, or something like that. So we're going to have one over the inside z part, the sum over w = 1 to V of exp(u_w^T v_c), and then that's going to be multiplied by the derivative of the inside part. So then we're going to have the derivative, with respect to v_c, of the sum over w = 1 to V of exp(u_w^T v_c). Okay, so that's made us a little bit of progress, but we've still got something to do here. And what we're going to do is notice: oh wait, we're again in a position to run the chain rule. So now we've got this
function. [01:15:13] Well, first of all, we can move the sum to the outside, because we've got a sum of terms, w = 1 to V, and we want to work out the derivative of the inside piece (sorry, I'm doing this kind of informally, just doing this piece now). Okay, so this again gives us a function f of a function g, and so we're going to want to split the pieces up and use the chain rule one more time. So we're going to have the sum over x = 1 to V, and now we have to know what the derivative of exp is, and the derivative of exp is exp, so that will be exp(u_x^T v_c); and then we're taking the derivative of the inside part with respect to v_c, of u_x^T v_c. Well, luckily, this was the bit that we already knew how to do, because we worked it out before, and so this is going to be the sum over x = 1 to V of exp(u_x^T v_c) times u_x. Okay, so then at this point we
want to combine these two forms together, so [01:16:39] we want to combine this part that we worked out and this piece here that [01:16:45] we've worked out, and if we combine them together with what we [01:16:52] worked out on the first slide for the numerator, since we have [01:17:00] u_o, which was the derivative of the numerator, then for the derivative [01:17:09] of the denominator we're going to have, on top, this part, and on the [01:17:17] bottom we're going to have that part, and so we can rewrite that as the sum over [01:17:23] x = 1 to V of exp(u_x^T v_c) times u_x, over the [01:17:40] sum over w = 1 to V of exp(u_w^T v_c). [01:17:56] Okay, so we can rearrange things in that form, and then lo and behold we find that [01:18:04] we've recreated here this form of the softmax equation, so we end up with [01:18:11] u_o minus the sum over x = 1 to V of the [01:18:18] probability of x given c, times u_x. [01:18:25] So what this is saying is we're
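The spoken math above is hard to follow without the slide, so here is a reconstruction of the calculation in the course's notation (this summary is mine, not verbatim from the slide): u_w are the outside vectors, v_c is the center vector, and o is the observed outside word.

```latex
% Chain rule on the log-denominator term:
\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c)
  = \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
    \sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x
  = \sum_{x=1}^{V} p(x \mid c)\, u_x

% Combining with the numerator term, whose derivative is u_o:
\frac{\partial}{\partial v_c} \log p(o \mid c)
  = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
```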
wanting [01:18:29] to have this quantity which takes the actual observed u vector and [01:18:35] compares it to the weighted prediction: [01:18:40] we're taking the weighted sum of our current u_x vectors, based on [01:18:46] how likely they were to occur. And so this is a form that you see quite [01:18:53] a bit in these kinds of derivations: you get observed minus expected, the [01:18:59] weighted average. And so what you'd like is for your expectation, the weighted [01:19:05] average, to be the same as what was observed, because then you'll get a [01:19:11] derivative of zero, which means that you've hit a [01:19:15] maximum. And so that gives us the form of the derivative [01:19:25] with respect to the [01:19:27] center vector parameters. To finish it [01:19:30] off you'd have to then work it out also for the outside vector parameters, but [01:19:34] hey, it's officially the end of class time, so I'd better
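The observed-minus-expected form can be checked numerically. This is a small sketch of my own (random toy vectors, not the lecture's code) comparing the analytic gradient u_o - sum_x p(x|c) u_x against a finite-difference estimate of the derivative of log p(o|c) with respect to v_c:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                              # toy vocabulary size and vector dimension
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w, one per row
v_c = rng.normal(scale=0.1, size=d)      # center vector
o = 2                                    # index of the observed outside word

def log_prob(v):
    """log p(o | c) = u_o . v - log sum_w exp(u_w . v)"""
    scores = U @ v
    return scores[o] - np.log(np.sum(np.exp(scores)))

# Analytic gradient: observed vector minus expected (probability-weighted) vector.
p = np.exp(U @ v_c)
p /= p.sum()                             # softmax p(x | c)
analytic = U[o] - p @ U                  # u_o - sum_x p(x|c) u_x

# Central finite-difference estimate, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

print(np.max(np.abs(analytic - numeric)))   # prints a very small number
```

The two gradients agree to within finite-difference error, which is the "computers will do this for you" point made just below.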
wrap up quickly now. But you know, the deal is we're going [01:19:42] to work out all of these derivatives for each parameter, and then these [01:19:48] derivatives will give a direction to change the numbers, which will let us find [01:19:53] good word vectors automatically. Um, I do want you to [01:19:58] understand how this works, but fortunately you'll find out very quickly [01:20:02] that computers will do this for you, and on a regular basis you don't actually [01:20:06] have to do it yourself. Um, more about that on Thursday. Okay, see you everyone.
================================================================================ LECTURE 002 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 2 - Word Vectors and Language Models
Source: https://www.youtube.com/watch?v=nBor4jfWetQ
---
Transcript
[00:00:05] Okay, I should try and get [00:00:08] started. Okay, so what we're going to do today is we're going to try and do [00:00:14] everything else that you need to know about word vectors and start to learn a [00:00:20] teeny bit about neural nets, and then we'll kind of get much further into sort of [00:00:24] doing
more with the math of neural nets [00:00:27] next week. So this is the general plan: um, I'm going to [00:00:33] finish up from where I was last time with optimization basics, then look a [00:00:38] little bit more at word2vec and word vectors, and then some of the variants of [00:00:43] word2vec. Um, and then I'm going to briefly consider alternatives, sort of: [00:00:48] what can you get from just counting words in different ways? Um, then we're [00:00:52] going to go on and talk a little bit about the evaluation of word vectors, um, [00:00:58] the topic of word senses, which already came up a couple of times last time [00:01:03] when people were asking questions, and then towards the end, um, start to [00:01:08] introduce the idea of classification, doing neural classification, and what neural [00:01:15] networks are about, which is something that we'll then expand on more in the [00:01:19] second week. Um, before I get into that, just notes
on course organization. Um, so [00:01:26] remember, the first assignment is already out and it's due before class next [00:01:33] Tuesday. Um, so then, our Python review session is going to be taught this [00:01:39] Friday, 3:30 to 4:20. It's not going to be taught here; it's going to be taught [00:01:44] in Gates B01, the Gates basement. Um, and I encourage everyone again to come to [00:01:50] office hours and help sessions. They've already started; they're listed on the [00:01:54] website. Um, we're having these sort of [00:01:58] office-hour help sessions in classrooms with multiple TAs, so just turn up if [00:02:05] you're on campus and you can be helped, and if you are on campus we'd like you [00:02:09] to just turn up, but we do also have a Zoom option for Stanford Online [00:02:16] students. Um, finally, I have office hours, which I have not yet opened but I will [00:02:21] open sometime tonight. Um, they're going to be on Monday afternoons. Now obviously, [00:02:27] given the number of people, not everyone can make it into my office hours, and I'm
[00:02:31] going to do these by appointment, so [00:02:33] they're 15-minute appointments on [00:02:36] Calendly. Um, but you know, I'm very happy to talk to some people. Um, and [00:02:43] uh, I put this little note at the end saying don't hog the slots. Um, some [00:02:47] people think it'd be a really good idea if they really work out how to sign up [00:02:51] every week for an office-hour session with me, and that's sort of a [00:02:56] little bit antisocial, um, so [00:03:00] think about that. Um, okay. So at the end of last time, I did a sort of bad job of [00:03:07] trying to write on slides, working out the derivatives of word2vec, [00:03:13] um, and hopefully you can read it much more clearly in the version [00:03:17] that appears on the website, where I was doing it at home more carefully. Um, so [00:03:23] that was saying that we had this loss function, and our job was to work out its [00:03:28] derivatives, which would tell us which direction to go to walk downhill. Um, and
[00:03:35] so I didn't really quite finish the loop here. So, you know, we have some cost [00:03:40] function that we want to minimize, and then we work out the gradient of that [00:03:45] function to work out which direction is downhill, and then [00:03:50] the simplest algorithm is that [00:03:58] we work out the direction downhill, we walk a little bit in that direction, [00:04:03] and then we repeat: we work out the gradient at this point, we walk downhill [00:04:05] a little bit, and we keep on going, and we'll get to the minimum. And with a sort [00:04:11] of one-dimensional function like this it's very simple, we're just [00:04:14] walking downhill, but when we have a function of many, many dimensions, when we [00:04:20] calculate the gradient at different points we might be starting to walk [00:04:24] in different directions, and so that's why we need to do calculus and have [00:04:28] gradients. And so this gives us the
basic algorithm of gradient [00:04:34] descent. Um, and so under the gradient descent algorithm, what we're doing is [00:04:41] that we've got our loss function J, we're working out its gradient, um, and [00:04:49] then we're taking a little multiple of the gradient, where that [00:04:54] alpha is our step size or learning rate. [00:04:57] Normally alpha is a very small number, something like 10^-3 [00:05:01] or 10^-4 or maybe even 10^-5, so we're taking a really little bit of the [00:05:07] gradient, and then we're subtracting it from our parameters to get new [00:05:13] parameters, and as we do that we will walk downhill. And the reason why we want [00:05:19] to have a small learning rate is we don't want to walk too fast: if from [00:05:23] here we worked out the gradient and said it's in this direction, and we just kept [00:05:28] on walking, we might end up way [00:05:31] over here, or if we had a really big step size we might even end up at a worse [00:05:35] point than
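The update rule just described (new parameters = old parameters minus alpha times the gradient) can be sketched in a few lines. This toy example is my own, not the lecture's code: it minimizes J(theta) = (theta - 3)^2 by walking downhill in small steps.

```python
# Toy gradient descent: minimize J(theta) = (theta - 3)^2.
# The gradient is dJ/dtheta = 2 * (theta - 3); with a small learning
# rate alpha we repeatedly take theta <- theta - alpha * gradient.
theta = 0.0
alpha = 0.1            # step size / learning rate
for _ in range(200):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad

print(theta)           # converges to roughly 3.0, the minimum of J
```

With a much larger alpha the iterates would overshoot, which is exactly the "end up at a worse point" failure mode described above.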
we started with. So we want to [00:05:37] take little steps to walk downhill, and so that's the very basic gradient [00:05:42] descent algorithm. Now, the very basic gradient descent algorithm we never use; [00:05:50] what we actually use is the next thing up, which is called stochastic gradient [00:05:55] descent. So the problem is, for the basic gradient descent algorithm, we've worked out, [00:06:03] for an entire set of data, what the objective function is and what the [00:06:12] slope at the point of evaluation is, and in general we've got a lot of data on [00:06:21] which we're computing models. So simply trying to calculate our objective [00:06:26] function over all of our data, the training data for the model, [00:06:32] would take us a very, very long time, um, and so that's very expensive to [00:06:38] compute, and we'd wait a very long time before we make even a single step [00:06:43] of gradient update. Um, so for neural nets, what
you're always doing is using this [00:06:48] variant that's called stochastic gradient descent. And so for stochastic [00:06:52] gradient descent, what that means is we pick a very small subset of our data, [00:06:57] like maybe we pick 16 or 32 data items, and we pretend that's all of our data, [00:07:04] and we evaluate the function J based on that small subset and work out the [00:07:09] gradient based on that small subset. So it's a noisy, inaccurate estimate of the [00:07:14] gradient, and we use that to be the direction in which we walk. Um, so that's [00:07:21] normally referred to also as having mini-batches, or mini-batch gradient [00:07:27] descent. Um, and in theory, working out the gradient based on this small subset [00:07:36] is an approximation, but one of the interesting things in the way things [00:07:41] have emerged in neural-network land is that it turns out neural networks actually [00:07:45] often work better when you throw some noise into the system: having this [00:07:50] noise in the system gives you jiggle and
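A minimal sketch of the mini-batch idea (toy data and model of my own, not from the course): each step estimates the gradient from a random handful of examples instead of the full dataset, so updates are cheap but noisy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem: find w minimizing mean((x_i * w - y_i)^2)
# over N examples. The data is built so the answer is w = 2.
N = 1000
x = rng.normal(size=N)
y = 2.0 * x

w = 0.0
alpha = 0.05
batch = 32                                  # a small subset, e.g. 16 or 32 items
for step in range(500):
    idx = rng.integers(0, N, size=batch)    # pick a random mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (xb * w - yb) * xb)  # noisy estimate of the full gradient
    w -= alpha * grad                       # walk a little bit downhill

print(w)   # close to 2.0
```

Each step touches 32 examples instead of 1000, yet the noisy direction is still, on average, downhill.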
moves things around, and so actually [00:07:57] stochastic gradient descent not only is way, way faster, but actually works better [00:08:02] as a system for optimization of neural networks. Okay, so if you remember from [00:08:09] last time, for word2vec the idea was we started by just saying each word has [00:08:16] a random vector representing it, so we will literally just get random [00:08:23] small numbers and fill up the vectors with those random small numbers. Um, [00:08:27] there's an important point there, which is you do have to initialize your [00:08:32] vectors with random small numbers: if you just leave all the vectors as zero, [00:08:37] then nothing works. Um, and that's because if everything starts off the same, you [00:08:44] get these sort of false symmetries, which means that you can't learn. So you always [00:08:49] do want to be initializing your vectors with random numbers. And then we're going [00:08:54] to go through each position in the [00:08:56] corpus
using our estimates, we're going [00:08:59] to try and predict the probability of words in the context, as we talked [00:09:03] about last time. So that gives us an objective function, from which we can [00:09:09] then look at our errors, look at our gradient, um, and update the vectors so [00:09:15] that they learn to predict surrounding words better. And so the incredible thing [00:09:20] is that we can do no more than that, and we end up learning word vectors which [00:09:27] actually capture quite a lot of the semantics, the meaning and relationships [00:09:32] between different words. So, you know, when this was first [00:09:38] discovered for these algorithms, I mean, it really feels like magic that you [00:09:43] can just do this simple [00:09:47] math over a lot of text and actually learn about the meanings of words; it's [00:09:53] just sort of surprising that something so simple could work. But [00:09:58] as time has gone on, this same recipe has
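The earlier claim that all-zero initialization breaks learning can be made concrete with the center-vector gradient from last lecture, u_o - sum_x p(x|c) u_x. This is a toy check of my own: with all vectors zero, every dot product is zero, the predicted distribution is uniform, and the gradient is exactly zero, so no update ever happens.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, o = 5, 3, 1        # toy vocabulary size, vector dimension, observed word index

def center_grad(U, v_c, o):
    """Gradient of log p(o|c) w.r.t. the center vector: u_o - sum_x p(x|c) u_x."""
    p = np.exp(U @ v_c)
    p /= p.sum()
    return U[o] - p @ U

# All-zero initialization: the softmax is uniform and the gradient is
# exactly zero, so gradient descent can never change anything.
U0, v0 = np.zeros((V, d)), np.zeros(d)
print(np.linalg.norm(center_grad(U0, v0, o)))   # 0.0

# Small random initialization breaks the symmetry: the gradient is nonzero.
U1 = rng.normal(scale=0.1, size=(V, d))
v1 = rng.normal(scale=0.1, size=d)
print(np.linalg.norm(center_grad(U1, v1, o)))   # > 0
```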
then been used for all kinds of learning [00:10:03] about [00:10:04] the behavior of language from neural [00:10:07] networks. Um, so let's just go through a sense of how that is, but before we do [00:10:14] that, [00:10:16] let me just mention: for our word2vec [00:10:19] algorithms, the only parameters of the model are these word vectors, the [00:10:25] outside word vectors and the center word vectors, which we actually treat as [00:10:30] disjoint, as I mentioned last time. And when we do the computations, we're [00:10:35] considering the dot product between the various possible outside words and [00:10:43] our center word, and we're using those to get a probability distribution over how [00:10:48] likely the model thinks that different outside words were, and then we're [00:10:53] comparing that to the actual outside word in the context, and that gives us [00:10:58] our source of error. So as such, this is what's referred to in NLP as a bag-of-words [00:11:04] model: it doesn't actually [00:11:06] know about
structure of sentences [00:11:08] know about the structure of sentences and or even what's to the left and [00:11:10] and or even what's to the left and what's to the right it's predicting [00:11:12] what's to the right it's predicting exactly the same probabilities at each [00:11:15] exactly the same probabilities at each position to the left or right um but [00:11:17] position to the left or right um but it's wanting to know about what kind of [00:11:19] it's wanting to know about what kind of words appear in the context of the [00:11:21] words appear in the context of the center word um so I just wanted to uh [00:11:26] center word um so I just wanted to uh stop this for a minute and um [00:11:30] stop this for a minute and um let's see not that one [00:11:38] um so give you some kind of a sense that [00:11:42] um so give you some kind of a sense that this really um does work um so this is a [00:11:45] this really um does work um so this is a little Jupiter notebook um that I've got [00:11:48] little Jupiter notebook um that I've got um for this [00:11:51] um for this um okay and so this is using and here [00:11:54] um okay and so this is using and here I'm using a package um gen Sim which we [00:11:57] I'm using a package um gen Sim which we don't continue to use after that really [00:12:02] don't continue to use after that really um but it's sort of one package that let [00:12:04] um but it's sort of one package that let you load and play with word vectors and [00:12:08] you load and play with word vectors and the word vectors I'm going to use here [00:12:10] the word vectors I'm going to use here are are glove word vectors and actually [00:12:13] are are glove word vectors and actually I'm going to um glove was a model we [00:12:16] I'm going to um glove was a model we built at Stanford and I'm going to [00:12:18] built at Stanford and I'm going to actually talk about it a little bit [00:12:19] actually talk about it a little bit later um so strictly speaking 
um, these [00:12:23] aren't exactly word2vec word vectors, but they behave in exactly the same way. [00:12:28] Um, and so, okay, now it's loaded up my word vectors, because the word vectors [00:12:34] are a big data file. And as we've [00:12:37] discussed, um, for a word, right, the representation of any word, here it's "bread", [00:12:44] is just a vector of real numbers. [00:12:48] Right, so I'm using 100-dimensional word vectors to keep things quicker [00:12:54] for my class demo. So this is the word "bread", um, and then I can say, well, what's [00:13:00] the representation for [00:13:02] "croissant"? [00:13:04] Um, and this is croissant, and we can sort of get a visual sense of: oh, they're [00:13:11] at least a little bit similar, right? The first components are both negative, [00:13:16] the second components are both positive, the third components are both negative [00:13:21] and large, the fourth components are both positive. Right, they seem like [00:13:26] they're kind of similar
vectors. So that seems kind of hopeful, because that means [00:13:31] that it knows that bread and croissant are a bit similar to each other. Um, [00:13:37] this package has a nice simple function where, rather than doing that by [00:13:41] hand, you can just ask it about all the word vectors and say which ones are most [00:13:47] similar. So I can ask it, um, what words in its vocabulary are most similar to "usa", [00:13:54] and in this model everything's been lowercased, I should mention, um, and so if [00:13:59] I do that, it has "canada", "america", "u.s.a.", [00:14:03] then "united states", "australia". Well, those seem a fairly reasonable list of most [00:14:08] similar words, though you might think it's a little strange that "canada" wins [00:14:12] out over the "u.s.a" with the dots in it. [00:14:16] Um, similarly, I can ask what's most similar to "banana", and I get coconut, [00:14:21] mango, bananas, potato, pineapple, fruit, etc.; [00:14:25] again, pretty sensible, you know, with a little bit of a bias to more tropical fruits. Or [00:14:30] I can go to croissant and ask
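Under the hood, Gensim's most_similar is essentially a cosine-similarity ranking over all the word vectors. Here is a self-contained sketch of my own of that computation, using made-up 4-dimensional toy vectors (not the real 100-dimensional GloVe numbers):

```python
import numpy as np

# Made-up "word vectors" purely for illustration; real GloVe vectors
# are learned from corpus co-occurrence statistics.
vecs = {
    "bread":     np.array([-0.2,  0.4, -0.7,  0.3]),
    "croissant": np.array([-0.1,  0.3, -0.6,  0.4]),
    "banana":    np.array([ 0.5, -0.2,  0.1,  0.6]),
    "coconut":   np.array([ 0.4, -0.1,  0.2,  0.5]),
}

def most_similar(word, k=3):
    """Rank all other words by cosine similarity, like Gensim's most_similar."""
    v = vecs[word]
    sims = {}
    for w, u in vecs.items():
        if w == word:
            continue
        sims[w] = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(most_similar("bread"))   # "croissant" ranks first
```

With the real vectors loaded into Gensim, the call in the demo is just model.most_similar("bread").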
what's most [00:14:33] similar to croissant; the most similar things [00:14:35] to croissant aren't bread, but things like [00:14:37] brioche, baguette, focaccia, um, which sort of basically makes sense, though here's [00:14:42] "pudding" here. Um, and I've got... wait, I'd [00:14:46] already done... oh, sorry, yeah, I remember [00:14:49] what this is, right. So with this most_similar [00:14:52] you've got a positive word vector, and you're saying what other [00:14:56] words are most similar in position to [00:14:58] that. Um, there's something else you can do, which is you can say: [00:15:04] let me take the negative of that word vector and say what's most similar to the [00:15:09] negative of it, and you could possibly think that that might be useful to find [00:15:14] antonyms or something like that. I mean, the truth is, it isn't: if you ask for the [00:15:20] things that are most similar to the negative of the banana vector, um, and for [00:15:24] most other vectors it's the same, you get [00:15:27] out these weirdo things, things that [00:15:29] you're
You're not really sure if they're words at all, or maybe they're words in some other language, or some of them are names, right, like Shichi is a Japanese name, but it's not very useful stuff; they don't really feel like a negative of banana. [00:15:47] But it turns out that from there we get this powerful ability that was observed for word2vec, which is that we could isolate semantic components and then put them together in interesting ways. [00:16:04] So, looking at this picture, what we could do is start with a positive vector for king, from the origin up to king; then we could use the negation to subtract out the vector for man; and then we could add on another positive vector, the vector for woman. [00:16:25] And then we can ask the model: if you're over here in the space, what is the nearest word to you over there? [00:16:38] And that's what this next thing does, right: it says positive vector for king, negative for man, also positive for woman; where does that get you to? And that gets you to queen, yay. [00:16:50] And so this was the most celebrated property that was discovered with these word vectors: they weren't only good for meaning similarity, they were also good for working with these kinds of meaning components. These got referred to as analogies, because you can think of them as "a is to b as c is to what?" So it's sort of like, man is to king as woman is to what, in the analogies. [00:17:29] And so here I've defined a little function that just automates that and will compute analogies. So now I can ask it, in just this analogy format: man is to king as woman is to, and it comes back with queen.
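The analogy helper he demonstrates amounts to a most_similar query with positive = [b, c] and negative = [a]: find the word nearest to vec(b) − vec(a) + vec(c), excluding the three input words. A self-contained sketch, where the four toy vectors are hypothetical, chosen so the arithmetic works out exactly:

```python
import numpy as np

# Hypothetical toy vectors, laid out so the gender and royalty
# "components" line up; the demo uses real trained embeddings.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def analogy(a, b, c):
    """a is to b as c is to ?  (nearest word to vec(b) - vec(a) + vec(c))."""
    query = vectors[b] - vectors[a] + vectors[c]
    best, best_cos = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # conventionally, never return one of the input words
        cos = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if cos > best_cos:
            best, best_cos = word, cos
    return best

print(analogy("man", "king", "woman"))  # -> queen with these toy vectors
```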
[00:17:46] That one was sort of the canonical example, but you can actually have some fun with this. I mean, this is pretty old-fashioned stuff; I feel like I'm maybe, at this point, an old guy talking about how much fun we used to have sitting around the radio listening to radio plays, because basically no one uses this stuff anymore, and there are much, much better and fancier things like ChatGPT. But back in the day, when I was younger, it was really stunning already just how this very simple model, built on very simple data, could have quite good semantic understanding and do quite good analogies. [00:18:37] So you can actually play with this quite a bit and have a bit of fun. You can do something like: analogy, Australia is to beer as France is to, okay, what do people think the answer will be? [00:18:54] Close. The answer it gives us is champagne, and that seems a pretty good answer. I could then put in Russia; what do people think? Vodka, yeah, you can get back vodka. This actually works kind of interestingly. [00:19:15] I could do a different one, test something different. I can do something like: pencil is to sketching as camera is to, photographing. Yeah, that works quite well. [00:19:34] Now, we built this model in 2014, so it's a little bit out of date on politics; we can't do the last decade of politics, which is maybe unfortunate, but we could try out older politics questions. So we could try: Obama is to Clinton as Reagan is to, if you remember your US history class, any guesses what it's going to say? There's a Bush, one. Any other ideas? Some people have different opinions of Bill Clinton.
[00:20:22] What it answers is Nixon, which I think is actually kind of fair. [00:20:31] But you can also get it to do some just sort of syntactic facts about the language. So you can do something like: tall is to tallest as long is to, oops, this one's easy. [00:20:53] So with this simple method of learning, with this simple bag-of-words model, it's enough to learn a lot about the semantics of words, and stuff that's beyond conventional semantics too, right? Like our example with Australia is to beer as Russia is to vodka; that's sort of cultural world knowledge, which goes a little bit beyond what people normally think of as word-meaning semantics, but it's also in there. [00:21:26] Yes? If you subtract the distance from, let's say, man to king, does that also capture a concept of the relationship between the two words? Like, would that give you back something like ruler? Does taking the difference between two vectors capture some... [00:21:43] Right, the distance between man, so man compared to king, should be a ruler concept. But isn't that what I'm using? Because then I'm taking the distance between man and king, and that's what I'm adding on to woman to get to queen, right? Right, yeah. [00:22:09] So, depending on which thing you think of as the analogy, you've got both a difference vector between words that gives you a gender analogy and one that gives you a ruler analogy. Yeah, absolutely. [00:22:30] Any other questions? Yeah: in word2vec we get two vectors for each word, a u and a v, but here you only have one vector, so how do you go from two to one?
[00:22:49] Yeah, good question. The commonest way in practice is you just average the two of them, and really you find out that they end up very close anyway. [00:23:02] Because, if you think about it, since you're going along every position of the text, you'll get both cases: if the text is, you know, "the octopus has legs," you're going to have octopus in the center with legs in the context, and a couple of time steps later it's going to be legs in the center with octopus in the context. So, although they vary a bit, basically they end up very similar, and people normally just average them.
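That final averaging step is one line once training has produced the two embedding matrices; a sketch where the names U and V (matching the u/v notation in the question) and the random values are just stand-ins for trained center-word and context-word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4

# U: center-word ("inside") vectors, V: context ("outside") vectors.
# After training these end up very similar; random here just for shape.
U = rng.normal(size=(vocab_size, dim))
V = rng.normal(size=(vocab_size, dim))

# The single vector per word that people usually keep.
word_vectors = (U + V) / 2.0
print(word_vectors.shape)
```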
[00:23:29] Can you use this process, taking the answer of one analogy and placing it into the analogy function of another, and see how far away you can go before it starts to break down? [00:23:44] I think you can. So, wait, you're wanting, how far away, how distant a relation between two words can you do before it starts giving incorrect relationships? Are you wanting to make two steps from somewhere, or, yeah, step many steps and go away from the start? [00:24:09] So, it doesn't always work; I mean, there are examples that fail. I'm sort of shy to try that now, because I don't have a predefined function that does it, and that might take me too long, but you could play with it at home and see how it works for you. [00:24:29] Just as a clarification: why is it that we use two separate sets of vectors for each word? Is it just to get more parameters, or is there... I'll get back to that; maybe I should go on at this point. Let me move on and get through some more details of the word2vec algorithm. [00:24:52] So, just a technical point on this class, so you don't make any big mistakes and waste your weekend.
[00:25:03] For most instances of CS224N, we've actually had people implement word2vec from scratch as assignment 2. But for this quarter, doing it in spring quarter, and as you probably know, spring quarter is actually a little shorter than the other two quarters, we decided to skip having people implement word2vec. So don't look at the old assignment 2 that says "implement word2vec," or else you'll be misspending your time; wait for the newer assignment 2 to come out. [00:25:31] But despite that, let me just say a little bit more about some of the details. So, why two vectors? Having two vectors just makes the math a little bit easier. [00:25:46] Think about the math: if you have the same vectors for the center word and for the outside words, then whatever the center word is, let's say it's octopus, when you're going through trying out every possible context word for the normalization, at some point you'll hit octopus again, [00:26:07] and at that point you'll have a quadratic term, the x-squared of the octopus vector, and that kind of messes things up. I mean, you're clever people, you could work out the math of it, but it makes the math more of a mess, right, because every other term is something different, just like an x, and then at one position you've got an x-squared. [00:26:34] So they kept it really simple by just having them be disjoint vectors. It doesn't make it better; it actually turns out it works a fraction better if you do it right, but in practice people have usually just estimated them separately and then averaged them at the end. [00:26:55] If you actually look at the paper, and you can find it, the 2013 paper, there's actually
a family of methods that they describe. They described two methods: one in which you have an inside word that's predicting the words around it, and another that tried to predict the center word from all the words in the context, which was called continuous bag of words in their paper. [00:27:24] The one that I've described is skip-gram, which is simpler and works just great. [00:27:31] But then the other part of it is working out what loss function to use for training, and what I've presented so far is the naive softmax, where we just consider every possible choice of a context word and run all the math. [00:27:52] That's totally doable, and with our modern super-fast computers it's not even that unreasonable; we do things like this all the time. But at least at the time they wrote their paper, this seemed kind of expensive, and they considered other alternatives.
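For concreteness, the naive softmax being contrasted here is p(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c), where the denominator sums over the entire vocabulary. A toy numpy sketch, with a tiny vocabulary and random stand-in vectors, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U = rng.normal(size=(vocab_size, dim))  # outside (context) vectors u_w
v_c = rng.normal(size=dim)              # center word vector v_c

# Naive softmax: score every vocabulary word against the center word.
# With a real 400,000-word vocab this denominator is the expensive part.
scores = U @ v_c                        # one dot product per vocab word
p = np.exp(scores) / np.exp(scores).sum()

print(p.sum())  # probabilities over the whole vocab sum to 1
```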
[00:28:08] One of those is a hierarchical softmax, which I'm not going to explain right now, but I do just want to explain negative sampling. [00:28:17] Okay, so this is just to see a bit of a different way of doing things. For what we did last time, we had this straightforward softmax equation, and in the denominator you're summing over every word in the vocabulary. [00:28:36] You might have 400,000 words in your vocabulary, there are a lot of words in human languages, and that's kind of a big sum, especially when, for each element of the sum, you're taking a dot product between 100-dimensional or 300-dimensional vectors and then exponentiating it; a lot of math going on in there. [00:28:59] So maybe we could short-circuit that. The idea of negative sampling was to say: well, rather than evaluating it for every single possible word, maybe we
could just train some simple logistic regressions, which are going to say you should like the true word that's in the context, and, if we randomly pick a few other words, you shouldn't like them very much. And that's skip-gram negative sampling. [00:29:30] So that's what it looks like as an equation. We've got our center word and our actual context word, and we work out the term for the actual center word: we'd like this to be high probability, so, since we're minimizing, we're going to negate it and have it go down. Then we're going to sample some other words, and we'd like the opposite for them. [00:30:02] But the other thing that we've changed here is that we're not using the softmax anymore; we're using this sigma, which stands for the logistic function, often called the sigmoid. Sigmoid just means s-shaped, but
[00:30:18] you could actually have an infinity of s-shaped functions, and the one we actually use is the logistic function. The logistic function has this form, and it maps from any real number to a probability between zero and one. [00:30:36] So what we're wanting to say at that point is: for the real outside word, we're hoping that this dot product is large, so that its probability is near one, and that will help with the minimization. And for the other words, we'd like their probability to be small, so we'd like them to appear sort of over here. [00:31:05] And that's what this is calculating, though as written it's sticking the minus sign on the inside, which works because the logistic function is symmetric: if you want to be over here, then if you negate it, you'll be on this side, which will be large.
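Written out, the skip-gram negative-sampling objective for one (center, outside) pair is J = −log σ(u_o · v_c) − Σ_k log σ(−u_k · v_c), summing over the K sampled negatives. A numpy sketch with random stand-in vectors (all values illustrative, not trained):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, num_neg = 4, 5

v_c = rng.normal(size=dim)               # center word vector
u_o = rng.normal(size=dim)               # true outside word vector
U_neg = rng.normal(size=(num_neg, dim))  # K sampled negative word vectors

# Like the true pair: push sigma(u_o . v_c) toward 1.
# Dislike the sampled pairs: push sigma(-u_k . v_c) toward 1,
# i.e. sigma(u_k . v_c) toward 0. Minimizing J does both.
loss = -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-(U_neg @ v_c))).sum()
print(loss)
```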
[00:31:30] Okay, and so then the final bit of this, which is the asterisk, is that we're going to pick a few words, maybe only five or ten, as our negative samples. [00:31:38] For picking those words, what works well is not just to pick uniformly at random from all the 400,000 words in our vocab; what you basically want to do is pay attention to how common the words are. Something like "the" is a really common word, and so we refer to the unigram distribution; that means you're just taking individual words independently, by how common they are, so about 10% of the time you'd be choosing "the". [00:32:15] So that's roughly what you want to do for sampling, but people have found that you can actually do even a bit better than that. The standard thing they presented for word2vec is that you take the unigram probability of the word and raise it to the power of 3/4. [00:32:31] What does that end up
doing? [00:32:35] Question for the audience: if I take probabilities and raise them to the three-quarters power, what happens? Some less frequent words become more likely? Correct, yeah. [00:32:51] Raising to the 3/4 power means that you're somewhat upping the probability of the less frequent words. So you're sort of in between: between having every word uniform and using exactly their relative frequencies in the text, you're moving a little bit in the direction of uniform. And you get better results by going somewhat in the direction of sampling more uniformly, but you don't want to go all the way there, which would correspond to putting a zero in there rather than three-quarters. [00:33:34] Okay.
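The 3/4-power adjustment is easy to see with made-up counts: raise the raw unigram counts to 0.75 and renormalize, and the rare words' sampling probabilities rise while the frequent words' fall (the counts below are invented for illustration):

```python
import numpy as np

# Made-up corpus counts: the first word is very frequent, the last is rare.
counts = np.array([1000.0, 100.0, 10.0, 1.0])
unigram = counts / counts.sum()

# Raising to the 3/4 power flattens the distribution part-way toward
# uniform (power 0 would be fully uniform, power 1 the raw frequencies).
powered = counts ** 0.75
neg_sampling_dist = powered / powered.sum()

print(unigram)
print(neg_sampling_dist)
```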
[00:33:38] Yeah, okay, let's see, I had a slide here, but time rushes along, so let's not bother with that slide; it's not that important. [00:33:47] So that's the word2vec algorithm, and we've now seen it in its different forms. A reasonable thing to wonder at this point is: this seems kind of a weird way of doing what we're wanting to do, right? [00:34:07] The idea is: look, we have this text, we have words, and we have words in the context of words. It sort of seems like an obvious thing to do would be to say, let's just count some statistics. We have words, and there are other words that occur in their context, so let's just see how often the word swim occurs next to octopus, and how often the word fish occurs next to octopus; let's get some counts [00:34:38] and see how often words occur in the context of other words, and maybe we could use that to calculate some form of word vectors. [00:34:48] And that's something that people have also considered. If we use the same kind of idea of a context window, we could just make a matrix of how often words
occur [00:34:58] make a matrix of how often words occur in the context of other words and so you [00:35:01] in the context of other words and so you know here's a baby example my Corpus is [00:35:04] know here's a baby example my Corpus is I like deep learning I like NLP I enjoy [00:35:07] I like deep learning I like NLP I enjoy flying um and my context window I'm [00:35:10] flying um and my context window I'm using is just one word to the left and [00:35:12] using is just one word to the left and the right and then I can make this kind [00:35:15] the right and then I can make this kind of um co-occurrence count Matrix um [00:35:19] of um co-occurrence count Matrix um where I'm putting in the counts of [00:35:21] where I'm putting in the counts of different words in every context and you [00:35:24] different words in every context and you know because my Corpus is so small um [00:35:27] know because my Corpus is so small um every thing in the Matrix is a zero or [00:35:30] every thing in the Matrix is a zero or one except for right here where I've got [00:35:31] one except for right here where I've got the twos because I have I like twice [00:35:34] the twos because I have I like twice right but in principle I've got a matrix [00:35:37] right but in principle I've got a matrix of counts for all the different counts [00:35:39] of counts for all the different counts here um so maybe you know this gives [00:35:43] here um so maybe you know this gives this gives me a word Vector right you [00:35:45] this gives me a word Vector right you know here's a word Vector for deep um is [00:35:49] know here's a word Vector for deep um is this long Vector here and you know I [00:35:51] this long Vector here and you know I could just say that is my word vector [00:35:53] could just say that is my word vector and indeed sometimes people have done [00:35:56] and indeed sometimes people have done that but they're kind of of ungainly [00:35:58] that but they're kind of of ungainly 
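As a sketch of this counting idea (my own illustration, not code from the lecture), here is one way to build that co-occurrence count matrix for the toy corpus with a window of one word on each side:

```python
# Build the window-1 co-occurrence count matrix for the lecture's toy corpus.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = [[0] * len(vocab) for _ in vocab]
for sent in sentences:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):            # one word to the left and right
            if 0 <= j < len(sent):
                counts[idx[w]][idx[sent[j]]] += 1

# "I" and "like" co-occur twice, matching the twos in the slide's matrix
print(counts[idx["I"]][idx["like"]])        # -> 2
```

The row `counts[idx["deep"]]` is then exactly the "long vector" for "deep" described above.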
[00:36:02] Because if we have 400,000 words in our vocabulary, the size of this matrix is 400,000 by 400,000, which is a lot worse than our word2vec word vectors: if we're making those only 100-dimensional, we've only got 400,000 by 100, which is still a big number, but a lot smaller than 400,000 times 400,000. So that's inconvenient. When people have started with these kinds of co-occurrence matrices, the general thing they've done is to say: somehow we want to reduce the dimensionality of that matrix, so that we have a smaller matrix to deal with. And how can we reduce the dimensionality of the matrix? At this point, if you remember your linear algebra and stuff like that, you should be thinking of things like PCA, and in particular, if you want it to work for any matrix of any shape, there's the singular value decomposition.

[00:37:07] So the classic singular value decomposition: any matrix can be rewritten as a product of three matrices, a U and a V, which are both orthonormal, which means you get these independent vectors that are orthogonal to each other, and in the middle we have the singular values, which are ordered by size, with the most important one first; they're sort of weighting terms on the different dimensions. So this is the full SVD decomposition, but part of it is irrelevant, because if I've got this picture, nothing is happening in the part that's shown in yellow there. At the moment this is just a full decomposition, but if we're wanting to have smaller, low-dimensional vectors, well, the next trick we pull is to say: we know where the smallest singular values are, so we could just set them to zero. [00:38:18] If we do that, more of this goes away, and we end up with, say, a two-dimensional representation of our words. So that gives us another way of forming low-dimensional word representations. And this had actually been explored before modern neural word vectors, using algorithms such as latent semantic analysis. It sort of half worked, but it never worked very well. Some people, especially in psychology, kept on working on it, and among other people, in the early 2000s there was this grad student, Doug Rohde, who kept on working on it, and he came up with an algorithm that he called COALS. He knew, as other people before him had known, that just doing an SVD on raw counts didn't seem to give you word vectors that worked very well.
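The truncation trick described above, zeroing out the smallest singular values and keeping only the leading dimensions, can be sketched with NumPy (the matrix below is a made-up toy, not the lecture's):

```python
import numpy as np

# Toy symmetric co-occurrence matrix (rows/columns index words).
X = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])

U, S, Vt = np.linalg.svd(X)      # full SVD: X = U @ diag(S) @ Vt
k = 2                            # keep only the 2 largest singular values
word_vectors = U[:, :k] * S[:k]  # k-dimensional vector per word

print(word_vectors.shape)        # -> (4, 2)
```

Dropping the trailing singular values gives the best rank-k approximation of X in the least-squares sense, which is why this is the standard way to compress a count matrix.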
[00:39:28] But he had some ideas to do better than that. One thing that helps a lot is if you log the frequencies, so you can put log frequencies in the cells. He also used some other ideas, some of which were also picked up in word2vec. One is ramping the windows, so that you count closer words more than further-away words; he used Pearson correlations instead of counts, etc. He ended up coming up with a low-dimensional version of word vectors that is ultimately still based on an SVD. [00:40:07] And interestingly, no one really noticed at the time, but Doug Rohde, in his dissertation, effectively discovered this same property of having linear semantic components. Look, here we go, this is actually a picture from his dissertation: we've got this meaning component, which is "doer of an event," and he's essentially shown, with the way he's processed his word vectors, that the doer of an event is a linear meaning component that you can use to move between a verb and the doer of the verb. Kind of cool, but he didn't become famous, because no one was paying attention to what he had come up with.

[00:40:52] So once word2vec became popular, that was something that I was kind of interested in, and working together with a postdoc, Jeffrey Pennington, we thought that there was interest in this space of doing things with matrices of counts, and in how you then get them to work well as word vectors, in the same way that word2vec worked well. And that's what led into the GloVe algorithm, which is what I was actually showing you earlier. What we wanted was a model in which linear components, adding or subtracting a vector in the vector space, correspond to a meaning difference. How can we do that? Jeffrey did some good thinking and math, thought about that for a bit, and his solution was to say: ratios of co-occurrence probabilities can encode meaning components, so if we can make a ratio of co-occurrence probabilities into something linear in the vector space, we'll get the kind of result that word2vec or Doug Rohde got.

[00:42:20] So what does that mean? Well, if you start thinking of words occurring in the context of "ice," you might think that "solid" and "water" are likely to occur near "ice," and that "gas," or a random word like "random," aren't. Similarly for "steam": you'd expect that "gas" and "water" are likely to occur near "steam," but probably not "solid" or "random." [00:42:51] If you're just looking at one of these probabilities, you don't really get meaning components, because you just get something that's large here or large there. But if you then look at the ratio of two of these co-occurrence probabilities, what you get out is that for "solid" it's going to be large, and for "gas" it's going to be small, so you're getting a direction in the space that corresponds to the solid-liquid-gas dimension of physics, whereas for the other words it will be about one. That's just waving your hands, the conception of the idea, but if you actually do the counts, it works out: using real data, you do indeed get these sorts of factors of 10 in both directions for those two, and the other numbers are approximately one.

[00:43:57] So Jeffrey's idea was: we're going to start with a co-occurrence count matrix, and we want to turn this into a linear component. How do you do that? Well, first of all, it makes sense immediately that you should be putting a log in, because once you put a log in, this ratio is turned into something that's subtracted. So all you have to do is have a log-bilinear model, where the dot product of two word vectors models this conditional probability, and then the difference between two vectors will correspond to the log of the ratio of their co-occurrence probabilities. [00:44:44] That was basically the GloVe model: you want to model this dot product such that it's close to the log of the co-occurrence probability, but you do a little bit of extra work to add some bias terms and some frequency thresholds, which aren't very important.
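As a rough illustration of that objective (a toy sketch under my own assumptions: tiny made-up counts, plain gradient descent, nothing like the paper's training setup), the idea is to make the dot product plus biases match the log co-occurrence count, weighted by a capped function of frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[10., 2., 1.],      # toy co-occurrence counts (made up)
              [ 2., 8., 3.],
              [ 1., 3., 6.]])
logX = np.log(X)
V, d, lr = 3, 4, 0.05

W  = rng.normal(scale=0.1, size=(V, d))   # word vectors
Wc = rng.normal(scale=0.1, size=(V, d))   # context vectors
b, bc = np.zeros(V), np.zeros(V)          # bias terms
f = np.minimum(1.0, X / X.max()) ** 0.75  # frequency weighting, capped at 1

def loss():
    err = W @ Wc.T + b[:, None] + bc[None, :] - logX
    return float((f * err**2).sum())

loss0 = loss()
# Minimize sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
for _ in range(2000):
    err = W @ Wc.T + b[:, None] + bc[None, :] - logX
    g = f * err                           # weighted residuals
    W, Wc = W - lr * g @ Wc, Wc - lr * g.T @ W
    b, bc = b - lr * g.sum(axis=1), bc - lr * g.sum(axis=0)

print(loss() < loss0)                     # -> True: the fit improves
```

Because the dot products approximate log counts, a difference of two word vectors then approximates a log of a ratio of co-occurrence probabilities, which is exactly the linear-meaning-component property described above.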
[00:45:07] So I'm going to skip past those bias and threshold details, but I think that basic intuition about what the important thing is to get linear meaning components is a good one to know about. Okay, is everyone good today? Cool. Yes?

[Student] I noticed the original X matrix you showed was like 3 by 5 or something. Shouldn't it be square?

Yeah, maybe I should have just shown you a square one. If you're just doing vocabulary by vocabulary, yes, it should be square. There was a bit in the slides that I didn't mention: there's another way you could do it, where you do words versus documents, and then it would be non-square. But yeah, you're right, so let's just consider the square case.

[00:46:00] Okay. So, hey, I showed you that demo of the GloVe vectors, and they worked great, didn't they? So these are good vectors. But in general in NLP we'd like to have things that we can evaluate, and know whether things are really good. Everywhere through the course we're going to want to evaluate things and work out how good they are, and what's better and what's worse. One of the fundamental notions of evaluation that will come up again and again is intrinsic versus extrinsic evaluation. An intrinsic evaluation is where you're doing a very specific internal subtask, and you just try to score whether it's good or bad. Normally intrinsic evaluations are fast to compute and help you understand the component you're building, but they're sort of distant from your downstream task, and improving the numbers internally may or may not help you. That's the contrast with an extrinsic evaluation, [00:47:13] where you've got some real task you want to do, question answering or document summarization or machine translation, and you want to know whether some clever bit of internal modeling will help you on that task. Then you have to run an entire system and work out downstream accuracies, and find out whether it actually helps you at the end of the day. But that often means it's kind of indirect, so it's harder to see exactly what's happening. So for something like word vectors: if we just measure whether they're modeling word similarity well, that's an intrinsic evaluation, but we'd probably like to know whether they model word similarity well for some downstream task, which might be doing web search. We'd like "cell phone" and "mobile phone" to come out at about the same, so web search [00:48:17] might be our extrinsic evaluation.

Okay, so for word vectors, here are two intrinsic evaluations, the ones we've already seen. First, there are the word vector analogies. I cheated when I showed you the GloVe demo: I only showed you ones that work, but if you play with it yourself, you can find some that don't. So what we can do is have a set of word analogies and find out which ones work. Now, in general GloVe does work. Here's a set of word vectors showing the sort of male-female distinction, and it's kind of good and linear. But for different analogies it's sometimes going to work and sometimes not, and you're going to be able to score what percentage of the time it works. [00:49:12] Or we can do word similarity. How we do word similarity is we actually use human judgments of similarity: psychologists ask undergrads how similar two words are.
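Before moving on to similarity, the analogy test just described can be sketched as follows (the 2-D toy vectors are hypothetical, purely for illustration, not real GloVe embeddings): for a : b :: c : ?, take b - a + c and return the nearest remaining word by cosine similarity.

```python
import numpy as np

vectors = {                          # hypothetical toy embeddings
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.5, 0.8]),
    "woman": np.array([0.5, 0.2]),
    "apple": np.array([0.1, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """a : b :: c : ?  via  b - a + c, excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "woman", "king"))   # -> queen
```

Scoring the percentage of analogies answered correctly over a fixed test set is exactly the intrinsic evaluation described above.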
[00:49:26] They say: here are the words "plane" and "car"; how similar are they on a scale of 1 to 10, or 0 to 10? Actually, I think it's 0 to 10 here. And the person says seven, and then they ask another person, and they average what the undergrads say, and they come out with these numbers. So "tiger" and "tiger" gets 10, "book" and "paper" got an average of 7.46, "plane" and "car" got 5.77, "stock" and "phone" got 1.62, and "stock" and "jaguar" got 0.92. It's a noisy process, but you roughly get to see how similar people think words are. So then we ask our models to also score how similar they think words are, and we measure how well the scores are correlated between the human judgments and our models' judgments. [00:50:26] And so here's a big table of numbers that we don't need to go through all of, but it sort of shows that a plain SVD works terribly.
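That correlation step can be sketched like this. The human scores are the ones quoted above, while the model cosines are hypothetical numbers; a Spearman rank correlation (implemented here with NumPy only, ignoring ties) is a common choice for this comparison:

```python
import numpy as np

# Averaged human similarity judgments quoted above (0-10 scale).
human = np.array([10.0, 7.46, 5.77, 1.62, 0.92])
# Hypothetical model cosine similarities for the same word pairs.
model = np.array([0.99, 0.81, 0.63, 0.20, 0.05])

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

print(spearman(human, model))   # -> 1.0 (the model ranks the pairs identically)
```

A rank correlation is used because only the ordering of the pairs needs to agree; the human 0-10 scale and the model's cosine scale are not directly comparable.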
[00:50:38] Simply doing an SVD over log counts already starts to work reasonably. Then here are the two word2vec algorithms, CBOW and skip-gram, and here are numbers from our GloVe vectors; you get these kinds of scores, with which you can rate different models as to how good they are.

[00:51:00] And then you can also... oh, sorry, yeah, that's the only thing I have there. But what can you do for downstream evaluation? Well, then you want to pick some downstream task, and a simple downstream task that's been used a lot in NLP is what's called named entity recognition. That's recognizing names of things and what type they are. So if the sentence is "Chris Manning lives in Palo Alto," you want to say "Chris" and "Manning," that's the name of a person, and "Palo" and "Alto," that's the name of a place. So that can be the task, and that's the kind of task [00:51:42] which you might think word vectors would help you with, and it's indeed the case. So what's labeled "discrete" here was a baseline symbolic, probabilistic named entity recognition system, and by putting word vectors into it you can make the numbers go up: these numbers for GloVe are higher than the ones on the first line, so I'm getting substantial improvements from adding word vectors to my system. Yay.

[00:52:18] Okay, I'll plow ahead into the next thing. This next one I think is interesting, we should spend a minute on it, and it came up in your questions last time: words have lots of meanings. Most words have a whole bunch of meanings; the only words that don't have a lot of different meanings are some very specialized scientific words. Okay, so my example of a word with multiple meanings is probably not the first one you think of all the time.
the most famous example of a word with a lot of meanings is bank which already [00:52:52] lot of meanings is bank which already came up last time and I use star which [00:52:55] came up last time and I use star which is another one here's a word that you [00:52:57] is another one here's a word that you probably don't use that often um but it [00:52:59] probably don't use that often um but it you know it still has lots of meaning so [00:53:01] you know it still has lots of meaning so the word Pike what are some things that [00:53:03] the word Pike what are some things that the word Pike can [00:53:05] the word Pike can mean fish a fish yes it's a kind of fish [00:53:08] mean fish a fish yes it's a kind of fish okay we've got one what else can a pike [00:53:11] okay we've got one what else can a pike be yeah a spear a spear yeah for the [00:53:14] be yeah a spear a spear yeah for the Dungeons and Dragons crowd yeah there's [00:53:16] Dungeons and Dragons crowd yeah there's a long arm right yep that's another one [00:53:19] a long arm right yep that's another one yeah a road right yes so Pike is used as [00:53:23] yeah a road right yes so Pike is used as a shorthand well a shorthand for a ter [00:53:26] a shorthand well a shorthand for a ter Turn Pike why it's called a Turnpike [00:53:28] Turn Pike why it's called a Turnpike where yeah originally you had you know [00:53:30] where yeah originally you had you know this the spey looking thing um at the [00:53:33] this the spey looking thing um at the start of it as sort of count people okay [00:53:35] start of it as sort of count people okay we've got three other thing meanings for [00:53:37] we've got three other thing meanings for pike yeah is it also a crap like a [00:53:42] pike yeah is it also a crap like a [Music] [00:53:43] [Music] fraternity I'll believe you I can't say [00:53:45] fraternity I'll believe you I can't say I know that one [00:53:48] I know that one um are [00:53:51] um are Pikes sharp as like a 
needle something [00:53:55] Pikes sharp as like a needle something Sharp [00:53:57] Sharp maybe I mean I think it's really the [00:53:59] maybe I mean I think it's really the sort of Pike as the [00:54:02] weapon other scratch your heads um one [00:54:06] weapon other scratch your heads um one that I think a lot of you will have seen [00:54:09] that I think a lot of you will have seen um in diving and swimming you can do a [00:54:13] um in diving and swimming you can do a pike Olympics if you see olympic diving [00:54:17] pike Olympics if you see olympic diving there are Pikes anyone seen [00:54:20] there are Pikes anyone seen those um trust me that's a pike um okay [00:54:25] those um trust me that's a pike um okay um and we've sort of been doing um the [00:54:28] um and we've sort of been doing um the noun uses but you know you can also use [00:54:32] noun uses but you know you can also use Pike as a verb right you know like once [00:54:35] Pike as a verb right you know like once you've got your medieval weapon you can [00:54:37] you've got your medieval weapon you can Pike somebody um and that's a usage of [00:54:41] Pike somebody um and that's a usage of Pike um and you can do other ones right [00:54:44] Pike um and you can do other ones right so uh here we go here's [00:54:47] so uh here we go here's um um ones I got from a dictionary we [00:54:51] um um ones I got from a dictionary we got most of those there are sort of [00:54:53] got most of those there are sort of weirder usages right like coming down [00:54:55] weirder usages right like coming down the pike that's kind of a metaphorical [00:54:57] the pike that's kind of a metaphorical use that comes um from the the road [00:55:01] use that comes um from the the road sense but it sort of ends up meaning the [00:55:03] sense but it sort of ends up meaning the future um yeah um in Australia we also [00:55:07] future um yeah um in Australia we also use Pike to mean um sort of chicken out [00:55:10] use Pike 
[00:55:07] In Australia we also use "pike" to mean sort of chickening out of doing something, but I don't think that usage is really used in the US. Anyway, words have lots of meanings, so how can you deal with that? Well, one way you could deal with it is to say: okay, words have several meanings, so we're going to take instances of words in text, cluster them based on their similarity of occurrence to decide which sense of the word to regard each token as, and then learn word vectors for those token clusters, which are our senses. And you can do that — we did it in 2012, before word2vec came out. So you see here we have bank 1, and somewhere over here we have bank 2, and here we have Jaguar 1, Jaguar 2, Jaguar 3, Jaguar 4, and this really works out great.
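The clustering recipe just described — represent each occurrence of a word by its context, cluster the occurrences, treat each cluster as a sense — can be sketched with toy data. This is a tiny hand-rolled k-means on made-up "context vectors", not the 2012 system:

```python
import numpy as np

def kmeans(points, k=2, iters=20):
    """Tiny k-means; returns one cluster label per point."""
    # Crude farthest-point init so the starting centers are spread out.
    centers = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dists.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy stand-in for "instances of a word in text": each occurrence is a
# vector summarizing its context. Two well-separated blobs play the role
# of the two senses of "bank".
rng = np.random.default_rng(1)
finance_contexts = rng.normal(loc=+3.0, size=(20, 50))
river_contexts = rng.normal(loc=-3.0, size=(20, 50))
contexts = np.vstack([finance_contexts, river_contexts])

labels = kmeans(contexts, k=2)
# Each cluster of occurrences would then get its own "sense vector"
# (e.g. the cluster centroid, standing in for bank 1 vs. bank 2).
```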
[00:56:19] So Jaguar 1 picks out the sense of the kind of car — it's close to "luxury" and "convertible". Jaguar 2 comes right close to "software" and "Microsoft". This one's a bit of a historical one, but when most of you were five or whatever, you might remember Apple used to use large cats for versions of Mac OS — Mac OS 10.3 or something like that, a long time ago, was called Jaguar — so it's software, close to Microsoft. Jaguar 3: "string", "keyboard", "solo", "musical", "drum", "bass" — that's because there's a Jaguar keyboard. And then finally the sense we think of as the basic one, but which actually turns up rather less in text corpora: Jaguar next to "hunter" is the animal. So it's done a good job at learning the different senses. But you know, that's not what's actually usually done these days.
[00:57:28] Instead, what's usually done is that you only have one vector for Jaguar — or for pike here — and when you do that, the one vector you learn is a weighted average of the vectors that you would have learned for the senses. It's often referred to as a superposition, because somehow math people like to use physics terms, but it's a weighted average: you take the relative frequency of the different senses, multiply by the vectors you would have learned if you'd had sense vectors, and that's what you get as the representation of the word as a whole.

[00:58:18] And I can make a sort of linguistic argument as to why you might want to do that. Although this model of words having senses is very longstanding and common — it's essentially the way dictionaries are built, right: you look up a word in the dictionary and it says sense one, sense two, sense three —
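The weighted average just described can be made concrete. A minimal numeric sketch, with made-up sense vectors and frequencies for "pike":

```python
import numpy as np

# Hypothetical sense vectors for "pike" (toy 3-d vectors) and made-up
# corpus frequencies for how often each sense occurs.
v_fish, v_weapon, v_road = np.eye(3)
freqs = np.array([10.0, 5.0, 5.0])

weights = freqs / freqs.sum()             # relative frequency of each sense
senses = np.stack([v_fish, v_weapon, v_road])
v_pike = weights @ senses                 # the "superposition" vector
# v_pike is 0.5*v_fish + 0.25*v_weapon + 0.25*v_road
```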
and you get them for things like "bank" or the "Jaguar" we were talking about — it's really sort of a broken model. Word meanings have a lot of nuance; they're used in a lot of different contexts. There are extreme examples like "bank", where we have the finance bank and the bank of a river, where it seems like the senses are this far apart. But most words have somewhat different meanings that aren't actually that far apart, and trying to cut them into senses seems very artificial. And if you look at different dictionaries and ask how many senses a word has, pretty much every one will give you a different answer.

[00:59:33] So the kind of situation you have is a word like "field". Well, a field can be a place where you grow a crop; it can be used for natural things like a rock field or an ice field; it can be a sporting field; and there's the mathematical sense of field. Now, all of these things sort of have something to do with each other — the math one's further away, but the physical ones are all sort of flat spaces — yet the sense of it being a sporting field is clearly kind of different from the sense of it being an ice field. And are the ice field and the rock field different senses, or am I just modifying? So really what you have is what a math person would call something like a probability density distribution over the things that can be meant by a word. So it maybe makes more sense to use this model where you just say we have one vector that's an average over all the contexts — and we'll see more of that when we get to contextual word vectors later on.
[01:00:49] But one more surprising result on this: since you have the vector for pike overall being the sum of these different sense vectors, standard math would tell you that if you just have the single vector, there's no way you can recover the individual sense vectors. But higher math tells you that these vector spaces are so high-dimensional and sparse that you can use ideas from sparse coding theory to reconstruct the sense vectors out of the whole vector. If you actually want to understand this, some of the people in statistics — David Donoho, I think, is one of them — teach courses on sparse coding theory, but I'm not going to try and teach that. But here's an example from this paper by Arora et al., where one of the authors, Tengyu Ma, is now faculty in computer science here,
[01:02:03] and they start off with a word vector and use sparse coding to divide out sense vectors from the one word vector, and it works pretty well. So here's one sense of "tie", which is the piece of clothing; another sense of "tie", which is ties in a game; this one, I'll admit, is sort of similar to that one, but this sense of "tie" is the tie you put on your electrical cables; and then you have the musical sense of "tie". At least four out of five — they've done a pretty good job of getting senses out of this single word vector by sparse coding. So sparse coding must be cool, if you want to go off and learn more about it.

[01:02:53] Okay, so that's everything I was going to say about word vectors and word senses. Is everyone good — are there any questions? I'll rush ahead for the last two pieces.
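As a footnote to the sparse-coding result above: here is a toy illustration of how components of a superposed vector can be identified when the "sense" atoms are high-dimensional. This is a crude greedy matching pursuit on made-up data — not the algorithm from the Arora et al. paper, where the dictionary of atoms is itself learned:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_atoms = 100, 8

# A dictionary of unit-norm "sense" atoms (random stand-ins).
D = rng.normal(size=(n_atoms, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# A "word vector" that superposes senses 2 and 5 with different weights.
v = 0.7 * D[2] + 0.3 * D[5]

# Greedy matching pursuit: repeatedly take the atom most correlated with
# the residual and strip out its contribution.
residual, support = v.copy(), []
for _ in range(2):
    scores = D @ residual
    best = int(np.abs(scores).argmax())
    support.append(best)
    residual = residual - scores[best] * D[best]
# Because random high-dimensional atoms are nearly orthogonal, the two
# true senses end up in `support`.
```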
[01:03:14] Okay, so I just wanted to start to introduce, in the last 15 minutes, the ideas of how we can build neural classifiers, and how we start to build neural networks in general. In a sense we've already built a very simple neural classifier: our word2vec model is predicting what words are likely to occur in the context of another word, and you can think of that as a classifier. But let's look at a simple classifier like the named entity recognizers I mentioned before. For the named entity recognizer we want to label words with their class: we want to say these two words are a person, but the same words, "Paris" and "Hilton", are then locations in this second sentence. So words can be ambiguous as to what their class is. And the other state is that they're not a named entity at all — they're just some other word. This is something that's used in lots of places as a bit of understanding.

[01:04:24] If you've seen any of those web pages where they've tagged company names with a stock ticker, or where there are links on a page to a Wikipedia page, or something like that — you've got named entities, and commonly, after finding the named entities, you do a second stage of entity linking, where you link the named entity to some canonical form of it, like a Wikipedia page. But we're not going to talk about that second part for the rest of the day.

[01:04:58] So we could say that, building with our word vectors, we've got this simple task: we're going to look at a word in context — because sometimes "Paris" is the name of a person and sometimes it's a location — and we want to look at this word in its context and say: aha, this is the name of a location in this instance.
[01:05:27] So the way we're going to do it is to form a window classifier: we take a word with a couple of words of context on each side, and for the words in our context window we use our word vectors — because we want to show they're useful for something — and then we feed this into a classifier. Our classifier is actually going to be a really simple logistic classifier: we're only going to do location or not-a-location. So for this window here we want to say yes, it's a location, whereas if it had been "I love Paris Hilton greatly", we'd be saying no, because "Paris", the word in the middle of the context, isn't a location then. So that's the idea of classification, or a classifier: we're assigning some set of classes to things.
[01:06:33] In general, for classifiers, we do supervised learning, which means we have some labeled examples — our training data set. We have input items x_i, and for each one we've got a class y_i. So my example training examples were ones like "I love Paris Hilton greatly" — that was negative, not a location — and "I visit Paris every spring" — that's positive, it is a location — where I'm actually classifying the middle word. Okay: inputs and labels. In general the labels come from a set of classes; my set here is simply {location, not-a-location}, but I could get fancier and say I've got five classes — location, person name, company name, drug name, or other, not a name — and be assigning a bunch of different classes. But I'm going to do it with only two, because I'm using this example in next Tuesday's lecture as well, and I want to keep it simple.
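The windowed training setup can be sketched as follows — toy sentence and a hypothetical gold annotation, not the actual assignment data:

```python
PAD = "<pad>"

def windows(tokens, radius=2):
    """Yield a (2*radius + 1)-word window around each token."""
    padded = [PAD] * radius + tokens + [PAD] * radius
    for i in range(len(tokens)):
        yield padded[i : i + 2 * radius + 1]

sent = "I visit Paris every spring".split()
locations = {"Paris"}  # hypothetical gold annotation for this toy sentence

# One training example (x_i, y_i) per token: the window, plus whether
# its CENTER word is a location.
examples = [(w, int(w[2] in locations)) for w in windows(sent)]
# The window centered on "Paris" gets label 1; all the others get 0.
```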
[01:07:47] So that's what we're going to do, and what we're going to be using in our class is neural classifiers. I just wanted to quickly go through some food for thought as we go into it. For a typical stats/machine-learning classifier — you can build classifiers like logistic regression or softmax classifiers, or other ones like support vector machines or naive Bayes, or whatever else you might have seen — the vast majority of these are linear classifiers, meaning that they have a linear decision boundary. When we're learning these classifiers we're learning parameters W, but our inputs are fixed: the inputs are represented by symbols or quantities. So we have fixed inputs, we learn parameters — weights that are used to multiply the inputs — and then we use a linear decision boundary.

[01:08:56] When we have a neural classifier, we're getting some more power. First of all, we're not only learning weights W for our classifier, we're also learning distributed representations for our words: our word vectors re-represent the actual word symbols and can move them around in the space, so that in terms of the original space we've got a nonlinear classifier that can represent much more complex functions. We then use the word vectors to re-represent those words for a final classification, so at the end of our deep network — which we're about to build — we'll have a linear classifier in terms of our re-represented vectors, but not in terms of our original space. Let me try and be concrete about that.
[01:09:58] Okay, so here's what I'm going to use — and we'll use it again next Tuesday — as my little neural network. I start with some words: "museums in Paris are amazing". I first of all come up with the word embedding of those using my word vectors, so now I've got this high-dimensional vector which is just a concatenation of five word vectors — if I have 100-dimensional word vectors, this is 500-dimensional. Then I put it through a neural network layer, which is simply multiplying that vector by a matrix and adding on a bias vector, and then putting it through some nonlinearity, which might be, for example, the logistic function that we've already seen. That gives me a new representation; in particular, if the W is, say, 8 × 500, I'll be reducing it to a much smaller vector.

[01:11:09] Then after that I can multiply my hidden representation — the middle of my neural network — by another vector, and that gives me a score, and I put the score into the logistic function that we saw earlier to say: what's the probability this is a location? So at this point my classifier is a linear classifier in terms of this internal representation used right at the end, but it's a nonlinear classifier in terms of my word vectors.
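A numeric sketch of that forward pass, with random stand-in word vectors and the dimensions just mentioned (five 100-d vectors, an 8-unit hidden layer):

```python
import numpy as np

def sigmoid(z):
    """The logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Concatenate five 100-d word vectors (random stand-ins for the vectors
# of "museums in Paris are amazing") into one 500-d input.
x = rng.normal(size=5 * 100)

W = rng.normal(size=(8, 500)) * 0.05   # layer weights: 500-d -> 8-d
b = np.zeros(8)                        # bias vector
u = rng.normal(size=8)                 # scoring vector

h = sigmoid(W @ x + b)                 # 8-d hidden representation
score = u @ h                          # a single real-valued score
p_location = sigmoid(score)            # probability the center word is a location
```

The classifier `u @ h` is linear in `h`, the re-represented input, but composing it with the nonlinearity and the learned embeddings makes it nonlinear in the original words.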
[01:11:53] Okay, great. Here's one other thing — just a note for later, since you'll want to know this when we start doing the next assignments. Up until now I've presented everything as doing log likelihood and negative log likelihood for building our models. Very soon now, in assignment two, we're going to be starting to do things with PyTorch, and when you start working out your losses with PyTorch, what you're going to want to use is cross-entropy loss. So let me quickly say what cross-entropy loss is. Cross entropy comes from information theory: if you have a true probability distribution p and you're computing a probability distribution q, your cross-entropy loss is H(p, q) = −∑_c p(c) log q(c) — the expectation, under your true probability distribution, of the log of your model probability. But there's a special case: if you have ground truth (or gold, or target) data where things are labeled one/zero — like in my "I love Paris" examples, where I'm just labeling it probability one for location and probability zero for not-a-location — then, since you're labeling the right class with probability one, every other term in this summation goes to zero, and the only thing you're left with is what log probability your model gives to the right class.
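That reduction is easy to check numerically with toy distributions:

```python
import numpy as np

# Model distribution q over three classes, and a one-hot "ground truth"
# distribution p saying the correct class is class 1.
q = np.array([0.2, 0.7, 0.1])
p = np.array([0.0, 1.0, 0.0])

cross_entropy = -np.sum(p * np.log(q))
# With one-hot p every term but one vanishes, so the cross entropy is
# exactly -log q[1]: the negative log probability of the right class.
```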
zero, [01:13:40] and the only thing you're left with is: what probability, what log probability, is my model giving to the right class? And so that then is your log likelihood, which we can use for the negative log likelihood. A little bit of a complication here: just remember that you want to use cross-entropy loss in PyTorch when building the model. Okay, before we end today, here is my obligatory one picture of human neurons; don't miss it, because I'm not going to show any more of these. Okay, these are human neurons, and human neurons were the inspiration for neural networks. Human neurons have a single output, which comes down this axon, and then these outputs feed into other neurons (I guess I don't really have an example here), but in general one output can feed into multiple different neurons.
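To make the cross-entropy discussion above concrete, here is a minimal sketch in plain Python (the logits and the two-class location/not-location setup are made up for illustration; in the assignment itself you would use PyTorch's built-in cross-entropy loss):

```python
import math

def softmax(logits):
    """Turn raw scores into a model probability distribution q."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i): the expectation, under the true
    distribution p, of the negative log model probability."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one window, two classes: [not-location, location].
q = softmax([0.5, 2.0])
p = [0.0, 1.0]          # one-hot ground truth: the word is a location

# With a one-hot p, every other term in the sum goes to zero, so the
# cross entropy is just the negative log probability of the right class:
assert abs(cross_entropy(p, q) - (-math.log(q[1]))) < 1e-12
print(cross_entropy(p, q))
```

In PyTorch, `torch.nn.functional.cross_entropy` computes the same quantity directly from logits and a gold class index, combining the softmax and the negative log likelihood in one call.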
[01:14:55] You can see the different things hanging into it: you have the output connecting to the input, and where you make this connection, that's the synapses that people talk about. And so one neuron will normally have many, many inputs, where it picks things up from other neurons; they all go into the nucleus of the cell, and the nucleus combines together all those inputs. And kind of what happens is, if there's enough positive activation from all of these inputs, it then sends signals down its output. Now, strictly, how neurons work is that they send spikes, so the level of activation of a neuron is its rate of spiking, but that immediately got turned, in artificial neural networks, into just a real value for what its level of activation is. And so this was kind of the genuine inspiration for all of our neural networks. So a binary
logistic regression is kind of a bit [01:16:05] similar to a neuron, right? It has multiple inputs; you're working out your total level of excitation, where in particular you can have inputs that are exciting (positive inputs) and inputs that are negative, which are then inhibitory inputs. You combine them all together and you get an output that's your level of excitation, and you're then converting that through some nonlinearity. And so this was proposed as a very simple model of human neurons. Now, human neurons are way more complex than this, and some people, like neuroscientists, think we maybe should be doing a better model of actual human neurons, but in terms of what's being done in the current neural-networks-eat-the-world revolution, everyone's forgotten about that and is just sticking with this very, very simple model, which conveniently turns
into linear algebra in a very [01:17:11] simple way. So this gives us a single neuron, but made precise: this single neuron, if you use the logistic function, is identical to logistic regression, which you've probably seen in some stats class or somewhere. But the difference is that for neural networks we don't just have one logistic regression; we have a bunch of logistic regressions at once. And, well, that would be tricky if we had to define what each of these logistic regressions was calculating, but what we do is we just feed them into another logistic regression, and so we have some eventual output that we want to be something like, we want it to say, you know, this is or isn't a location. But then what will happen is, by our machine learning, these intermediate logistic regressions will figure out all by themselves something
useful to do, that's the magic, [01:18:22] right, so that you get this sort of self-learning property where the model has a lot of parameters and internally will work out useful things to do. So in general we can get more magic by having more layers in the neural network, and with that we will build up functions: effectively, these intermediate layers let us learn a model that represents the input data in ways that will make it easier to classify, or easier to interpret and do things with downstream in our neural network. And it's time, so I should stop there. Thank you.

================================================================================ LECTURE 003 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 3 - Backpropagation, Neural Network
Source: https://www.youtube.com/watch?v=HnliVHU2g9U
--- Transcript

[00:00:05] Okay, hi everyone, I'll get started. Okay, so it's Tuesday of week two, so hopefully that means everyone has done assignment one. Everyone done assignment one? You know, if I'm saying
this, I'm probably saying it to the [00:00:23] wrong people, but it seems like every year some people blow some of their late days on assignment one, and it's really just the wrong place to use them. So yeah, hopefully you've all done assignment one, and note that this is meant to be the easy on-ramp, and then we go straight on from that. So out today we have assignment two. Assignment two has two purposes. Purpose one is to make you do some math, to gain some understanding of what neural networks really compute and how they compute it, and that's what I'm going to talk about today, also going through that math. But then, simultaneously, maybe it does three things: in assignment two we're also going to be learning something about dependency parsing, which will be actually something about language structure and linguistics, but
then thirdly, for [00:01:27] assignment two we're going to start using PyTorch. So PyTorch is one of the leading software frameworks for deep learning, and the one that we're going to use for this class. So for the assignment, the PyTorch is exceedingly scaffolded; it's sort of, you know, here's this thing and you have to write these two lines, use these two functions. But nevertheless, to help people get up to speed and get started using PyTorch, on Friday at 3:30 in Gates B01, and it will again be recorded, we have a tutorial on PyTorch, and that's a great way to get more of a sense of PyTorch and how it works before doing assignment two. Yeah, the other things: for nearly all the lectures we've got further reading of places that you can look, and of all the classes in the entire quarter, this, for many people, might be a really good one to look at the suggested
to look at the suggested readings we have several readings which [00:02:39] readings we have several readings which are sort of shorter tutorials and [00:02:42] are sort of shorter tutorials and reviews of the kind of um Matrix [00:02:45] reviews of the kind of um Matrix calculus um and linear algebra that we [00:02:48] calculus um and linear algebra that we need for this class um so really [00:02:51] need for this class um so really encourage you um to look at those um if [00:02:54] encourage you um to look at those um if you decide that one is your favorite you [00:02:56] you decide that one is your favorite you can tell us on Ed which one you think is [00:02:58] can tell us on Ed which one you think is the best one to choose between between [00:03:00] the best one to choose between between them I kind of like the one that's first [00:03:02] them I kind of like the one that's first on the list but maybe you'll feel [00:03:03] on the list but maybe you'll feel differently um yeah um conversely um [00:03:08] differently um yeah um conversely um yeah so today will be sort of all math [00:03:11] yeah so today will be sort of all math and then Thursday will be kind of all [00:03:15] and then Thursday will be kind of all language and Linguistics some people [00:03:17] language and Linguistics some people find the language and Linguistics hard [00:03:18] find the language and Linguistics hard as well um so I guess different kinds of [00:03:21] as well um so I guess different kinds of people um okay so getting straight into [00:03:25] people um okay so getting straight into it um so where we started last time um [00:03:30] it um so where we started last time um I'd sort of shown these baby neural [00:03:32] I'd sort of shown these baby neural networks and sort of said well we can [00:03:34] networks and sort of said well we can think of each of those orange things as [00:03:37] think of each of those orange things as basically like a little logistic [00:03:39] basically 
regression unit, and the crucial [00:03:42] difference from the kind of statistics and machine learning you see in a stats class, 109 or wherever, is that in those you have one logistic regression: you're defining the input features to it, and you've got some decision variable that you want to have at the output. Here you're sort of building these cascades of little logistic regressions, and so the idea is that right at the end we're going to define what we want; we're going to capture that by our objective function or loss function. But the stuff in the middle is going to be a chance for the neural network to learn by itself what would be useful inputs to further downstream neurons: what kind of functions should I come up with, in terms of my inputs, that will help me provide useful outputs to help the final computation down the track.
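That cascade of little logistic regressions can be sketched numerically; this is an illustrative toy with hypothetical, randomly initialized weights (nothing here is the lecture's actual model): a few logistic units read the input features, and one more logistic unit reads their outputs to make the final location decision.

```python
import numpy as np

def sigmoid(z):
    # the logistic function: squashes total excitation into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input features for one window

# "a bunch of logistic regressions at once": three intermediate units,
# each with its own (hypothetical, randomly initialized) weights and bias
W = rng.normal(size=(3, 4))
b = np.zeros(3)
h = sigmoid(W @ x + b)        # what the intermediate units compute

# ...and their outputs feed into one more logistic regression that
# produces the final is-this-a-location probability
u = rng.normal(size=3)
p_location = sigmoid(u @ h)

print(h, p_location)
```

Nothing tells the intermediate units what to compute; in training, gradients from the loss at the final output are what push them toward computing something useful.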
And you [00:04:49] know, if you haven't sort of seen and thought about this much before, I think it's worth sitting with that idea for a moment, because this is really a super powerful idea, which is what's made neural networks more powerful in most circumstances than other forms of machine learning: the fact that you have this self-organization of intermediate levels of representation that you use to compute things that will be useful downstream for what you eventually want to do. The other reason I was bringing back up this picture is that I wanted to go straight from here to matrices. So while you could sort of wire together neurons however you wanted to, and arguably if you look at human brains they look more like neurons wired together however you wanted to, for what's done with neural networks there's basically always this
kind of regular structure of layers. [00:05:51] So once we have this regular structure of layers, we are taking the outputs of our neurons at one layer and we're feeding them together, with weights, to produce the inputs to the next layer. So we're taking the x1, x2, x3 outputs, we're multiplying them all by weights, we're adding a bias term, and then we're going to put it through a nonlinearity, and that will give us the value at the next layer. So if we then kind of collapse that to a vector, and this to a vector, that collapses into a computation where first of all we're doing a matrix multiplication, we're calculating Wx of the inputs, and then we're adding on the biases as a vector of biases, which gives us this intermediate value z, and then we have this nonlinearity, or activation function, which is applied to that, which gives us the values in the next layer of the neural
network. [00:07:01] And the activation function is applied to a vector and produces a vector, but it's operating on each of the individual components of that vector, one at a time; we've got some scalar function that we're just applying to each element of the vector. And so that's the kind of picture we saw when I did this example, and I'm going to continue to use this example in today's class. Remember, we were going to decide whether the word in the middle of the input window was a location or not, and so we were doing the matrix multiplication, putting it through the nonlinearity, we're then just doing a dot product here, and then that got stuck into a sigmoid to predict yes or no. And the final thing I wanted to say a little bit about is these f's, the nonlinearity or the activation function, and where did they come in?
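The layer computation just described (matrix multiply, add the bias vector, apply the nonlinearity elementwise) can be sketched like this; the layer sizes here are arbitrary choices for illustration:

```python
import numpy as np

def f(z):
    # elementwise nonlinearity: some scalar function applied to each element
    return np.tanh(z)

rng = np.random.default_rng(1)
x = rng.normal(size=3)       # outputs x1, x2, x3 of the previous layer
W = rng.normal(size=(4, 3))  # one row of weights per unit in the next layer
b = rng.normal(size=4)       # the vector of biases

z = W @ x + b                # matrix multiplication plus biases: intermediate value z
h = f(z)                     # activation function applied to the vector...

# ...but operating on each individual component, one at a time:
assert all(h[i] == f(z[i]) for i in range(4))
print(h.shape)               # (4,): the values at the next layer
```

The same two lines, `z = W @ x + b` and `h = f(z)`, are the whole forward computation for one layer, which is why the layered structure turns into linear algebra so conveniently.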
Well, the [00:08:04] starting point of where they came in, in the history of neural networks, is when people came up with this idea that you could represent the operation of a basic neuron by doing a matrix multiplication of the inputs and then having a bias term, or here a threshold term, to see whether the neuron should fire or not. That was actually, in the very first implementation, which dates back to the 1940s, done as a threshold: if the activation was greater than theta, you output one; otherwise you output zero. And, well, if you have a threshold, the two lines are flat, so there is no slope, there is no gradient, and that actually makes learning much harder. So the whole secret of what we build with neural networks, and an alternative name that's popular in some circles these days, is gradient-based learning, and the entire idea of gradient-based learning
is that if we [00:09:17] actually have some slopes, then it's like going skiing during spring break: you can work out where it's steeper and you can head down where it's steeper, and that will allow us to optimize our function and learn much more quickly. And so that's one reason that we don't just want to have threshold units: we want to have things with slopes, so we have gradients. So in subsequent work people started using activation functions with slopes, and the first popular one was this sigmoidal logistic that we've seen for mapping to probabilities. But, you know, it seemed sort of imperfect, because the output was always non-negative, so that sort of tends to push things towards bigger numbers. So there was then quite a bit of use of this tanh function, and you'll actually see tanh when we do assignment three; we'll be using
tanhs in our recurrent neural networks. [00:10:27] And so I've written there the formula usually given for tanh in terms of exponentials. Yeah, if your math is rusty, it's not obvious that tanh and the logistic have much to do with each other, but if you want to treat this as a math problem, a tanh is literally just a rescaled logistic: you're stretching it by two and moving it down by one; it's the same function. Okay, so that's nice, but if you're calculating tanhs you have to do all of these exponentials, and, you know, exponentials are kind of slow on your computer, and things like that, so you might wonder whether you couldn't get away with something much cheaper. And so people thought about that and thought, oh, maybe we could just use a so-called hard tanh, where it has a slope of one in the middle and is then just flat outside that area.
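That rescaling claim is easy to verify numerically: tanh(x) = 2 * logistic(2x) - 1, i.e. a logistic stretched by two and shifted down by one. A quick check using the standard formulas:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_exponentials(x):
    # the formula usually given for tanh in terms of exponentials
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    # rescaled logistic: stretch by two, move down by one
    stretched = 2.0 * logistic(2.0 * x) - 1.0
    assert abs(tanh_via_exponentials(x) - stretched) < 1e-12
    assert abs(math.tanh(x) - stretched) < 1e-12
print("tanh is a rescaled logistic")
```

The identity follows by multiplying the logistic's numerator and denominator through by e^x, so the agreement above is exact up to floating-point rounding.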
And you know, that [00:11:28] seemed to work in many cases, and so that then led to the popularity of the rectified linear unit. So the rectified linear unit is simply zero on the negative region and then is y = x in the positive region. Now, this seems kind of wonky and goes against what I was saying about gradient-based learning, because once you're in the negative region there's no gradient, you're just dead. But in the positive region there is gradient, and the gradient is particularly simple: the slope is always one. And so, you know, this still feels slightly perverse to me, but this really became the norm of what people used for a number of years, because people found that although an individual neuron was dead half the time, any time it went negative, overall for your neural network some things would be alive, so it kind of gave sort of a form of
specialization. And the fact that the slope was always one [00:12:32] meant that you got really easy, productive backward flow of gradients, in a way we'll talk about later, and so learning with ReLU turned out to be very effective, and people started using the ReLU nonlinearity everywhere, and it sort of became the default and the norm. You'll see us using it in the assignments; in particular we use it in assignment two, and so you get to see that it works. But nevertheless, at some point people sort of had second thoughts and decided, you know, having a unit dead over half of its range maybe isn't such a good idea after all, even though it seemed to work great for a few years. And so a lot of what's happened since then is to come up with other functions which are in some sense ReLU-like but not actually dead.
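For reference, here are sketches of ReLU and the standard definitions of some "ReLU-like but not dead" variants the lecture goes on to name; the 0.01 leaky slope is just a common default, and the Swish and GELU formulas below are the usual textbook ones, not taken from the slides:

```python
import math

def relu(x):
    # zero on the negative region, y = x on the positive region
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # the negative half is a straight line too, with a very minor slope,
    # so the gradient is never exactly zero (parametric ReLU learns this slope)
    return x if x > 0.0 else negative_slope * x

def swish(x):
    # Swish: x times the logistic sigmoid of x
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # GELU: x times the Gaussian CDF of x; approximately y = x for positive
    # inputs, with a small curve below zero instead of a hard cutoff
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(relu(-3.0), relu(2.0))   # 0.0 2.0
print(leaky_relu(-3.0))        # a small negative value instead of a dead zero
```

All four agree that large positive inputs pass through essentially unchanged; they differ only in how they treat the negative region.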
[00:13:36] enough so one one version of that is the socalled Leaky value so for the Leaky [00:13:39] socalled Leaky value so for the Leaky value you make the negative half a [00:13:42] value you make the negative half a straight line as well with a very minor [00:13:44] straight line as well with a very minor slope but still it's got a little bit of [00:13:46] slope but still it's got a little bit of slope um there is then a variant of that [00:13:48] slope um there is then a variant of that called the parametric value where you [00:13:51] called the parametric value where you have one extra parameter which is [00:13:53] have one extra parameter which is actually what the slope of the ne the [00:13:55] actually what the slope of the ne the negative part is and people showed some [00:13:58] negative part is and people showed some positive result with that um more [00:14:00] positive result with that um more recently again and this is what you [00:14:03] recently again and this is what you often see in recent Transformer models [00:14:06] often see in recent Transformer models um is you see um nonlinearities like [00:14:10] um is you see um nonlinearities like Swiss swis and Jello so both of these [00:14:14] Swiss swis and Jello so both of these are sort of fancy functions but kind of [00:14:17] are sort of fancy functions but kind of what they both look like is basically [00:14:20] what they both look like is basically this is yal X to all intense and [00:14:23] this is yal X to all intense and purposes not quite but approximately and [00:14:25] purposes not quite but approximately and then you got sort of some funky bit of [00:14:27] then you got sort of some funky bit of curve down here which again gives you a [00:14:29] curve down here which again gives you a bit of slope um it's sort of the curve [00:14:31] bit of slope um it's sort of the curve is going the opposite way that's sort of [00:14:33] is going the opposite way that's sort of a bit funny but they seem 
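None of this appears as code in the lecture, but a minimal NumPy sketch of the activations just described might look like the following (the GELU here uses the common tanh approximation; the 0.044715 constant is from that standard approximation, not from the lecture):

```python
import numpy as np

def relu(x):
    # zero on the negative region, y = x on the positive region
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # negative half is a straight line with a very minor slope
    return np.where(x > 0, x, slope * x)

def gelu(x):
    # tanh approximation of GELU: roughly y = x for large positive x,
    # with a small "funky bit of curve" near zero
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives clamped to zero
print(leaky_relu(x))  # negatives scaled by 0.01
print(gelu(x))        # smooth, ReLU-like
```

The slope parameter of `leaky_relu` is exactly the one extra parameter that the parametric ReLU learns rather than fixes.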
[00:14:39] So, you know, that's a bit of a dump of all the nonlinearities people use. The details of that aren't super important right now, but the important thing to have in your head is why we need nonlinearities, and the way to think about that is that what we're doing with neural networks is function approximation. There's some very complex function that we want to learn, you know, like maybe we want to go from a piece of text to its meaning, or we want to be interpreting visual scenes, or something like that. And so we want to build really good function approximators. Well, if you're just doing matrix multiplies, a matrix multiply of a vector is a linear transform, so that doesn't let you model complex functions. I guess strictly, if you put a bias on the end, it's then an affine transform, but let's keep it simple: linear transforms. So if you're doing multiple matrix multiplies, you're doing multiple linear transforms, but they compose, so you could have just multiplied those two matrices together and you'd have a single linear transform. So you get no power in terms of representation by having multi-layer networks that are just matrix multiplies. As a little aside: in terms of representational power, having multi-layer matrix multiplies gives you no power, but if you think about it in terms of learning, it actually does give you some power. So in the theoretical community looking at neural networks, there are actually quite a few papers that look at linear neural networks, meaning that they're just sequences of matrix multiplies with no nonlinearities, because they have interesting learning properties even though they give you no representational power.
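The point that stacked matrix multiplies collapse into a single linear transform is easy to verify numerically; a small sketch (my own example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))  # first "layer"
W2 = rng.standard_normal((2, 4))  # second "layer"
x = rng.standard_normal(3)

# two layers of pure matrix multiplies...
two_layer = W2 @ (W1 @ x)
# ...are the same map as the single matrix W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True: no extra representational power
```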
[00:16:35] Okay, but we'd like to be able to learn functions like this, not only functions like this, and to be able to learn functions like this we need more than linear transforms. We achieve that by having something that makes us calculate a nonlinear function, and it's these activation functions that give us nonlinear functions. Okay, cool. So then, getting on to today: the whole thing we want to do now is gradient-based learning. This is our stochastic gradient descent equation, where, you know, that upside-down triangle symbol, that's our gradient; we're wanting to work out the slope of our objective function, and this is how we're going to learn, by calculating gradients. So what we want to know is how do we calculate the gradients for an arbitrary function, and what I want to do today is first of all do this by hand, with math, and then discuss how we do it computationally, which is effectively the famous thing that's taken as underpinning all of neural nets, the backpropagation algorithm. But the backpropagation algorithm is just automating the math. And so for the math, it's matrix calculus, and at this point there's a huge spectrum between people who know much more math than me and people who barely ever learned this. But, you know, I hope to explain the essentials, or remind people of them, enough that you're at least at a starting point for reading some other stuff and doing homework two. So let's get into that. I'm going to spend about half the time on each of those two halves, and the hope is that after this you'll feel like, oh, I actually understand how neural networks work under the hood, fingers crossed.
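As an illustration of the stochastic gradient descent update he's pointing at (theta ← theta − alpha · ∇J(theta)), here is a toy sketch on an objective J(theta) = theta² that I've picked for illustration; it is not the lecture's example:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # new parameters = old parameters minus learning rate times the gradient
    return theta - lr * grad

# minimize J(theta) = theta^2, whose gradient is 2 * theta
theta = np.array(3.0)
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)
print(theta)  # close to 0, the minimum of J
```

Everything that follows in the lecture is about how to compute that gradient term for functions far more complicated than theta².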
[00:18:43] Okay, so here we go. If you're a Stanford student, you maybe did Math 51, or else you could have done Math 51, which teaches linear algebra, multivariate calculus, and modern applications. Math 51 covers everything I'm going to talk about and way more, so if you actually know that and remember it, you can look at Instagram for the next 35 minutes. But I think the problem is, quite apart from the fact that a lot of people take it as frosh, that this is a lot to get through in 10 weeks, and I think a lot of the people who do this class don't, two years later, really have much ability to use any of it. But, you know, if you actually looked at this book really hard for a really long time, you would have discovered that right towards the end of the book, in appendix G, there's actually an appendix on neural networks and the multivariable chain rule, which is precisely what we're going to be using for doing our neural networks. But there are only two problems. One problem is that this is page 697 of the book, and I'm not sure anyone ever gets that far. And the other problem is, even if you do get that far, I find that these are really dense, texty pages; it's not even easy to understand them if you have gone there. So here's my attempt at that. The mantra to have in your head is: gee, if I can remember basic single-variable calculus, you know, that if I've got 3x² then the derivative of that is 6x, that's all you sort of need to know. Multivariable calculus is just like single-variable calculus, except you're using matrices. Okay, so that's our article of faith, and we're going to do that.
[00:20:41] So what we're wanting to do is matrix calculus, or the generalization of that, tensor calculus, using vectors, matrices, and higher-order tensors, because if we can do things with what's referred to as vectorized gradients in the neural network world, that will be the fast, efficient way to do our operations. You know, if you want to think it all through, you can do it a single variable at a time and check that you're doing the right thing, and I sort of tried to indicate that in the first lecture, but if we want to have our networks go vroom, we want to be doing matrix calculus. Okay, so let's work up to doing that. This is the part that I trust everyone can remember: we have f(x) = x³, and we can take the single-variable derivative, and the derivative is 3x². Everyone remember that one? Okay, that's something we can all start from. And remember, this derivative is giving the slope of things, right? So the slope lets us work out where something is steep, so we'll be able to go skiing; that's our goal. And you can think of the slope as how much the output will change if we change the input a bit; that's our measure of steepness. So since the derivative is 3x², if we're at x = 1 that means the slope is about 3 · 1² = 3, so if I work out the value of the function at 1.01, it's gone up by about three times the step: I moved x by 0.01 and the output moved by about 0.03. Whereas if I go to x = 4, the derivative is 3 · 4² = 48, and so if I work out the value of the function at 4.01 I get approximately 64.48 versus 64. That small difference from 4 to 4.01 has been magnified 48 times in the output. Okay, so now we just remember the mantra: it's going to be exactly the same single-variable calculus, but with more stuff.
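The slopes just quoted can be checked with finite differences; a quick sketch of those numbers:

```python
f = lambda x: x ** 3
df = lambda x: 3 * x ** 2  # the derivative from single-variable calculus

# slope at x = 1: nudging the input by 0.01 moves the output by about 0.03
print(f(1.01) - f(1.0))   # ~0.0303
print(df(1.0) * 0.01)     # 0.03

# slope at x = 4: the same 0.01 nudge is magnified about 48 times
print(f(4.01) - f(4.0))   # ~0.4812
print(df(4.0) * 0.01)     # 0.48
```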
[00:23:08] So if we have a function with n inputs, we're then going to work out its gradient, which is its partial derivative with respect to each input, so its gradient will now be a vector of the same size as the number of inputs. And there's this funky symbol ∂, which people pronounce various ways; I mean, this kind of originated as someone's weird way of drawing a calligraphic d, so it really is a d. I think I'll mainly just call it d, but sometimes people call it "partial" or "funky d" or some other name. So you have ∂f/∂x₁, ∂f/∂x₂, and so on for each of the variables. Okay, so if we go beyond that and have a function with n inputs and m outputs, what we then get for the gradient is what's referred to as the Jacobian. Now actually, the dude this is named after was a German Jew, so it should really be "Yakobi", but no one says that in this country: Jacobian. Okay, so the Jacobian is then a matrix of partial derivatives, where you're working out, for each output and each input, the partial derivative between that component of the input and the output. So this looks like the kind of thing that we're going to have when we have a neural network layer, because we're going to have n inputs and m outputs for the layers of our neural networks, so we'll be using these kinds of Jacobians. Okay, so then the whole idea of neural networks is we've got these multi-level computations, and they're going to correspond to composition of functions, so we need to know how to compose things, both for calculating functions and for calculating their gradients. So if we have a one-variable function and we want to work out its derivative in terms of a composition of two functions, what we're doing is multiplying derivatives. Okay, so if you compose together z of y... that's the function that we did at the beginning... oh, no, it's not, sorry, it's different. Okay, z of y gives you 3x², and we know that the derivative of that is 6x. If we do it in terms of the pieces, we can work out dz/dy, which is just going to be 3, and dy/dx, which is 2x, and we can work out the total derivative by multiplying these two pieces, and we get 6x, the same answer. So matrix calculus is exactly like single-variable calculus, except we're using tensors of different dimensions. So the word "tensor" is used to mean, as you go up that spectrum in size, from scalar to vector to matrix to what in computer science are normally still called multidimensional arrays; that spectrum is tensors of different dimensions.
[00:26:54] Okay, so when we have multiple-variable functions, we're going to multiply Jacobians. So here we have a function Wx + b, and then we compose the nonlinearity f with it to get h, and we're going to be able to compute the derivative in the same way, as a product of those partial derivatives, which are Jacobians. Okay, so let's start looking at a few examples of what we get. Let's start with an element-wise activation function. When we have a vector that's being calculated as the activation function of a previously computed quantity, we're computing that component-wise, as I explained before, so h_i = f(z_i), where f is our activation function, which actually applies to a scalar. But overall this layer is a function with n outputs and n inputs, and so it's going to have an n × n Jacobian. This is our definition of the Jacobian, but in this case it's sort of a special case, because if i = j, then the output h_j depends on z_j, and otherwise the entry is going to be zero, because for the off-diagonal entries it doesn't matter how you change the value; it's not changing the output, because each output only depends on the corresponding index. And so what we're going to get for this Jacobian of an activation function is a matrix where everything is zero apart from the diagonal terms, which correspond to where we're calculating the activation function, and for those ones we're going to have to work out how to compute the derivative of our activation function.
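A sketch of this diagonal Jacobian, using tanh as the element-wise activation (so as not to give away the logistic derivative that the assignment asks for; tanh′(z) = 1 − tanh²(z) is a standard result), checked against finite differences:

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
n = z.size

# Jacobian of h = f(z) applied element-wise, with f = tanh:
# zero everywhere except the diagonal, which holds f'(z_i)
jacobian = np.diag(1.0 - np.tanh(z) ** 2)

# numerical check: perturb each input component one at a time
eps = 1e-6
numeric = np.zeros((n, n))
for j in range(n):
    dz = np.zeros(n)
    dz[j] = eps
    numeric[:, j] = (np.tanh(z + dz) - np.tanh(z)) / eps

print(np.allclose(jacobian, numeric, atol=1e-4))  # True
```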
[00:29:05] That was one of the questions on assignment one, I do believe... or was it on assignment two? No, it's assignment two. One of the questions on the new assignment says, hey, can you work out the derivative of a logistic function? And then we'd be able to plug that straight in for f′, so I'm not going to give that answer away today. Okay, so other things that we want to do with Jacobians: we have this layer of our neural network where we're calculating Wx + b, and we want to work out the partial derivative of that with respect to x. You know, this is the kind of place where it actually works to remember the mantra and say matrix calculus is just like single-variable calculus but with matrices. So if you just don't use your brain too hard and think, oh, it's just like single-variable calculus, what should the answer be? It's obviously going to be W, and indeed it is. Similarly, if we want to take Wx + b and work out the partial derivative with respect to b, well, that would be 1 in terms of single-variable calculus, and so in matrix calculus that becomes an identity matrix. Slightly different, same idea, but that's reflecting the fact that b is actually a vector, so we need it to come out as an identity matrix. Okay, so higher up in my example picture I did this sort of vector dot product with uᵀ, and, well, what happens if we work out the Jacobian of that? What we end up with, strictly, is hᵀ. And this is sort of like what we did in the first class, when we did a dot product calculation: for each individual element you get the opposite term, and so you get the other vector coming out.
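These three Jacobians, ∂(Wx+b)/∂x = W, ∂(Wx+b)/∂b = I, and ∂(uᵀh)/∂u = hᵀ, are exactly the ones he suggests checking; a finite-difference sketch with made-up shapes (my own example):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
u = rng.standard_normal(3)
h = rng.standard_normal(3)
eps = 1e-6

# d(Wx + b)/dx should be W (a 3x4 Jacobian)
J_x = np.zeros((3, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J_x[:, j] = ((W @ (x + dx) + b) - (W @ x + b)) / eps
print(np.allclose(J_x, W, atol=1e-4))          # True

# d(Wx + b)/db should be the identity matrix
J_b = np.zeros((3, 3))
for j in range(3):
    db = np.zeros(3)
    db[j] = eps
    J_b[:, j] = ((W @ x + (b + db)) - (W @ x + b)) / eps
print(np.allclose(J_b, np.eye(3), atol=1e-4))  # True

# d(u^T h)/du: each partial is the matching element of h,
# so the Jacobian (a row vector) is h^T, "the other vector"
g = np.zeros(3)
for j in range(3):
    du = np.zeros(3)
    du[j] = eps
    g[j] = ((u + du) @ h - u @ h) / eps
print(np.allclose(g, h, atol=1e-4))            # True
```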
[00:31:21] These are sort of good ones to compute at home for practice, to make sure you really do know the answers and why they work out the way they do.

[00:31:30] Okay, so let's go back to our little neural net. This was most of our neural net up above — there was the nonlinearity. Now I'm going to leave that out this time — oh, see, I got it wrong, it's on assignment two. You know, normally you'd be calculating the partials of the output, the loss function, with respect to the inputs, but since the loss function is on assignment two, I'm just going to calculate derivatives with respect to this score that feeds into the loss function. So first we've got the neural network layer, the nonlinearity, and then we're doing this dot product to work out a score for each position, which feeds into the logistic function.

[00:32:24] So if you want to work out ds/db — that's with respect to the bias first — the way we do it is, you know, we break up our equations into the individual pieces that are composed together. So we first calculate z = Wx + b, then we apply the activation function to the different components. Okay, then after that, to work out our partial derivatives of s with respect to b, what we're going to be doing is taking the product of the partial derivatives of the component pieces — we're applying the matrix calculus version of the chain rule. So ds/db = ds/dh · dh/dz · dz/db, which corresponds to these three layers that are composed together. And so at that point we remember our useful Jacobians from the previous slide, and we can just apply them.

[00:33:45] So the top one, ds/dh, is u^T — or else maybe it's u; let's come back to that, there's a fine point there that I'll explain more about later. Okay, then for dh/dz, that was the activation function, where we got the diagonal of the derivative of f(z). And then for dz/db, that's where we got the identity matrix. Okay, so we can simplify that down, and what that's going to end up as is u^T times — that funny symbol there — the elementwise derivative of f. That symbol, which doesn't normally turn up in your regular math course but turns up all the time in neural networks, is referred to as the Hadamard product, and the Hadamard product means elementwise multiplication. So it's not like a dot product, where you put two vectors together and get out one number, a scalar: you put two vectors together, you elementwise multiply them, and you're left with another vector of the same type.
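The Hadamard product is just `*` on NumPy arrays. A minimal sketch of the simplification ds/db = u^T ∘ f′(z), checked against finite differences — the sigmoid here is my own stand-in for the nonlinearity, and the sizes are illustrative:

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid standing in for f
df = lambda z: f(z) * (1.0 - f(z))        # its elementwise derivative

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

score = lambda b: u @ f(W @ x + b)        # s as a function of the bias

# The simplified chain rule: ds/db = u (Hadamard) f'(z); `*` is elementwise.
grad_b = u * df(W @ x + b)

# Finite-difference check, one component at a time.
eps = 1e-6
fd = np.array([(score(b + eps * np.eye(n)[i]) - score(b)) / eps
               for i in range(n)])
assert np.allclose(grad_b, fd, atol=1e-4)
```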
[00:35:01] Okay, so now this gave us our working-out of the partials ds/db, and for a neural network we want to work out all the other partials as well. So overall, here in the picture, right, we had the x, the W, the b, and the u, and we'd like to work out partials with respect to all of those variables, so we can change their values and learn, so that our model predicts better.

[00:35:39] So suppose we now want to calculate ds/dW. Again we can split it up with the same chain rule and say ds/dW equals the product of these three things, and the important thing to notice is that two of those three things are exactly the same ones that we calculated before; the only bit that's different is that at the end we're now doing dz/dW rather than dz/db. And so the first central idea, that we'll come back to when we do computation graphs, is: oh, we really want to avoid doing repeated work. So we want to realize that those two parts are the same, and since we're just sort of multiplying these partial derivatives together, right, we can compute what that part is once and reuse it.

[00:36:36] So if we're wanting to calculate ds/dW, the part that's the same — this part here — we can refer to as delta. So delta is sort of the upstream gradient, or the error signal: the part that you've got from starting at the beginning, ds/dh · dh/dz. This shared upstream part we can calculate once, and then we can use it to calculate both of these two things. And for ds/db, because the dz/db just comes out as the identity matrix, the answer is just delta; but for ds/dW, we need to work out the dz/dW before we're finished.
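In code, the shared upstream part delta = u ∘ f′(z) is computed exactly once and reused — a sketch (sigmoid and sizes again chosen arbitrarily for illustration):

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))   # placeholder nonlinearity
df = lambda z: f(z) * (1.0 - f(z))

rng = np.random.default_rng(2)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

z = W @ x + b
h = f(z)
s = u @ h

# The shared upstream gradient ds/dh * dh/dz, computed once.
delta = u * df(z)

# dz/db is the identity, so ds/db is just delta -- no extra work needed.
grad_b = delta

assert grad_b.shape == b.shape
```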
[00:37:29] Okay, so what do we get for that last piece? So one question you might start off with — and it's normally a good thing to think about when you're doing assignment problems on this and other things — is: what should things look like? Should the answer be a vector, a matrix? What size should it be? So for ds/dW: W is an n × m matrix, and s is a scalar, so since we have one output and n × m inputs, the answer according to the math should be that we've got a 1 × nm Jacobian, i.e. a big long row vector.

[00:38:22] But here's where things get a teeny bit tricky, and we end up with this weird mess of math and engineering convenience. Because, you know, immediately what we're wanting to do is take our old parameters, which will be stored in the form of matrices, vectors, and so on, that we're using as coefficients, and we're going to want to subtract from them a fraction of our calculated gradient. So what we'd like is to have our calculated gradients in the same shapes as our parameters, because then we can just do subtraction; whereas if they've turned into a God Almighty row vector, that's not quite so convenient.

[00:39:17] So it turns out that what we end up doing is using something that gets referred to as the shape convention: we reshape our Jacobians so they fit into things that are of the same shape as the parameters that we're using. So we're going to represent ds/dW as an n × m matrix, laid out as follows — and that's a place where people can get confused. Okay, so that's what we want to calculate, that kind of matrix, and that matrix is going to be delta times dz/dW.

[00:40:03] So delta is going to be part of the answer, and then we want to know what dz/dW is. And the answer is going to come out like this: ds/dW is going to be delta^T x^T. So it's going to be the product of the upstream gradient — which was the same thing we calculated before for the other two quantities — and then a local input signal, which here comes out to x^T. And, you know, since we're taking the transposes of those two vectors, we end up calculating an outer product of those two vectors, which gives us our gradient. And why is that the right answer? Well, you know, it kind of looks convenient, 'cause it's giving us something of the right shape for what I was arguing we want to find out, and we have the right number of terms. Now, I'm going to rush through this, so I encourage you to read the lecture notes and do this more carefully.
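Continuing the same sketch: ds/dW = delta^T x^T is an outer product whose (i, j) entry is delta_i · x_j. Here's a finite-difference check on a single weight (with 0-based indexing, `W[1, 2]` plays the role of the lecture's w_23; sigmoid is still my placeholder for f):

```python
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))
df = lambda z: f(z) * (1.0 - f(z))

rng = np.random.default_rng(3)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

score = lambda W: u @ f(W @ x + b)

delta = u * df(W @ x + b)     # upstream gradient, same as before
grad_W = np.outer(delta, x)   # ds/dW = delta^T x^T: an n x m matrix

# Finite-difference check on one entry, W[1, 2] (the lecture's "w_23"):
eps = 1e-6
dW = np.zeros_like(W)
dW[1, 2] = eps
fd = (score(W + dW) - score(W)) / eps
assert abs(grad_W[1, 2] - fd) < 1e-4
assert grad_W.shape == W.shape   # matches the shape convention
```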
[00:41:13] But let me at least explain a little bit why it makes sense. Right, so if you think of one weight: all of these connections are our matrix, right? The matrix is being represented by all these lines in your network. So if you think of one number in the matrix — here is w_23, so it's connecting from input 3, or it's multiplying input 3, to give part of the answer of h_2. Right, so it's this line here. And for this line here, this weight is being used only in the calculation of h_2, and the only thing it depends on is x_3. So if you're then wanting to work out the partial of h_2 — or z_2, sorry, yeah — the partial of z_2 with respect to w_23, it's sort of depending on these two pieces only, and that's what you're achieving by working out the outer product like that.

[00:42:35] Okay, so let me just come back one more time to this sort of question of the shape of derivatives. You know, I already sort of fudged it when I was talking about, oh, should I put the transpose there, or should I not and get a row vector versus a column vector. So there's sort of this disagreement between whether you have the Jacobian form, which is what actually makes the chain rule work, right, in terms of doing multiplication — versus the shape convention, which is how we store everything for our computations, and makes doing stochastic gradient descent, where you're subtracting whatever kind of tensor you have, easy.

[00:43:32] So, you know, this can be a source of confusion. Since we're doing a computer science course, for the answers in the assignment we expect you to follow the shape convention.
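The payoff of the shape convention is that a gradient step is a plain elementwise subtraction — a tiny sketch (the learning rate, shapes, and random gradient are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 4))        # parameter matrix
grad_W = rng.normal(size=(3, 4))   # gradient stored in the SAME shape
lr = 0.01

# In Jacobian form the gradient would be a 1 x 12 row vector, which has
# to be reshaped back to (3, 4) before the SGD update can happen.
jacobian_row = grad_W.reshape(1, -1)             # Jacobian form: 1 x (n*m)
W_new = W - lr * jacobian_row.reshape(W.shape)   # reshape, then subtract

assert np.allclose(W_new, W - lr * grad_W)       # same one-line update
```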
[00:43:46] If you're working out the derivatives with respect to some matrix, it should be shaped like a matrix, with the same shape as the parameters. But, you know, you may well want to think about Jacobian forms in computing your answers. I mean, there are sort of two ways to go about doing this. One way is to work out all the math using Jacobians, à la Math 51, and at the end just reshape it so it fits into the same shape as the parameters, according to our shape convention. I mean, the other way is to do each stage following the shape convention, but then you sort of have to be game to reshape things as needed, by doing transposes to have things work out at the different stages.

[00:44:33] Okay, that was my attempt to quickly review the math. Most people are still here — I will now go on to the second half, on how we do the computation.
[00:44:55] Right, so the famous thing that powers neural networks is the backpropagation algorithm. And the backpropagation algorithm is really only two things. You know, its invention made people famous because it gave an effective learning algorithm, but at a fundamental level the backpropagation algorithm is only two things. Thing one is: you use the chain rule — you do calculus of composed functions. And thing two is: you store intermediate results, so you never recompute the same stuff again. That's all there is to the backpropagation algorithm.

[00:45:40] And so let's just go through that. So if we're computationally wanting to deal with, you know, functions and doing backpropagation, we can think of them as being represented as a graph, and in some way or another this kind of graph is being used inside your neural network framework. So here is a re-representation of my little neural network for finding whether the word at the center is a location. So I'm taking the x vector input, I'm multiplying it by W, I'm adding b to it, I'm putting it through the nonlinearity, and then I'm doing the dot product with my vector u, right? So that was my computation.

[00:46:27] And so the source nodes are the inputs in this graph; the interior nodes are then the operations I do; and so then the edges that connect those together pass along the result of each operation. So I pass along Wx to the addition function with b; that gives me z, which I pass through the nonlinearity, which gives me h, which I then dot-product with u to get s. Okay, so I do precisely this computation, and this is referred to as forward propagation, or the forward pass, of a neural network. So the forward pass just calculates functions.
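That forward pass, step by step, with each edge passing one intermediate result along — a sketch (the window vector x, sizes, and sigmoid are placeholder choices, not from the lecture):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # some nonlinearity

rng = np.random.default_rng(5)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)
b = rng.normal(size=n)
u = rng.normal(size=n)

# Forward propagation through the graph:
Wx = W @ x    # multiply by W
z = Wx + b    # add b
h = f(z)      # apply the nonlinearity
s = u @ h     # dot product with u -> a scalar score

assert np.ndim(s) == 0   # s is a single number
```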
[00:47:16] Okay, but then once we've done that, what we want to do is work out gradients, so we can do gradient-based learning, and so that part is then referred to as backpropagation, or the backward pass, and then we run things backward. So for running things backward, we're going to use the same graph, and we're going to pass gradients along it backwards. And so we start at the right-hand side, and we have ds/ds — and ds/ds is just one, because, you know, if you change s, you've changed s. And then what we want to do is sort of work further back, so we can work out ds/dh, ds/dz, ds/db, ds/dW, ds/dx as we work back. So this is what we want to work out with gradients.

[00:48:14] And so how are we going to do that? Well, if we look at a single node — so for example our nonlinearity node, but any node where h = f(z) — what we can have is an upstream gradient, ds/dh, and what we want to do is calculate the downstream gradient of the next variable down, ds/dz. And the way that we're going to do that is we're going to say: well, let's look at f — what is f's gradient? That's going to be our local gradient, and then this is immediately what gives us the chain rule: ds/dz is going to be the product of our upstream gradient ds/dh times dh/dz, the local gradient that we calculate at that node. So: downstream gradient equals upstream gradient times local gradient.

[00:49:21] Oh yeah — that's what it says when I press that again. Okay, so this is sort of the single-input, single-output case, though those inputs might be vectors or matrices or something like that. We then have sort of more complex graph cases — I think I should have retitled this slide — so, sorry, the next case is where our node might have multiple inputs.
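When a node has multiple inputs, each input gets the same upstream gradient times its own local gradient. A tiny sketch with a product node f = a · b (values chosen to match the worked example coming up):

```python
# A product node f = a * b, with an upstream gradient flowing in from above.
upstream = 1.0    # df/df at the output of this node

a, b = 3.0, 2.0
local_a = b       # d(a*b)/da = b
local_b = a       # d(a*b)/db = a

# Downstream = upstream x local, computed separately for each input.
downstream_a = upstream * local_a
downstream_b = upstream * local_b

assert (downstream_a, downstream_b) == (2.0, 3.0)
```

Note how the two downstream gradients "swap over" the input values — the gradient for a is b, and vice versa.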
case is for our node it might have [00:49:53] case is for our node it might have multiple inputs so this is where we're [00:49:55] multiple inputs so this is where we're calculating [00:49:57] calculating WX so in that case we still have an up [00:50:01] WX so in that case we still have an up we have a single Upstream gradient and [00:50:04] we have a single Upstream gradient and then what we're going to do is we want [00:50:07] then what we're going to do is we want to calculate the downstream gradient [00:50:10] to calculate the downstream gradient with respect to each input and the way [00:50:13] with respect to each input and the way we're going to do that is we're going to [00:50:15] we're going to do that is we're going to work out the local gradient with respect [00:50:18] work out the local gradient with respect to each input and then we're going to do [00:50:20] to each input and then we're going to do the same kind of multiplication of [00:50:23] the same kind of multiplication of Upstream gradient times local gradient [00:50:27] Upstream gradient times local gradient with respect to each input again um [00:50:30] with respect to each input again um chain [00:50:32] chain rule okay um so here's a little example [00:50:36] rule okay um so here's a little example of this so I'm this isn't really uh the [00:50:40] of this so I'm this isn't really uh the kind of thing you normally see in a [00:50:41] kind of thing you normally see in a neural network but it's an easy example [00:50:44] neural network but it's an easy example so F of XYZ is going to be x + y * the [00:50:48] so F of XYZ is going to be x + y * the max of y z and we've got current values [00:50:53] max of y z and we've got current values of X Y and Z of 1 2 and z respectively [00:50:57] of X Y and Z of 1 2 and z respectively so here's our little computation graph [00:51:00] so here's our little computation graph um and so for forward propagation you [00:51:03] um and so for forward propagation 
[00:51:05] you know, we're going to do this addition, we're going to do this max function, and then we're going to multiply the two, and that gives us the value of f. So we can run that with the current values of x, y and z, and this is what we get: the max of two and 0 is two, the addition is three, the answer is six. Okay, so then after having done that, we run the backward propagation. And yeah, this procedure, you know, is not actually special to neural networks, right? You can use it for any piece of math, if you want to just run your math on PyTorch rather than working it out in your head or with Mathematica. Okay, so now we work out backwards. We want to know the local gradients, so da/dz is going to be one... [00:51:58] sorry, I said that wrong: da/dx is going to be 1, since a = x + y, and da/dy = 1. For the max function, it's going to depend on which of the two is larger, because it's going to have a slope of one for the one that's the biggest and zero for the one that's the smallest. And then for the product, that's like what we saw with vectors: df/da is going to be b, and df/db is going to be a. So those are all our local gradients, and so then [00:52:32] we can use those to calculate out the derivatives. So df/df is one; we then multiply that by the two local gradients that are calculated for a and b, so that gives us two and three, where you're swapping over the numbers. Then for the max, the one that is biggest takes the upstream times one, so it gets three, and the other one gets zero. And then for the plus, we're just sending the gradient down in both directions, and so both of them come out as two. And so that gives us df/dx: the final value is two. For df/dy, we're taking the three and adding the two (I'll mention that again in a minute), which gives us five. And then df/dz is zero. And we should be able to quickly check that we've got this right, right? So [00:53:48] if we consider the slope around z: if we change z a little, say we make z 0.1, that makes absolutely no difference to what the computed function value is, so the gradient there is zero; that's correct. Then if I change up the top, if I change x a little bit, say I change x to 1.1, then I'll be calculating 1.1 + 2 is 3.1, and then I'll be taking the max, which is two, and I'll be calculating 5.1... and, wait, no, I did that wrong... oh, times two... wait, I didn't do the multiplication right, sorry. Yeah, so we get the 3.1, that's multiplied by two, and that gives us 6.2. So a change of 0.1 in x has moved things up by 0.2, and that corresponds to the gradient being two.
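The running example here, f(x, y, z) = (x + y) * max(y, z) at x = 1, y = 2, z = 0, can be written out directly. A minimal sketch in plain Python (my own variable names, not the lecture's slide code):

```python
# f(x, y, z) = (x + y) * max(y, z), evaluated at x = 1, y = 2, z = 0.
x, y, z = 1.0, 2.0, 0.0

# Forward pass: compute each intermediate node.
a = x + y        # a = 3
b = max(y, z)    # b = 2
f = a * b        # f = 6

# Backward pass: apply the chain rule node by node.
df_df = 1.0
df_da = b * df_df    # local gradient of a*b w.r.t. a is b  -> 2
df_db = a * df_df    # local gradient of a*b w.r.t. b is a  -> 3

# a = x + y: the plus distributes the upstream gradient.
df_dx = 1.0 * df_da                              # -> 2
df_dy_via_a = 1.0 * df_da                        # -> 2
# b = max(y, z): the max routes the gradient to the larger input.
df_dy_via_b = (1.0 if y > z else 0.0) * df_db    # -> 3
df_dz = (1.0 if z > y else 0.0) * df_db          # -> 0
# y feeds two branches, so its gradients add up.
df_dy = df_dy_via_a + df_dy_via_b                # -> 5

print(f, df_dx, df_dy, df_dz)    # 6.0 2.0 5.0 0.0
```

Nudging x to 1.1 reproduces the spot check from the lecture: (1.1 + 2) * 2 = 6.2, a change of 0.2 for a 0.1 step, matching df_dx = 2.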
[00:55:03] And so then the final case is: well, what if we change y? So y started off as two; say we made it 2.1. Then we're going to get 2.1 multiplied by... oh sorry, I keep doing this wrong: 2.1 + 1 = 3.1, and then we've got 2.1 as the max, so we've got 2.1 * 3.1, and that comes out to be 6.51. So our 0.1 difference has moved the output up by approximately 0.5 (this is just an estimate), and so that corresponds to the gradient being five, right? We get this five-times multiplication of our changes. [00:56:06] Okay, and so that illustrates the fact that the right thing to do, when you have outward branches in your computation graph and you're running the backpropagation, is that you sum the gradients. So for this case we had y going into these two different things in our previous chart, so once we've worked out the upstream gradients, we sum them to get the total gradient. And that's what we did back here: we had two outward things, and we took these calculated upstream gradients of two and three and we just summed them to get five, and that gave the right answer. [00:56:59] Okay, and so you can think about that just generally, for how gradients move around in these pictures. So when we have a plus operation, the plus just distributes gradient: the same upstream gradient goes to each input. When you have a max, it's kind of like a router of gradient: the max is going to send the gradient to one of the inputs and send nothing at all to the other inputs. And when you have a multiplication, it's a
little bit funky, because you're sort of doing this switching of the forward coefficients: you take the upstream gradient multiplied by the opposite forward coefficient, and that gives you your downstream gradient. [00:58:04] Okay, so we kind of have this systematic way of being able to forward-pass calculate the values of functions, then run this backwards to work out the gradients heading down the network. And so the main other thing about the backpropagation algorithm is just that we want to do this efficiently. So the wrong way to do it would be to say: well, gee, I want to calculate ds/db, ds/dW, ds/dx, ds/du, so let me start doing those one at a time, and when I've done them all I will stop. Because that means if you first calculated ds/db, you'd do all of the part that's in blue, but then if you went on to ds/dW, you'd be calculating all the part in red, and well, just as we saw in the [00:58:59] math part, when we were doing it as math, these parts are exactly the same; you're doing exactly the same computations. So you only want to do that part once, and work out this upstream gradient, or error signal, that is being calculated and then shared. So the picture that we want to have is: you do the shared part together, and then you only do separately the little bits that you need to do. [00:59:36] Okay... boy, I seem to have been rushing through today, and I'm going to actually end early unless anyone is going to slow me down, but I did have just a few more slides to go through. [00:59:49] Yeah, so the generalization of this as an algorithm: in the general case, we normally have these sort of neural network layers and matrices, which you can represent as vectors and matrices, and you know
it's sort of nice and clean, and it looks like doing that in calculus class. [01:00:13] But strictly speaking, that isn't necessary. The algorithm for forward propagation and backward propagation that I've outlined works on a completely arbitrary computation graph, providing it's a DAG, so it doesn't have cycles in it. [01:00:32] So the general algorithm is: well, you've got a whole bunch of variables that depend on other variables, and there's some way in which we can sort them so that each variable only depends on variables to the left of it; that's referred to as a topological sort. And so that means there's a way we can do a forward pass where we're calculating variables in terms of ones that have already been calculated. But if we want to have some extra wonky arcs, so it's not nice matrix multiplies or anything, we're totally allowed to do that; or we can have things not fully connected, right, so there are no connections across here. We can have an arbitrary computation graph, and that gives us our forward propagation. [01:01:24] And then once we've done the forward propagation, we can initialize the output gradient as one, and then we're going to visit the nodes in reverse order, and for each node we're going to compute a gradient by using the upstream gradient and the local gradient to compute the downstream gradient. And so then we can head back down the computation graph and work out all of the downstream gradients. [01:01:57] And the crucial thing to notice is that, if you do it correctly, working out the gradients has the same big-O complexity as working out the forward calculation.
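That recipe (topologically sort the nodes, run the forward pass in order, set the output gradient to one, then visit nodes in reverse, multiplying upstream by local gradients and summing over branches) can be sketched for an arbitrary DAG. This is a toy illustration with my own names (`Node`, `run_graph`), not any framework's actual API:

```python
class Node:
    """One node in a computation graph: a value computed from the values
    of its input nodes, plus local gradients with respect to each input."""
    def __init__(self, inputs, forward_fn, local_grad_fn):
        self.inputs = inputs                 # parent Nodes (empty for leaves)
        self.forward_fn = forward_fn         # input values -> this value
        self.local_grad_fn = local_grad_fn   # input values -> local gradients
        self.value = None
        self.grad = 0.0

def run_graph(output):
    # Topological sort: every node appears after the nodes it depends on.
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.inputs:
                visit(p)
            order.append(n)
    visit(output)

    # Forward pass: each node's inputs are already computed.
    for n in order:
        n.value = n.forward_fn(*[p.value for p in n.inputs])

    # Backward pass: reverse order; downstream = upstream * local,
    # accumulated (+=) over all outward branches.
    for n in order:
        n.grad = 0.0
    output.grad = 1.0
    for n in reversed(order):
        locals_ = n.local_grad_fn(*[p.value for p in n.inputs])
        for p, g in zip(n.inputs, locals_):
            p.grad += n.grad * g

# The same f = (x + y) * max(y, z) example at x=1, y=2, z=0:
x = Node([], lambda: 1.0, lambda: [])
y = Node([], lambda: 2.0, lambda: [])
z = Node([], lambda: 0.0, lambda: [])
a = Node([x, y], lambda u, v: u + v, lambda u, v: [1.0, 1.0])   # plus distributes
b = Node([y, z], lambda u, v: max(u, v),
         lambda u, v: [float(u >= v), float(u < v)])            # max routes
f = Node([a, b], lambda u, v: u * v, lambda u, v: [v, u])       # mult swaps
run_graph(f)
print(f.value, x.grad, y.grad, z.grad)   # 6.0 2.0 5.0 0.0
```

Because `y` feeds both `a` and `b`, its two upstream contributions (2 and 3) accumulate into 5, which is exactly the gradient-summing rule from earlier.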
[01:02:20] Right, you might have different functions depending on what the derivatives are, but in big-O terms, if you're doing more work in the backward pass than you're doing in the forward pass, that means you're somehow failing to do this efficient computation and you're recomputing some of your work. [01:02:41] Okay, so because we have such a good algorithm here, you should be able to just work out the backward pass automatically, and that gets referred to as automatic differentiation. [01:02:53] So if you had the symbolic form of what you're calculating with your forward pass, you should just be able to say: yo, computer, can you work out the backward pass for me? And, you know, kind of mathematically, it could look at the symbolic form of all of your functions, work out their derivatives, and do the entire thing for you. [01:03:26] So early on there was a pioneering deep learning framework, Theano, principally from the University of Montreal, which attempted to do precisely that: you had the entire forward-pass computation stated in symbolic form, and it just did the entire thing for you and worked out the backward pass automatically. But, you know, somehow that sort of proved to be too heavyweight, or hard to deal with for different things, or people just liked to write their own Python or whatever it is, so that idea did not fully succeed. [01:04:08] And what in practice all of the current main frameworks have fallen back on is something that's actually less automated than that. So it's sort of like we've gone backwards in time, but the software's gotten a lot better; really, it's a lot stabler and faster. So all of the modern deep learning frameworks sort of say: look, I will manage the computation graph for you, and I can run the forward propagation pass and
the backward propagation pass, but you're going to have to work out the local derivatives yourself. [01:04:47] So if you're putting in a layer, or putting in, say, a function like an activation function in a neural network, then your Python class that represents that is going to have to tell me what the forward computation is and what the local gradient is, and I'm just going to call your local gradient and assume it's correct. [01:05:16] So there's a bit more that has to be done manually. So the part that's automated, then, is that (not precisely this code, obviously, but roughly) inside the deep learning software it's computing with a computation graph, and it's got a forward and a backward, and it's doing what I presented in the pictures before. [01:05:43] So for the forward pass, it's topologically sorting all the nodes of the graph, and then it's going through them, and for each node in the graph it's calling its forward function, which will be able to compute its local value in terms of its inputs, which have already been calculated, because it's topologically sorted. [01:06:06] And then it's running the backward pass, and in the backward pass you're reversing your topological sort, and then you're working out the gradient, which is going to be the multiplication of the upstream error signal times your local gradient. And so what a human being has to implement is that for anything, whether it's a single gate (here's a multiply gate) or a neural network layer, you have to implement a forward pass and a backward pass. [01:06:37] So here, for my baby example, since we're just doing multiplication, my forward pass is that I just multiply the two numbers and return it. So I'm
specifying that for the local node. [01:06:55] And then the other part is that I have to work out those gradients, and well, we sort of know how to do that, because that's the examples that we've been doing here. But notice that there's sort of a trick here, right? For what I've got now, you kind of can't write down what the gradients are, because backward is just taking as an input the upstream gradient, and you can't work out what the downstream gradients are going to be unless you know what function values you're calculating it at. [01:07:32] So the standard trick, which is how everyone writes this code, is that you rely on the fact that the forward is being calculated before the backward, and so your forward method shoves the values of the inputs into some local variables of the class, and then you have them available. So when you get to the backward pass, you can do what we did before: the dx is going to be the upstream error signal times the opposite input, and similarly for dy, and that's going to give us the answer. [01:08:17] Okay, just two last things then to mention. Yeah, so doing this, you need to get the math right for what's the derivative of your function, so that you get the right backward calculation. [01:08:34] So the standard way to check that you've got the right backward calculation is to do manual gradient checking with numeric gradients. The way you do that, like for the couple of examples I did when I said, oh, let's check it by going from 1 to 1.1, what should the slope be approximately, is to do that in an automated way. So we're going to say: at the value x, let's estimate what the gradient should be, and the way to do that is to pick a small h.
[01:09:14] There isn't a magical number for h, because it depends on the function, but typically, for neural networks, around 10^-4 is good. So you take a small h and work out the function value (the forward part) at x + h and at x - h, divided by the run, which is 2h, and that should give you an estimate of the slope, of what the backward pass is calculating. And you want those two numbers to be approximately equal, you know, within some 10^-2 of each other, and then probably you're calculating the gradient right; and if they aren't equal, then you've probably made a mistake. [01:09:59] Yeah, so note that this formula differs from the version I did for my examples: there I just compared x with x + h, right, a one-sided estimate, which is normally what you get taught in a math class. If you're doing this to check your gradients numerically, you're far, far better off doing this two-sided estimate, because it's much more accurate and stable when you're stepping equally to both sides of your x. [01:10:32] Yeah, so this looks easy to do, so if it's just so good, why doesn't everyone do this all the time and forget about calculus? You know, the reason you don't want to do this is that doing this is incredibly slow, right, because you have to repeat this computation for every parameter of your model, so you're not getting the kind of speed-ups you're getting from the backpropagation algorithm. But it's useful for checking your implementation is correct. [01:11:02] You know, in the old days, before frameworks like PyTorch, we used to write everything by hand, and people often got things wrong. Nowadays it's less needed, but it's good to check, if you've implemented your own new layer, that it's doing the right thing.
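The two pieces just described (a gate whose forward caches its inputs for backward, and the two-sided numeric check with a small h) take only a few lines. A sketch under my own class shape, in the spirit of, but not identical to, any real framework's interface:

```python
class MultiplyGate:
    """A single gate: forward caches its inputs so backward can use them."""
    def forward(self, x, y):
        self.x, self.y = x, y      # the standard trick: stash the inputs
        return x * y

    def backward(self, upstream):
        # Downstream gradient = upstream error signal * the opposite input.
        return upstream * self.y, upstream * self.x

def numeric_grad(fn, x, h=1e-4):
    """Two-sided estimate (f(x+h) - f(x-h)) / 2h: more accurate and stable
    than the one-sided (f(x+h) - f(x)) / h version."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)     # 6.0
dx, dy = gate.backward(1.0)      # analytic gradients: 2.0 and 3.0

# Gradient check: analytic vs numeric should agree to within ~1e-2.
nx = numeric_grad(lambda v: gate.forward(v, 2.0), 3.0)
ny = numeric_grad(lambda v: gate.forward(3.0, v), 2.0)
assert abs(dx - nx) < 1e-2 and abs(dy - ny) < 1e-2
```

This is why backward can be written purely as a function of the upstream signal: the cached self.x and self.y from the earlier forward call supply the rest.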
[01:11:25] Okay, yeah, so that's everything that we need to know about neural nets: backpropagation is the chain rule applied efficiently. The forward pass is just function application; the backward pass is the chain rule, applied efficiently. [01:11:38] So, you know, we're going to inflict pain on our students by making them do some math and calculate some of these things and do the homework, and I know that'll be harder for some of you than for others. You know, in some sense you don't actually need to know how to do this. The beauty of these modern deep learning frameworks is that they'll do it all for you: they predefine common layer types, and you can just plug them together like pieces of Lego and they'll be computed right. [01:12:11] And this is precisely the reason that high school students across the country and the world can now do deep learning projects for their science fairs: because you don't actually have to understand any of this math, you can just use what's given to you. But we kind of want to hope that you actually do understand something about what's going on under the hood and how neural networks work, so therefore we make you suffer a little bit. [01:12:41] And of course, if you're wanting to look at and understand more complex things, you need to have some sense of what's going on. So later on, when we get on to recurrent neural networks, we'll talk a bit about things like exploding and vanishing gradients, and if you want to have some understanding about why things aren't working and things are going wrong, then you want to know what it's actually calculating, rather than just thinking it's all black-box magic. And so that's why we hope to have taught something about that. [01:13:13] Okay, I think I'm done, if the audience is sufficiently stunned, and we can
stop for today. Okay, thank you.

================================================================================ LECTURE 004 ================================================================================

Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 4 - Dependency Parsing
Source: https://www.youtube.com/watch?v=KVKvde-_MYc
---
Transcript

[00:00:05] Okay, hi everyone. Okay, so for today we're going to, you know, I guess do a 180 from where we were on Tuesday. So today I'm going to talk about syntactic structure, the linguistic structure of human language sentences, dependency parsing, and then how you go about building dependency parsers. So we're solidly in the linguistics zone today. How many people in the audience have done a linguistics class? Yay, okay, so some people have done linguistics classes, okay, great. And for the rest of you, well, you know, this is your chance to see a little bit of human language structure, and if you like it you can enroll in a linguistics class later on. Yeah, so, oops. So, you know, so assignment two we
handed out on Tuesday. So in the second half of assignment two, what your job is, is to build a neural dependency parser using PyTorch. As we'll sort of come to later on, really the bit that you have to build is just the machine learning bit of making decisions, and really we give you most of the rest of the neural dependency parser. But this is also then a chance to remind you that assignment two, in that second half, uses PyTorch, one of the leading deep learning frameworks. So if you're not familiar with that, it'd be a really good idea to also go along to the Friday PyTorch tutorial, though we have tried to make assignment two so it's a fairly good place for learning PyTorch as you go along. We'll say more soon about final projects, but you're certainly already encouraged to come and meet with TAs or me about final projects, and we're putting up
information about the TAs, so you can know more about them, on the office hours page. [00:02:10] Okay, so let's get straight into it and start looking at linguistic structure. So in thinking about the linguistic structure of human languages, there are two primary ways that people have thought about it. One way is using the idea that linguists normally call phrase structure, which is then represented in terms of what computer scientists normally know as context-free grammars. So I'm going to spend a couple of minutes going over that view of things, but you know, actually it's not the main one that I'm going to talk about in this class. I'm going to spend most of this class talking about an alternative way of thinking about things, called dependency grammars. There are actually some correspondences you can make between the two
ways of thinking about things, but I'm not going to go into those here today. [00:03:08] So for the constituency grammar, or phrase structure, version of things, the way that you go about thinking about the structure of human languages is: well, there are words. Languages have lots of words, hundreds of thousands of words, but it seems like a lot of the words, nearly all the words in fact, fall into a few basic classes that represent their nature and how they behave in sentences. So for words like the examples here, we have nouns: cat is a noun, door is a noun, but you know, something like linguistics is also a noun. So we have nouns, and then we have other kinds of words. Something like cuddly is an adjective, a word that can modify nouns. And then we have 'the', for 'the cuddly cat'. 'The' is sort of a slightly more complex one as to how to name it; normally in modern linguistics we refer to words like that as
a determiner. [00:04:18] You might also see the name article, and sometimes, when people try to shoehorn human language into eight part-of-speech categories, they say it's an adjective, but it doesn't really behave like regular adjectives. And then we have words like by, or through, or on, and to, and ones like that, and so they're then prepositions. Right, so we have these classes, with lots of words fitting into each class, and so they're referred to conventionally as parts of speech. But then once we've got words, we start putting them into bigger units. So 'the cuddly cat' is some kind of unit, and it seems like this is an explication of a noun, cat, and so this gets referred to as a noun phrase. And then 'by the door', well, this is a phrase, but actually it has inside it 'the door', and that's a noun phrase, but this bigger unit here of by the
door is then a prepositional phrase. [00:05:19] And we can continue to build bigger units. So inside this we have the phrase that we've already looked at, with the noun phrase and a prepositional phrase, but then we can have another noun phrase, 'the cuddly cat', and we can put them together and build a bigger noun phrase, 'the cuddly cat by the door'. And so to represent this, you can start to write a phrase structure grammar, or a context-free grammar, that represents what the possibilities are for building up sentences here in English; similar kinds of phrase structure grammars can be written for other languages. So this is sort of starting to give you possible structures for a noun phrase. So you can have: a noun phrase just goes to a determiner followed by a noun. But then, as well as 'the cat' and 'a dog', you can have 'the large cat', so you might say that, okay, rather than that, I might
want to have as a better rule that a noun phrase goes to a determiner, an optional adjective, and then a noun. [00:06:33] If you think about it, you can sort of have multiple adjectives, so you can have 'the large green cat' or something like that. So you can really get multiple adjectives stacking up, and that sort of star, the Kleene star, says you can have lots of them: 'the large cuddly green cat'. But then you can stick things after the noun phrase, so you can put in these prepositional phrases, like 'in a crate'. So we might also want to say that a noun phrase can be rewritten as a noun phrase followed by a prepositional phrase, where a prepositional phrase can be represented by a preposition followed by a noun phrase. And somewhere we're also going to want to represent our parts-of-speech membership, so a determiner can go to words like 'a' or 'the', and an adjective can go to words
like large or cuddly, or many other words that I'm not going to write down, and a preposition can go to words like in, on, under, etc. [00:07:52] Okay, so now I've got a little grammar here, and this little grammar could sort of generate everything I've got in these sentences. Well, actually it can do this one too; it can do the 'large barking' one, where there are multiple modifiers. But then if I start going beyond these noun phrases and, say, think of sentences like 'talk to the cat' or 'talk to the large cuddly dog by the door', well, now I've got here a verb, talk, and I've again got a preposition. So I might then have more rules that say I can have a verb phrase, and the verb phrase can go to a verb and then a prepositional phrase, and then that could explain these two sentences as well. And in this kind of way I can start to build up a grammar of the structure of English sentences as a context-free grammar.
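The little grammar built up so far can be written down and checked in code. Here is a minimal sketch of my own (not the assignment's parser; the word lists and function names are just illustrative), treating the rules as NP → Det Adj* N (PP)*, PP → P NP, and VP → V PP:

```python
# A recognizer for the toy grammar from the lecture -- my own minimal sketch.
# Rules:  NP -> Det Adj* N (PP)*    PP -> P NP    VP -> V PP
DET = {"a", "the"}
ADJ = {"large", "cuddly", "green", "barking"}
N   = {"cat", "dog", "door", "crate", "kitchen"}
P   = {"by", "in", "on", "under", "to"}
V   = {"talk", "look"}

def parse_np(words, i):
    """Try to parse an NP starting at position i; return the end position or None."""
    if i < len(words) and words[i] in DET:
        i += 1
        while i < len(words) and words[i] in ADJ:  # Adj* -- the Kleene star
            i += 1
        if i < len(words) and words[i] in N:
            i += 1
            while True:                            # NP -> NP PP, done iteratively
                j = parse_pp(words, i)
                if j is None:
                    return i
                i = j
    return None

def parse_pp(words, i):
    """PP -> P NP."""
    if i < len(words) and words[i] in P:
        return parse_np(words, i + 1)
    return None

def accepts_vp(sentence):
    """True if the whole sentence is a VP -> V PP under the toy grammar."""
    words = sentence.lower().split()
    if not words or words[0] not in V:
        return False
    return parse_pp(words, 1) == len(words)

print(accepts_vp("talk to the cat"))                           # True
print(accepts_vp("talk to the large cuddly dog by the door"))  # True
```

Because NP → NP PP is left-recursive, the sketch attaches trailing prepositional phrases in a loop rather than by recursion; a recognizer like this only says whether a string fits the grammar, it does not enumerate the (possibly many) parse trees.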
Make sense? Yeah, okay. [00:08:55] And so that's what has been quite commonly done in linguistics and elsewhere. Okay, uh, yeah, so let me just do that once more with this one. So one thing I can do here is say, oh, I'm going to look at this with its phrase structure, and if I write it upside down, to give myself some space for later, you know, I could start making a phrase structure for this sentence. I'll start to run out of space, but I can sort of start to make this phrase structure of the sentence. So that's phrase structure. But there's another form of representation that has been fairly widely used in linguistics, and has been commonly used in NLP, and that we're going to use for the parsers we build, and that's dependency structure. So dependency
structure represents things in a slightly different way. [00:10:11] It thinks about which words are the main word, or head, and then which words they take as modifiers or arguments. So for 'look in the large crate in the kitchen by the door', well, this is describing a looking command, so the head of the whole thing is 'look'. And then 'look' is taking one or more arguments or modifiers, and well, what the looking is saying here is: what you want to do is look in the large crate. So we are looking in something, and what we're looking in is a crate. And then the crate has some modifiers: it's a large crate, it's the large crate. And then the crate is also placed somewhere; it's placed in the kitchen, so that 'in the kitchen' is also modifying 'crate'. And then we've got over here the 'by the door'; well, the 'by the door' is also modifying 'crate', so we've
also got a link down over to here, and that gives us our piece of structure here. [00:11:32] Which, having filled that in, makes me realize I actually got it wrong when I was doing the constituency representation, whoopsie. In the constituency representation I made 'the kitchen by the door' into a phrase, and that was actually wrong, whoops, bad me. So what I should have actually had was: we had another prepositional phrase that went to a noun phrase, for 'in the kitchen', and then both of those were coming off a bigger noun phrase, like that. Whoopsie. Okay, I get it right most of the time. Okay, but so this idea of dependency structure is that we're sort of finding what is the head word, and then we're saying which things modify the head word, and either of these representations we can use to work out what the structure of sentences is, in terms of which words go together and which words modify other words.
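The head-and-modifier analysis just built on the board can be written down as one head index per word. This is my own encoding of the idea, following the older convention where the preposition heads its noun phrase (Universal Dependencies would instead make the noun the head and attach the preposition to it as a case marker):

```python
# The dependency analysis of "look in the large crate in the kitchen by the
# door" as a head table -- my own sketch. Word positions are 1-based; head 0
# marks the root of the sentence.
SENT = ["look", "in", "the", "large", "crate",
        "in", "the", "kitchen", "by", "the", "door"]
HEADS = {
    1: 0,   # look    <- ROOT
    2: 1,   # in      <- look    ("look in ...")
    3: 5,   # the     <- crate
    4: 5,   # large   <- crate   (it's a large crate)
    5: 2,   # crate   <- in
    6: 5,   # in      <- crate   (the crate in the kitchen)
    7: 8,   # the     <- kitchen
    8: 6,   # kitchen <- in
    9: 5,   # by      <- crate   (the crate ... by the door)
    10: 11, # the     <- door
    11: 9,  # door    <- by
}

def dependents(heads, h):
    """All words whose head is position h, in sentence order."""
    return [SENT[d - 1] for d in sorted(heads) if heads[d] == h]

print(dependents(HEADS, 5))  # ['the', 'large', 'in', 'by'] -- crate's modifiers
```

Because every word gets exactly one head, the whole analysis is just one integer per word, and that is essentially the representation a dependency parser has to predict.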
[00:12:43] And so the basic idea is: when humans communicate, we communicate in a linear stream. So if it's conventional writing systems, it's a linear stream of words that you're reading; if it's spoken language, like you're understanding me speaking right now, it's not a linear stream of words, it's a linear sound stream. And you know, when people speak, there isn't white space between words. When people speak, occasionally people pause at the end of a clause or sentence or something, but in general I'm just sort of speaking continuous words that run one into each other, so that there's a linear sequence of sounds coming out of my mouth, and you have to deal with all of that. But if you're then thinking, oh gee, I can actually understand Chris talking, then somehow you're taking that
linear stream and you're turning it into a meaning, where certain words are modifying other words, and you have these bigger units, like constituents, that are part of understanding the meaning of the sentence. And so human listeners need to work out what modifies what to be able to understand sentences correctly. [00:14:04] And so, similarly, our models need to be able to understand sentence structure in order to be able to interpret language correctly. And so what we're going to be doing for building dependency parsers is we're going to be explicitly building a neural network model that says: let's find the structure of these sentences. In a way, we actually move away from that later on, because when we move into Transformer language models, they just take in the sequence of words, but actually, inside the parameters of the neural network, they're recognizing and building the same kind
of structural units, and we'll talk about that later in the class. [00:14:47] To give you more of a sense of how understanding what modifies what is important for interpretation, here are a few funny examples from newspaper headlines. And they're funny examples because these sentences don't just have one way of interpreting them. When you have a sequence of words, commonly in human languages, sequences of words are ambiguous, and it's relying on human interpretation of what makes sense and what goes together to work out how to read them. So here's a first example: 'scientists count whales from space'. Now, that's ambiguous, and you can give this two possible readings. So how can you give this headline two possible readings? Yeah, one is that the scientists are in space counting whales, and the other one is that they're
whales from space. [00:15:54] Yeah, so one possibility is: we've got this prepositional phrase here, and one possibility is that this prepositional phrase is modifying, yeah, it's modifying 'whales', so they're whales from space. And the other possibility is that it's the counting that's happening from space, so the scientists are counting them from space. Okay, so that corresponds to my two pictures here. So in one picture, it's the counting that is happening from space, which is actually the right interpretation of what the article is about. But in the other interpretation, we have space whales, and the scientists are counting the space whales that are arriving, or something like that, and so then we have the 'from space' modifying the 'whales'. [00:16:55] Okay, so what we have here, right, is a prepositional phrase which comes after a noun phrase.
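The two readings of the headline differ in a single head. One way to make that concrete is a head-index table, one head per word; this is my own illustration (0 marks the root, and, as an assumed convention, the preposition heads its noun phrase):

```python
# "Scientists count whales from space": two dependency analyses that differ
# only in where the PP "from space" attaches -- my own sketch, 1-based word
# positions, head 0 = root.
WORDS = ["scientists", "count", "whales", "from", "space"]

HEADS_COUNTING_FROM_SPACE = {1: 2, 2: 0, 3: 2, 4: 2, 5: 4}  # PP modifies the verb
HEADS_SPACE_WHALES        = {1: 2, 2: 0, 3: 2, 4: 3, 5: 4}  # PP modifies "whales"

def attachment_of(heads, dep):
    """The word that position dep attaches to ('ROOT' if it is the root)."""
    h = heads[dep]
    return "ROOT" if h == 0 else WORDS[h - 1]

print(attachment_of(HEADS_COUNTING_FROM_SPACE, 4))  # count  -- counting from space
print(attachment_of(HEADS_SPACE_WHALES, 4))         # whales -- space whales
```

Everything else in the two tables is identical; the whole ambiguity is the head of word 4, which is exactly the kind of decision a dependency parser has to make.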
It's just a one-word noun phrase here, 'whales', that's fine, and then before that is a verb. [00:17:06] And so one place in English where you get a lot of ambiguities is from these prepositional phrases, because whenever you get prepositional phrases, and prepositional phrases are really common in English if you think about it, whenever you get them like this, it's always ambiguous as to (oops), it's always ambiguous as to what earlier thing in the sentence they're a dependent of. And so, you know, you can sort of put in another prepositional phrase, 'in the morning' or something like that, and then the ambiguities just multiply. And so the important thing to notice here about human language is that human language is, in syntactic terms, globally ambiguous, right? So in programming languages you have local ambiguities of interpretation. How many people have done a compilers class? I think very few these days. Anyone done
a compilers class? Okay, it looks like fewer people have done a compilers class than a linguistics class; that's interesting. Okay, well, I won't make too many analogies to compilers classes. You know, when I was young, that was still the old days kind of CS curriculum, where writing interpreters and compilers was seen as the mainstay of computer science education, but no more, I guess. [00:18:40] Yeah, so in programming languages you can have a local ambiguity, but ambiguities are always resolved, right? So we have simple rules in programming languages, that 'else' is construed with the nearest 'if'. You know, it's a bit different in Python because of its indentation, but you know, there are rules, so that there's never global ambiguity in a programming language. But human languages just aren't like that, right? There's
nothing that resolves which of these two [00:19:12] nothing that resolves which of these two readings is correct if I made it a [00:19:14] readings is correct if I made it a bigger sentence that'd still be [00:19:16] bigger sentence that'd still be ambiguous you're just sort of meant to [00:19:18] ambiguous you're just sort of meant to read it and use context in your [00:19:20] read it and use context in your intelligence to decide um what's going [00:19:23] intelligence to decide um what's going on and so to take a a bigger but real [00:19:27] on and so to take a a bigger but real example um this is the kind of boring [00:19:31] example um this is the kind of boring sentence that you can read in the Wall [00:19:33] sentence that you can read in the Wall Street Journal most mornings um the [00:19:36] Street Journal most mornings um the board approved its acquisition by Royal [00:19:38] board approved its acquisition by Royal Trustco limited of Toronto for $27 a [00:19:41] Trustco limited of Toronto for $27 a share at its monthly meeting um so um [00:19:45] share at its monthly meeting um so um what you can see in this sentence is [00:19:47] what you can see in this sentence is we've got a verb and then we've got a [00:19:50] we've got a verb and then we've got a noun phrase and then what are what after [00:19:53] noun phrase and then what are what after that we have four prepositional phrases [00:19:56] that we have four prepositional phrases in a row okay so what do these [00:19:58] in a row okay so what do these prepositional phrases modify so what [00:20:01] prepositional phrases modify so what does by Royal Trustco limited [00:20:06] modify the acquisition right so it's the [00:20:09] modify the acquisition right so it's the acquisition by Royal Trust Co then of [00:20:13] acquisition by Royal Trust Co then of Toronto [00:20:15] Toronto modifies so it's Royal Trustco limited [00:20:18] modifies so it's Royal Trustco limited of [00:20:19] of Toronto um so yeah 
later on [00:20:22] Toronto um so yeah later on prepositional phrases can also modify [00:20:25] prepositional phrases can also modify earlier prepositional phrases or at [00:20:27] earlier prepositional phrases or at least the noun phrase inside them Royal [00:20:29] least the noun phrase inside them Royal Trustco limited okay for $27 a [00:20:34] Trustco limited okay for $27 a share is back to modifying the [00:20:38] share is back to modifying the acquisition okay at its monthly [00:20:41] acquisition okay at its monthly meeting is is yeah it's the approval so [00:20:45] meeting is is yeah it's the approval so it's gone way back up to there right um [00:20:49] it's gone way back up to there right um but you know um yeah so you know so if [00:20:53] but you know um yeah so you know so if you start having sentences um with a [00:20:56] you start having sentences um with a whole bunch of prepositional phrases [00:20:58] whole bunch of prepositional phrases like this you can start getting more and [00:21:01] like this you can start getting more and more ambiguities of attachment I mean [00:21:04] more ambiguities of attachment I mean you don't get the [00:21:06] you don't get the full you don't get the sort of full free [00:21:09] full you don't get the sort of full free choice factorial number of attachment [00:21:12] choice factorial number of attachment points because there is a restriction [00:21:15] points because there is a restriction that these dependencies don't cross um [00:21:19] that these dependencies don't cross um so once you've gone back further you [00:21:21] so once you've gone back further you have to stay equally far back or go even [00:21:23] have to stay equally far back or go even back further back again but nevertheless [00:21:26] back further back again but nevertheless so the number of readings you get is [00:21:28] so the number of readings you get is the Catalan series which is a series you see [00:21:31] the Catalan series which is a series you
see in a whole bunch of other places if [00:21:32] in a whole bunch of other places if you've done any graph theory or anything [00:21:34] you've done any graph theory or anything like that you know if you're doing [00:21:36] like that you know if you're doing triangulations you get um Catalans [00:21:39] triangulations you get um Catalans because you get the same property um [00:21:41] because you get the same property um that things don't cross so it's an [00:21:43] that things don't cross so it's an exponentially growing um sequence of [00:21:46] exponentially growing um sequence of possible readings and so it quickly gets [00:21:49] possible readings and so it quickly gets very big so I think when you you've got [00:21:53] very big so I think when you you've got four prepositional phrases you get 13 [00:21:56] four prepositional phrases you get 13 readings and if you have five you get 27 and [00:22:00] readings and if you have five you get 27 and you know grows up from there so you get [00:22:02] you know grows up from there so you get a lot of ambiguities but the crucial [00:22:04] a lot of ambiguities but the crucial thing to notice is you know human beings [00:22:07] thing to notice is you know human beings read sentences like this every morning [00:22:10] read sentences like this every morning or at least people who work in banking [00:22:12] or at least people who work in banking do while you know having their Corn [00:22:14] do while you know having their Corn Flakes and you know they their brain [00:22:17] Flakes and you know they their brain doesn't explode trying to think about [00:22:19] doesn't explode trying to think about the 13 different readings and which one [00:22:21] the 13 different readings and which one is correct right we just sort of do this [00:22:24] is correct right we just sort of do this as we go along and it's sort of obvious [00:22:27] as we go along and it's sort of obvious um okay let's just do a couple more [00:22:29] um okay let's
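The exponential growth described here can be checked numerically. A minimal sketch using the standard closed form for the Catalan numbers (the exact reading counts quoted in the lecture depend on precisely which attachments are counted, but the Catalan-style growth is the point):

```python
from math import comb

def catalan(n):
    # n-th Catalan number: C_n = (2n choose n) / (n + 1).
    # Catalan numbers count non-crossing structures, e.g. polygon
    # triangulations, and likewise non-crossing PP attachments.
    return comb(2 * n, n) // (n + 1)

# Grows roughly like 4^n / n^1.5, so ambiguities multiply fast
# as prepositional phrases are added.
print([catalan(n) for n in range(1, 7)])  # [1, 2, 5, 14, 42, 132]
```

The non-crossing restriction is what drops the count from the full factorial number of free-choice attachment points down to this series.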
just do a couple more examples of where we get ambiguities in [00:22:33] examples of where we get ambiguities in um in human language so a different one [00:22:36] um in human language so a different one you get is coordination scope ambiguity [00:22:40] you get is coordination scope ambiguity so shuttle veteran and longtime NASA [00:22:42] so shuttle veteran and longtime NASA executive Fred Gregory appointed to [00:22:44] executive Fred Gregory appointed to board how is this sentence [00:22:47] board how is this sentence ambiguous does it mean two people or one [00:22:50] ambiguous does it mean two people or one person yeah so there can either be one [00:22:54] person yeah so there can either be one person Fred Gregory and they're both a [00:22:57] person Fred Gregory and they're both a shuttle veteran and a NASA [00:23:00] shuttle veteran and a NASA executive or it can be that there are [00:23:03] executive or it can be that there are two people there's a shuttle veteran and [00:23:07] two people there's a shuttle veteran and um there's a longtime NASA executive [00:23:10] um there's a longtime NASA executive Fred [00:23:11] Fred Gregory okay yeah so and we'd be kind of [00:23:14] Gregory okay yeah so and we'd be kind of capturing those by having extra grammar [00:23:17] capturing those by having extra grammar rules where a noun phrase can go to a [00:23:19] rules where a noun phrase can go to a noun phrase a conjunction and a noun [00:23:22] noun phrase a conjunction and a noun phrase um but then another another thing [00:23:26] phrase um but then another another thing that you get in English is um apposition [00:23:29] that you get in English is um apposition so you can have a noun phrase that's a [00:23:32] so you can have a noun phrase that's a descriptive noun phrase of another noun [00:23:34] descriptive noun phrase of another noun phrase like a name um you know the [00:23:37] phrase like a name um you know the author Fred Gregory or something like [00:23:39]
author Fred Gregory or something like that um um saying the word English again [00:23:44] that um um saying the word English again I I I meant to comment um so you know [00:23:48] I I I meant to comment um so you know I'm I'm only going to give English [00:23:49] I'm I'm only going to give English examples here um in different languages [00:23:53] examples here um in different languages you don't get all the same ambiguities [00:23:56] you don't get all the same ambiguities um so if you're familiar with say [00:24:00] um so if you're familiar with say Chinese um you might have thought about [00:24:02] Chinese um you might have thought about the prepositional phrase example of wait [00:24:05] the prepositional phrase example of wait a minute we don't have that one because [00:24:08] a minute we don't have that one because the prepositional phrase modifying the [00:24:10] the prepositional phrase modifying the verb would appear before the verb and [00:24:12] verb would appear before the verb and the object noun would be afterward so it [00:24:14] the object noun would be afterward so it would be completely unambiguous and [00:24:16] would be completely unambiguous and that's true um but you know that doesn't [00:24:19] that's true um but you know that doesn't mean that Chinese is unambiguous Chinese [00:24:22] mean that Chinese is unambiguous Chinese has lots of very bad [00:24:25] has lots of very bad ambiguities and um yeah it's just that [00:24:29] ambiguities and um yeah it's just that you know different languages have [00:24:30] you know different languages have different syntactic structures okay um [00:24:33] different syntactic structures okay um here's so sometimes um in English [00:24:38] here's so sometimes um in English especially when you're sort of in a more [00:24:39] especially when you're sort of in a more written form rather than having an [00:24:41] written form rather than having an explicit coordination word you can [00:24:45] explicit
coordination word you can just sort of use juxtaposition with a [00:24:47] just sort of use juxtaposition with a comma um to have the idea of [00:24:51] comma um to have the idea of coordination so here's a um fun example [00:24:55] coordination so here's a um fun example um from the first Trump Administration [00:24:58] um from the first Trump Administration of how we can have a coordination scope [00:25:01] of how we can have a coordination scope ambiguity um doctor no heart cognitive [00:25:05] ambiguity um doctor no heart cognitive issues um right so again this is the [00:25:08] issues um right so again this is the same kind of coordination scope [00:25:11] same kind of coordination scope ambiguity that it can either be kind of [00:25:13] ambiguity that it can either be kind of no heart and cognitive issues being [00:25:16] no heart and cognitive issues being conjoined together like that or else it [00:25:19] conjoined together like that or else it could be that it's no heart or cognitive [00:25:23] could be that it's no heart or cognitive issues being conjoined um together like [00:25:26] issues being conjoined um together like that you make the choice [00:25:29] that you make the choice um okay uh let's [00:25:33] um okay uh let's see oh this this is this is my risque [00:25:36] see oh this this is this is my risque one for a different kind of ambiguity um [00:25:39] one for a different kind of ambiguity um trigger warning um students get [00:25:42] trigger warning um students get firsthand job [00:25:45] firsthand job experience so this one is also um an [00:25:49] experience so this one is also um an ambiguity um you know as to whether [00:25:51] ambiguity um you know as to whether you're having the um the [00:25:55] you're having the um the firsthand and then both the job and the [00:25:59] firsthand and then both the job and the firsthand are modifying experience or [00:26:03] firsthand are modifying experience or there's this other reading if you have a
[00:26:04] there's this other reading if you have a smutty mind that might come to you [00:26:08] smutty mind that might come to you um okay one more fun one okay mutilated [00:26:12] um okay one more fun one okay mutilated body washes up on Rio Beach to be used [00:26:15] body washes up on Rio Beach to be used for Olympics beach [00:26:17] for Olympics beach volleyball okay so what are what are the [00:26:20] volleyball okay so what are what are the two possible readings of this sentence [00:26:23] two possible readings of this sentence you know these are real examples from [00:26:24] you know these are real examples from quality newspapers [00:26:27] quality newspapers um okay what are the two readings of [00:26:29] um okay what are the two readings of this sentence [00:26:35] yeah so we've got so so the here we have [00:26:39] yeah so we've got so so the here we have one of these [00:26:41] one of these infinitival um so infinitival verb [00:26:45] infinitival um so infinitival verb phrase to be used for Olympic beach [00:26:48] phrase to be used for Olympic beach volleyball and for [00:26:50] volleyball and for these as well you know they kind of have [00:26:53] these as well you know they kind of have the same effect as prepositional phrases [00:26:57] the same effect as prepositional phrases um that they can can modify um different [00:27:00] um that they can can modify um different things um so it can either be the Rio [00:27:03] things um so it can either be the Rio Beach that's going to be used for the [00:27:06] Beach that's going to be used for the Olympic beach volleyball or it's going [00:27:08] Olympic beach volleyball or it's going to be the mutilated body that gets used [00:27:10] to be the mutilated body that gets used for the um beach [00:27:13] for the um beach volleyball okay um yeah so the so these [00:27:16] volleyball okay um yeah so the so these are the kind of ways in which we sort of [00:27:18] are the kind of ways in which we sort of want 
to use um the structure of the [00:27:21] want to use um the structure of the sentence to understand what they're [00:27:23] sentence to understand what they're meaning we also use it in lots of sort [00:27:26] meaning we also use it in lots of sort of just sort of more [00:27:28] of just sort of more practical ways um when we're building [00:27:31] practical ways um when we're building various kinds of natural language [00:27:33] various kinds of natural language processing systems so you know a kind of [00:27:36] processing systems so you know a kind of thing that people often in practical [00:27:39] thing that people often in practical systems do is that they want to get out [00:27:41] systems do is that they want to get out facts of various kinds so for people who [00:27:44] facts of various kinds so for people who um do stuff with bioinformatics that [00:27:47] um do stuff with bioinformatics that they commonly want to get out things [00:27:49] they commonly want to get out things like protein protein interaction facts [00:27:51] like protein protein interaction facts and so commonly you can get those kind [00:27:54] and so commonly you can get those kind of facts out by looking for patterns so [00:27:58] of facts out by looking for patterns so you know have a verb of interacts that's [00:28:00] you know have a verb of interacts that's going to be indicating um an interaction [00:28:03] going to be indicating um an interaction pattern and well it's going to be taking [00:28:05] pattern and well it's going to be taking arguments so it's going to be taking a [00:28:08] arguments so it's going to be taking a subject and interacts with the [00:28:10] subject and interacts with the prepositional argument and so that will [00:28:13] prepositional argument and so that will be um an interaction that KaiC whatever [00:28:16] be um an interaction that KaiC whatever that is interacts with SasA but in [00:28:19] that is interacts with SasA but in this case the SasA is
coordinated with [00:28:21] this case the SasA is coordinated with the KaiA and the KaiB so it's also [00:28:24] the KaiA and the KaiB so it's also going to end up interacting with those [00:28:27] going to end up interacting with those two other things as well and so you can [00:28:29] two other things as well and so you can use the sort of sentence structure [00:28:31] use the sort of sentence structure patterns of a dependency parse to be [00:28:33] patterns of a dependency parse to be getting out the kind of um facts and [00:28:36] getting out the kind of um facts and events that you're interested in for [00:28:38] events that you're interested in for something like an um event understanding [00:28:40] something like an um event understanding system and people you know do these kind [00:28:43] system and people you know do these kind of analyses over biomedical texts [00:28:47] of analyses over biomedical texts to build up the kind of structured [00:28:49] to build up the kind of structured databases of known protein protein [00:28:51] databases of known protein protein interactions and things of that [00:28:54] interactions and things of that sort okay so linguistic structure is [00:28:58] sort okay so linguistic structure is useful um and it's syntactically very [00:29:03] useful um and it's syntactically very ambiguous and so you should think of [00:29:06] ambiguous and so you should think of humans as active interpreters that are [00:29:09] humans as active interpreters that are using their contextual knowledge both of [00:29:12] using their contextual knowledge both of earlier stuff in the text knowledge of [00:29:14] earlier stuff in the text knowledge of the world around them how the world [00:29:16] the world around them how the world works to work out the right um structure [00:29:19] works to work out the right um structure yeah so now I want to go on um and show [00:29:22] yeah so now I want to go on um and show you a bit more about sort of
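The pattern matching over a dependency parse described here can be sketched in a few lines. A toy illustration (the edge triples are hand-built with Universal-Dependencies-style labels, not the output of a real parser), for the lecture's example that KaiC interacts with SasA, KaiA, and KaiB: find the subject of "interacts", collect its oblique argument, and expand coordination so conjoined proteins share the role:

```python
# Dependency edges as (head, relation, dependent) triples for a
# sentence like "KaiC interacts rhythmically with SasA, KaiA and KaiB".
edges = [
    ("interacts", "nsubj", "KaiC"),   # subject of the verb
    ("interacts", "obl",   "SasA"),   # oblique (prepositional) argument
    ("SasA",      "conj",  "KaiA"),   # coordination with the oblique
    ("SasA",      "conj",  "KaiB"),
]

def interactions(edges):
    # Pattern: X interacts with Y, where Y is the oblique argument,
    # plus anything coordinated with Y.
    subj = next(d for h, r, d in edges if h == "interacts" and r == "nsubj")
    objs = [d for h, r, d in edges if h == "interacts" and r == "obl"]
    objs += [d for h, r, d in edges if r == "conj" and h in objs]
    return [(subj, o) for o in objs]

print(interactions(edges))
# [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```

This is the core of the pattern-based fact extraction the lecture describes for building structured protein-protein interaction databases; real systems run it over parser output rather than hand-built triples.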
dependency [00:29:25] you a bit more about sort of dependency grammars which is what we're going to be [00:29:27] grammars which is what we're going to be using so for dependency syntax that it [00:29:31] using so for dependency syntax that it postulates that you can capture the [00:29:34] postulates that you can capture the structure of a sentence by having these [00:29:38] structure of a sentence by having these sort of asymmetric um dependent [00:29:41] sort of asymmetric um dependent relations which we might just call [00:29:43] relations which we might just call arrows which are going from heads to [00:29:46] arrows which are going from heads to dependents so here the sentence is um [00:29:49] dependents so here the sentence is um bills on ports and immigration were [00:29:51] bills on ports and immigration were submitted by Senator Brownback [00:29:54] submitted by Senator Brownback Republican of Kansas and we're sort of [00:29:56] Republican of Kansas and we're sort of picking out heads um and then we got um [00:30:00] picking out heads um and then we got um things that depend on them that modify [00:30:03] things that depend on them that modify them um yeah so if you're in the um [00:30:08] them um yeah so if you're in the um video audience and you are educated in [00:30:11] video audience and you are educated in the United States and you're over the [00:30:13] the United States and you're over the age of 50 um or if you happen to go to [00:30:17] age of 50 um or if you happen to go to one of those kind of private schools [00:30:19] one of those kind of private schools where they also teach Latin um you might [00:30:23] where they also teach Latin um you might have seen sentence diagramming um so [00:30:26] have seen sentence diagramming um so Reed-Kellogg um sentence diagramming was [00:30:30] Reed-Kellogg um sentence diagramming was something that was actually very [00:30:31] something that was actually very widespread um in American Education um [00:30:35]
widespread um in American Education um which really was it was [00:30:38] which really was it was really dependency grammar a sort of [00:30:40] really dependency grammar a sort of a somewhat quirky form of dependency [00:30:42] a somewhat quirky form of dependency grammar where you had to write lines at [00:30:44] grammar where you had to write lines at different angles and stuff like that but [00:30:47] different angles and stuff like that but basically you're writing sort of heads [00:30:49] basically you're writing sort of heads and their dependents underneath them [00:30:51] and their dependents underneath them with different funny shaped lines um it [00:30:54] with different funny shaped lines um it also was dependency [00:30:56] also was dependency grammar okay um so this is the start of [00:30:59] grammar okay um so this is the start of a dependency grammar but just like the [00:31:02] a dependency grammar but just like the the funny angled lines of sentence [00:31:04] the funny angled lines of sentence diagramming normally people want to add [00:31:07] diagramming normally people want to add some more information than that um and [00:31:10] some more information than that um and so most commonly um that the arrows are [00:31:14] so most commonly um that the arrows are then typed by giving the name of some [00:31:17] then typed by giving the name of some grammatical relation so something can be [00:31:19] grammatical relation so something can be the noun subject or an oblique or an [00:31:25] the noun subject or an oblique or an appositional modifier or a case mark or [00:31:28] appositional modifier or a case mark or things like that um and um I I'm just [00:31:33] things like that um and um I I'm just trying to give you the idea of [00:31:35] trying to give you the idea of dependency grammars I'm not expecting [00:31:37] dependency grammars I'm not expecting you to master all of these names and [00:31:40] you to
master all of these names and ways of doing things um and you know [00:31:44] ways of doing things um and you know there are different systems of deciding [00:31:46] there are different systems of deciding what's heads and dependents and not all [00:31:48] what's heads and dependents and not all the details are important what you [00:31:51] the details are important what you should get into your head is just sort [00:31:53] should get into your head is just sort of the basic idea of what one of these [00:31:55] of the basic idea of what one of these does and some sense of oh it should be [00:31:58] does and some sense of oh it should be at the phrase level it should be [00:32:00] at the phrase level it should be representing what's modifying what so we [00:32:03] representing what's modifying what so we do actually ask some questions um on the [00:32:07] do actually ask some questions um on the assignment and so for the cases like the [00:32:10] assignment and so for the cases like the prepositional phrase what is it [00:32:12] prepositional phrase what is it modifying you should be able to give the [00:32:14] modifying you should be able to give the right answer to [00:32:16] right answer to that okay [00:32:19] that okay um yeah um okay so uh this is just a [00:32:24] um yeah um okay so uh this is just a little bit more um vocabulary so yeah we [00:32:27] little bit more um vocabulary so yeah we have these arrows or dependencies and so I'm [00:32:30] have these arrows or dependencies and so I'm going to say that they connect between a [00:32:32] going to say that they connect between a head and a dependent but sometimes [00:32:34] head and a dependent but sometimes people use other words like governor and [00:32:36] people use other words like governor and modifier and things like that um and so [00:32:40] modifier and things like that um and so dependencies are generally taken and [00:32:43] dependencies are generally taken and we'll be taking them as forming a tree so
[00:32:46] we'll be taking them as forming a tree so you've got something that's connected [00:32:49] you've got something that's connected acyclic and has a single root to it so [00:32:52] acyclic and has a single root to it so our single root is the top of the [00:32:54] our single root is the top of the sentence here [00:32:56] sentence here um so [00:32:58] um so um dependency so although what you see [00:33:02] um dependency so although what you see most often these days either in a [00:33:04] most often these days either in a Linguistics class or when you get taught [00:33:07] Linguistics class or when you get taught CS [00:33:08] CS 103 at Stanford or computer science what [00:33:12] 103 at Stanford or computer science what you see there is normally context free [00:33:15] you see there is normally context free grammars or phrase structure grammars I [00:33:17] grammars or phrase structure grammars I mean really you know it is dependency [00:33:20] mean really you know it is dependency grammars that have the really long [00:33:23] grammars that have the really long history so really the predominant way of [00:33:26] history so really the predominant way of representing the structure of human [00:33:28] representing the structure of human languages throughout human history is [00:33:31] languages throughout human history is dependency grammar um so the linguist um [00:33:34] dependency grammar um so the linguist um heralded as the first dependency [00:33:37] heralded as the first dependency grammarian or really the first person [00:33:39] grammarian or really the first person who tried to write the grammar of a [00:33:42] who tried to write the grammar of a human language period was Panini so [00:33:45] human language period was Panini so Panini was working with Sanskrit um [00:33:48] Panini was working with Sanskrit um Panini lived so long ago that actually [00:33:50] Panini lived so long ago that actually people don't really know when he lived I [00:33:53]
people don't really know when he lived I mean he lived somewhere between about [00:33:55] mean he lived somewhere between about the 4th and 8th century before the [00:33:57] the 4th and 8th century before the Common Era [00:33:58] Common Era but really no one knows when um but you [00:34:01] but really no one knows when um but you know he lived um sort of up in part of [00:34:04] know he lived um sort of up in part of actually what's now Afghanistan um and [00:34:07] actually what's now Afghanistan um and um motivated largely by religious [00:34:11] um motivated largely by religious reasons um he set about developing a [00:34:14] reasons um he set about developing a grammar of Sanskrit and the way he [00:34:16] grammar of Sanskrit and the way he represented the syntax of Sanskrit was [00:34:19] represented the syntax of Sanskrit was using a dependency grammar um so there [00:34:21] using a dependency grammar um so there was a lot of work on grammar in Arabic [00:34:24] was a lot of work on grammar in Arabic in the first millennium they used [00:34:26] in the first millennium they used dependency grammars um in contrast um [00:34:30] dependency grammars um in contrast um the idea of sort of having context free [00:34:33] the idea of sort of having context free grammars that's really really recent so [00:34:35] grammars that's really really recent so the first work on um phrase structure [00:34:38] the first work on um phrase structure grammars dates to the 40s and then was [00:34:41] grammars dates to the 40s and then was sort of um canonicalized by the work of [00:34:44] sort of um canonicalized by the work of Chomsky in the [00:34:46] Chomsky in the 1950s yeah so um a fact for the computer [00:34:51] 1950s yeah so um a fact for the computer science part of people in the audience [00:34:53] science part of people in the audience so computer dear computer scientists if [00:34:56] so computer dear computer scientists if you know about Chomsky computer
[00:34:58] you know about Chomsky computer scientists normally know two things [00:34:59] scientists normally know two things about Chomsky one is they hate on the [00:35:02] about Chomsky one is they hate on the Chomsky hierarchy that they were forced [00:35:04] Chomsky hierarchy that they were forced to learn um in CS 103 or equivalent [00:35:07] to learn um in CS 103 or equivalent classes and the second one is he's a [00:35:10] classes and the second one is he's a very left politician um but um if I only [00:35:13] very left politician um but um if I only deal with the first one of the two now [00:35:16] deal with the first one of the two now um the Chomsky hierarchy was not [00:35:19] um the Chomsky hierarchy was not invented either to torture Elementary [00:35:22] invented either to torture Elementary computer scientists um or um to explain [00:35:26] computer scientists um or um to explain fundamental facts about formal language [00:35:28] fundamental facts about formal language Theory the Chomsky hierarchy was [00:35:30] Theory the Chomsky hierarchy was actually invented in thinking about [00:35:33] actually invented in thinking about human languages because at that time and [00:35:37] human languages because at that time and in stuff that's come more often it was [00:35:40] in stuff that's come more often it was commonly the case that um people were [00:35:44] commonly the case that um people were modeling human languages with um Regular [00:35:48] modeling human languages with um Regular finite so finite State grammar [00:35:51] finite so finite State grammar equivalent mechanisms and Chomsky wanted [00:35:54] equivalent mechanisms and Chomsky wanted to argue that that was a completely [00:35:56] to argue that that was a completely inadequate um formalism to represent um [00:36:00] inadequate um formalism to represent um the complexity of human language and so [00:36:03] the complexity of human language and so it was in the context of arguments about 
[00:36:05] it was in the context of arguments about human language was why he developed um [00:36:07] human language was why he developed um the Chomsky [00:36:09] the Chomsky hierarchy okay um yeah so anyway uh [00:36:12] hierarchy okay um yeah so anyway uh that's enough of the history of that um [00:36:15] that's enough of the history of that um here's my uh picture of part of Panini's [00:36:18] here's my uh picture of part of Panini's grammar but actually or a version of it [00:36:21] grammar but actually or a version of it actually um this is really really [00:36:24] actually um this is really really misleading and because one of the [00:36:26] misleading and because one of the astounding facts about Panini's grammar and [00:36:29] astounding facts about Panini's grammar and part of why no one knows what century he [00:36:31] part of why no one knows what century he lived in was Panini's grammar was composed [00:36:35] lived in was Panini's grammar was composed orally um so this sort of kind of blows [00:36:38] orally um so this sort of kind of blows my mind you know it's it seems you know [00:36:42] my mind you know it's it seems you know um some of um the famous things in the [00:36:45] um some of um the famous things in the West like Homer's works right the [00:36:48] West like Homer's works right the Odyssey and The Iliad right they were [00:36:50] Odyssey and The Iliad right they were originally oral works that were passed [00:36:53] originally oral works that were passed down um in oral form you know you can [00:36:57] down um in oral form you know you can that seems hard to do but you can kind [00:36:59] that seems hard to do but you can kind of believe if you did plays in high [00:37:01] of believe if you did plays in high school or something that someone could [00:37:04] school or something that someone could um memorize the Odyssey perhaps um but [00:37:07] um memorize the Odyssey perhaps um but the idea that people could memorize a [00:37:10] the idea that people could
memorize a grammar of a [00:37:12] grammar of a language passing it down for hundreds of [00:37:15] language passing it down for hundreds of years um kind of blows my mind um but [00:37:18] years um kind of blows my mind um but that's exactly what happened um yeah [00:37:21] that's exactly what happened um yeah with Panini's grammar um so you know really [00:37:25] with Panini's grammar um so you know really although this is sort of an old [00:37:26] although this is sort of an old birchbark manuscript you know that [00:37:28] birchbark manuscript you know that really it probably dates from about a [00:37:31] really it probably dates from about a millennium after Panini um composed um [00:37:34] millennium after Panini um composed um his grammar okay getting back to the um [00:37:37] his grammar okay getting back to the um modern days um yeah so um for things to [00:37:41] modern days um yeah so um for things to know yeah so I mean we don't want you to [00:37:45] know yeah so I mean we don't want you to fixate on the sort of details of [00:37:47] fixate on the sort of details of dependency grammar structure providing [00:37:49] dependency grammar structure providing you have the rough idea but just one [00:37:51] you have the rough idea but just one thing um that you can possibly be [00:37:53] thing um that you can possibly be confused about is you know there people [00:37:57] confused about is you know there people do things in different ways one way in [00:38:00] do things in different ways one way in which they don't agree is even which way [00:38:02] which they don't agree is even which way to draw the arrows so some people draw [00:38:06] to draw the arrows so some people draw arrows um from the head pointing at the [00:38:09] arrows um from the head pointing at the dependents and there are other people [00:38:11] dependents and there are other people who draw the arrows starting at the [00:38:12] who draw the arrows starting at the dependent and pointing back at the
heads. [00:38:15] So modern dependency grammar largely follows the work of Lucien Tesnière, a French linguist. He did the arrows pointing from the head to the dependent, and so that's what I'm doing today, but you'll see both. We sort of said that normally you assume that you have a tree with a single root. It's kind of common, and it works out more easily for the parsing, if you add to a sentence a sort of fake ROOT node, so you know that's going to be the starting point, and it's going to take one dependent, which is the word that's the head of the sentence, and then you're going to work down from there. Okay, so before getting more into doing dependency parsing, I just wanted to take a little detour to tell you about the importance of what happened with the rise of
annotated data in natural [00:39:25] language processing. You know, this is sort of an interesting flip-flop that's occurred: today we're going to go in one direction, and in a later class we'll go in the other direction. So in early natural language processing, people started to see, oh, human languages have structure, so what we should do is start writing rules for the structure of human languages. You know, I started writing a few context-free grammar rules for the structure of English on that early slide, and you could also write dependency grammar structure rules. So people tried to do natural language processing by having rules: grammar rules, dictionaries of parts of speech, and things like that. That gave you parses that, in retrospect, worked out pretty badly, and it worked out pretty badly for a number of reasons. One reason is that although there are these sort of very canonical, clear
[00:40:27] structures in human languages, there's a very long tail of messy stuff where all kinds of weird usages start to emerge in human languages, which sort of means it's just really hard to get coverage with hand-written grammar rules. And that's because humans use language creatively, right? So you can start thinking of some of the things that you've probably come across. I'm probably not very good at young persons' slang usages of grammar these days, but the kind of ones that you might still be familiar with: in Star Wars you have Yoda talk, where you rearrange the sentences but people still understand them, right? So that's changing the word order. And earlier on than that, there was a bit of a fad for putting 'not' at the end of sentences: 'that's a really great idea... not.' And, well, you know,
people learn to [00:41:31] understand that, but it's different to regular grammar, right? So it's really hard to write a full grammar. But the bigger reason, actually, is the problem of ambiguity I talked about, right? If you just write a grammar, well, you know, my sentence with the prepositional phrases had 13 different parses, and you didn't have much reason to choose between them. But if you had information about how often words modify other words, then you could get some statistics and start to predict in which order which things modify other things. And so people wanted to start to be able to do that prediction, which underlies probabilistic or machine learning models. To be able to do that (the earliest antecedents were in the '60s, but really starting in the late '80s and into the '90s) people decided that the way you make progress in
natural language processing, natural language [00:42:32] understanding, is to build annotated data resources. And so all through the '90s and the 2000s, the name of the game for a lot of natural language processing was people building annotated data resources and then building machine learning systems on top using those resources. Now that's kind of gone into reverse and gone away again with large language models, which we'll get to in another week or so. But here's an example: this is the Universal Dependencies treebanks, which I've actually been heavily involved with, and it's a cool resource for all kinds of purposes, because it's actually a wide cross-linguistic database where there are over 100 different languages with sentences parsed with a uniform dependency formalism. So it's actually really good for things like cross-linguistic work and psycholinguistic work. But you know, what
these are is [00:43:30] taking sentences (I think Mamar was a famous goat trainer or something) and putting a dependency structure on them. It's sort of all written there, very squished down, and human beings are producing these dependency structures, and then this is giving us data that we can learn things from, like dependency parsers. And indeed, for what you do on homework 2, this is precisely what you'll be using, data of this sort, to build a dependency parser. It's going to learn that, you know, you have goat trainers and you have famous trainers, and so it'll build up statistics and information to predict what kinds of things are likely. Yeah, so starting off building a treebank like that feels kind of like, oh, this is going to be slow, hard work, and it is actually slow, hard work, but it proved to be a very
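As an aside: Universal Dependencies treebanks of the kind described here are distributed in the CoNLL-U format, one word per line with ten tab-separated fields. Below is a minimal reading sketch that ignores comment lines, multiword tokens, and empty nodes; the toy sentence is my own illustration, not taken from UD.

```python
# Minimal sketch: read (head, dependent) arcs from one CoNLL-U sentence.
# CoNLL-U fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.

def read_arcs(conllu_sentence: str):
    arcs = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # comment line
        fields = line.split("\t")
        if "-" in fields[0] or "." in fields[0]:
            continue  # skip multiword-token ranges and empty nodes
        idx, form = int(fields[0]), fields[1]
        head, deprel = int(fields[6]), fields[7]
        arcs.append((head, idx, deprel, form))  # head index 0 means ROOT
    return arcs

# Toy sentence "I ate fish" (illustrative annotation, not from a real treebank).
example = """1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\tate\teat\tVERB\tVBD\t_\t0\troot\t_\t_
3\tfish\tfish\tNOUN\tNN\t_\t2\tobj\t_\t_"""

for head, dep, rel, form in read_arcs(example):
    print(f"{head} -> {dep} ({rel}: {form})")
```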
effective strategy, because [00:44:34] it gave wonderful reusable resources: once people had done it once, all sorts of people could use it to build parsers, part-of-speech taggers, to do psycholinguistic models and all kinds of things. You'd get the sort of distributional frequency information that's good for machine learning. It also provided one other thing that's crucial: it gave a method to evaluate systems, to say how good they are at producing parses. This may seem kind of comical to you in the modern era of machine learning, but the fact of the matter is, when people did natural language processing in the '50s, '60s, '70s, nobody had evaluation methods. The way you showed people you had a good parser is you ran the program, you said type in a sentence, look, it's worked, it's a really good parser. There was no systematic evaluation of NLP systems
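To make the evaluation idea concrete: the standard metrics for dependency parsers (not named in the lecture at this point, but standard in the field) are unlabeled and labeled attachment score, the fraction of words assigned the correct head, and the correct head plus the correct label. A minimal sketch:

```python
# Sketch of treebank-based parser evaluation: UAS is the fraction of
# words whose predicted head is correct; LAS also requires the correct
# dependency label. Heads/labels are lists indexed by word position.

def uas_las(gold_heads, gold_labels, pred_heads, pred_labels):
    n = len(gold_heads)
    head_ok = sum(g == p for g, p in zip(gold_heads, pred_heads))
    both_ok = sum(
        gh == ph and gl == pl
        for gh, gl, ph, pl in zip(gold_heads, gold_labels, pred_heads, pred_labels)
    )
    return head_ok / n, both_ok / n

# "I ate fish": gold annotation (0 = ROOT) versus a parser that gets
# every head right but mislabels the object of "ate".
gold_h, gold_l = [2, 0, 2], ["nsubj", "root", "obj"]
pred_h, pred_l = [2, 0, 2], ["nsubj", "root", "iobj"]
print(uas_las(gold_h, gold_l, pred_h, pred_l))  # UAS 1.0, LAS 2/3
```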
whatsoever. [00:45:43] So actually saying, look, here's a thousand hand-parsed sentences, let's evaluate how well your parser does on those: that was actually a revolutionary new development that happened at the end of the '80s, but especially in the '90s. Okay, so now that we have all of that knowledge, we're going to want to start building dependency parsers, and so I'm going to show a particular way of doing dependency parsing, which is the one you're going to use in the assignment. But first off, it's worth thinking for a moment: what kind of information should a dependency parser have to make decisions? These are kind of the four factors, the sort of obvious things that are useful for dependency parsing. The first one is thinking of the two words at the ends of the arrow, as to whether they
are plausible, right? [00:46:49] So for 'the discussion of the outstanding issues was completed', to have 'discussion of issues', right, that's a plausible dependency. What's the silly one? To have something like 'the' being a dependent of 'completed' makes no sense at all. So: what words there are involved. The second one is dependency distance. You can have long-distance dependencies that go a long way, but most dependencies are short-distance; you know, a lot of words are depending on their neighboring words at a very short distance, so that's a good preference to have. As well as just the distance, it's somewhat informative knowing what's in between: it's rare for dependencies to span verbs or punctuation. And then there's a final one, which is to think of the valency of heads, and that's how many arguments
they take. [00:48:01] So if you have something like the verb 'broke', well, it probably has something to the left, because there's probably who did the breaking, and it probably has something to the right, because there might be 'the cup' or something like that. But it doesn't have to be that, because it could be 'the cup broke', so you can have something to the left but nothing to the right; but you sort of have to have something to the left. And conversely, you can't have any number of things: you can't just say 'he broke the cup the saucer the dish', right? It doesn't just take lots of arguments like that. So we've got a notion of valency like that. Yeah, there's one other tricky little notion in dependency parsing, which is: normally dependencies kind of nest, like this, and nesting dependencies corresponds to a tree structure as
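The four information sources just listed (which words, distance, intervening material, valency) can be sketched as a hypothetical feature extractor for a candidate arc. This is purely illustrative, my own toy code rather than any model from the lecture or the assignment; a statistical parser would turn features like these into scores.

```python
# Hypothetical sketch: extract the four kinds of evidence for a candidate
# arc head -> dependent. A sentence is a list of (word, POS-tag) pairs.

def arc_features(sentence, head_i, dep_i):
    head_word, _ = sentence[head_i]
    dep_word, _ = sentence[dep_i]
    lo, hi = sorted((head_i, dep_i))
    between = [pos for _, pos in sentence[lo + 1:hi]]
    return {
        # 1. bilexical affinity: which word pair would be linked
        "pair": (head_word, dep_word),
        # 2. dependency distance: most dependencies are short
        "distance": abs(head_i - dep_i),
        # 3. intervening material: arcs rarely span verbs or punctuation
        "spans_verb_or_punct": any(p in ("VERB", "PUNCT") for p in between),
        # 4. direction of attachment, a crude stand-in for valency preferences
        "dep_is_left_of_head": dep_i < head_i,
    }

sent = [("discussion", "NOUN"), ("of", "ADP"), ("issues", "NOUN"),
        ("was", "AUX"), ("completed", "VERB")]
print(arc_features(sent, 0, 2))  # candidate arc: discussion -> issues
```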
you'd have in a context-free [00:49:09] grammar. [Student question, partly inaudible] Yeah, because in a sense when I read the sentence, 'which I thought that the most important discussion'... so, fair enough, I will assert that this is a sentence, and 'discussion' is the subject of the verb 'completed'. And, you know, normally for a sentence we say the main thing in the sentence is its verb, and so that's why the root is heading to 'completed', and the subject of the verb is also an important thing. But the arguments of the verb, like the subject of the verb, the object of the verb if there is one, prepositional-phrase modifiers, they're all taken as dependents of the verb. [Student question] Following up on that: is it not the verb that you start with? So, if you have a sentence with a verb like this, that is always the answer. I mean, some of the details here depend on languages, but
there are languages in which [00:50:41] you don't have to have a verb in a sentence, and you can get things like... I mean, you can do it in sort of very restricted ways in English, right? So if you just say 'easy as pie', there's no verb, and so then you're saying 'easy', the adjective, which is sort of the predicate adjective, is then the head of the sentence. [Student question] Sorry, for a question like 'what is the story', is the 'is'...? We would still look at that as... that is complicated. Some people would say it is, and some people would say it isn't, and in particular in Universal Dependencies we don't actually say that 'is' is the head of the sentence. But I don't want to get too far into this; if you want, you could look more at how things are done. But, you know, I want to fully admit that dependency grammar isn't sort of one uniquely defined theory; people
[00:51:44] have had different ideas of which things to take as the head in various circumstances, and they argue about it; linguists argue about what the right structure is to put over all sorts of sentences. But the fact that people do things different ways doesn't mean that everybody doesn't agree that there are units, there are phrases and modifiers and ambiguities and so on between them. Okay, yeah, so normally we get this sort of nesting that corresponds to what you can build with context-free grammar structure, but sometimes in human languages you get dependencies that don't nest. So you get sentences like 'I'll give a talk tomorrow on neural networks', where actually the 'on neural networks' is modifying the talk, whereas the 'tomorrow' is an argument of 'give', and so you get these crossing dependencies, which are referred to as
non-projective [00:52:48] dependencies. You also get them when you form questions: 'who did Bill buy the coffee from yesterday?' The 'who' is the object of the preposition 'from', but it's been moved out to the front, and so that again gives us non-projectivity. If you think about it, you can still say that you have a dependency tree, but it's got the words in different orders, and so one of the things that you have to cope with for full dependency parsing is dealing with this non-projectivity. But actually we're not going to deal with it in our parsers; we're only going to do projective dependency parsing. Okay, so there are various ways that people do dependency parsing: people have done it by dynamic programming, people have done it using graph algorithms (if I have enough time at the end I might mention that again), and people have done it with
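The projectivity notion can be pinned down in a few lines: a parse is projective exactly when no two arcs cross if you draw them all above the sentence. A minimal sketch, where heads[i] gives the head of word i+1 and 0 stands for the fake ROOT:

```python
# Sketch: a dependency parse is projective iff no two arcs cross when
# drawn above the sentence. heads[i] is the head of word i+1 (0 = ROOT).

def is_projective(heads):
    # Represent each arc as an interval (min, max) over word positions.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (a, b) in arcs:
        for (c, d) in arcs:
            # Two arcs cross if one endpoint of the second lies strictly
            # inside the first interval and the other strictly outside.
            if a < c < b < d:
                return False
    return True

print(is_projective([2, 0, 2]))        # "I ate fish": projective
print(is_projective([2, 0, 4, 2, 3]))  # contains crossing arcs: non-projective
```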
constraint satisfaction methods, [00:53:55] if you saw those in CS221. But the most common way in practice that's emerged has been this transition-based parsing, which is kind of interesting as well and gives a very simple machine learning mechanism, so it makes it good for assignment 2, and that's what we're going to explore here. Okay, so what we do in greedy transition-based parsing... this is where it's unfortunate that only two people in the class have done a compilers class, because a simple form of parsing that's also used in compilers classes is something called shift-reduce parsing, where you start bottom-up and you start putting little units together and build bigger constituents. But if most people haven't seen it, that's not going to be very much help, so I'm going to give you a
concrete example. [00:55:04] So the things to know: we have two data structures (well, we have more than two, I guess) for dealing with the sentence. We have a buffer, which has the words of our input sentence, and then we start building pieces of sentence structure, which we put on a stack. A little trick to know is that for the buffer the top is written to the left, and for the stack the top is written to the right. We take actions which are like shift and reduce actions, and when we take arc-building actions we build up a set of dependency arcs, which are going to be the dependency structure of our sentence. That's all incredibly abstract, so I'm going to show an example, which hopefully will give a bit of the idea. So here's an example: I want to do this very simple example of parsing the sentence 'I ate
fish'. [00:56:13] So the way I do this is: I have my stack, and I start by putting the ROOT symbol on my stack, and then I have in my buffer all the words of the sentence, and so that's the start condition I've written in very small print there. Then for each step of processing I have a choice of three operations. I can either shift, which moves the top word on the buffer onto the stack, or I can do left-arc or right-arc, and these are my two reduce operations that build a little bit of syntactic structure by saying that one word is a dependent of another word, in either a left or a right direction. So here's a sequence of operations I can take. Starting off, the first thing I can do is shift, so then I've moved 'I' onto the stack. I can decide that I want to shift again, and so then I'd take 'ate' and also move it onto the stack, and so I've now got three things
on my stack so at this point you know I can do [00:57:26] stack so at this point you know I can do other things I mean in particular a left [00:57:29] other things I mean in particular a left Arc is going to say well I can take the [00:57:32] Arc is going to say well I can take the top two things on the stack and make the [00:57:37] top two things on the stack and make the uh the thing on the top The Head and the [00:57:40] uh the thing on the top The Head and the thing one down on the stack a dependent [00:57:43] thing one down on the stack a dependent of it so if I do a left Arc operation [00:57:47] of it so if I do a left Arc operation I'm effectively saying that the I is a [00:57:49] I'm effectively saying that the I is a dependent of eight and then I pop both [00:57:52] dependent of eight and then I pop both of then I pop the dependent off the [00:57:55] of then I pop the dependent off the stack back but I add on that I've built [00:57:59] stack back but I add on that I've built um a dependency that I made I a [00:58:01] um a dependency that I made I a dependent of eight I could then do [00:58:04] dependent of eight I could then do another shift operation so I shift fish [00:58:08] another shift operation so I shift fish um from the buffer onto the stack and [00:58:11] um from the buffer onto the stack and then I can do a right Arc which says um [00:58:15] then I can do a right Arc which says um okay I'm going to have fish as a [00:58:17] okay I'm going to have fish as a dependent of eight so then fish [00:58:19] dependent of eight so then fish disappears from the stack and I add in [00:58:22] disappears from the stack and I add in this new dependency saying fishes [00:58:26] this new dependency saying fishes dependent of eight um I then do right [00:58:29] dependent of eight um I then do right Arc again um which is then saying that [00:58:34] Arc again um which is then saying that um eight is a dependent of root so I'm [00:58:37] um eight is a dependent of 
root so I'm left with just root on my stack and I've [00:58:39] left with just root on my stack and I've built a new dependent saying eight is a [00:58:41] built a new dependent saying eight is a dependent of root and at this point I've [00:58:44] dependent of root and at this point I've gone to the finishing condition my [00:58:46] gone to the finishing condition my finishing condition is that my buffer is [00:58:48] finishing condition is that my buffer is empty and my um stack contains just the [00:58:52] empty and my um stack contains just the word root um and so this gives me a [00:58:56] word root um and so this gives me a little step set of operations referred [00:59:00] little step set of operations referred to as the transitions of [00:59:02] to as the transitions of transition-based passing and by making a [00:59:04] transition-based passing and by making a sequence of these different transitions [00:59:07] sequence of these different transitions I can build sentence structure and I've [00:59:10] I can build sentence structure and I've got choices of when to shift and when to [00:59:14] got choices of when to shift and when to reduce and whether to reduce left or [00:59:17] reduce and whether to reduce left or reduce right the arc left Arc right and [00:59:20] reduce right the arc left Arc right and so by making different ones of those [00:59:22] so by making different ones of those choices I could make any structure for [00:59:25] choices I could make any structure for the sentence that I wanted to so you [00:59:28] the sentence that I wanted to so you know if I somehow thought that this [00:59:31] know if I somehow thought that this sentence should have a different [00:59:33] sentence should have a different structure and that I should be the head [00:59:36] structure and that I should be the head and eights are dependent of that and [00:59:38] and eights are dependent of that and fishes are dependent of that well I [00:59:41] fishes are dependent of 
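In code, the shift / left-arc / right-arc mechanics of the walkthrough above can be sketched as follows. This is a hedged illustration, not the assignment's actual code; the hard-coded transition sequence stands in for the classifier (or oracle) that would normally choose each action:

```python
# A minimal arc-standard sketch of the "I ate fish" walkthrough.
# Buffer top is the front of the list; stack top is the end of the list.

def parse(words, transitions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":                   # move buffer front onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":              # second-from-top is dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))  # record (head, dependent)
        elif t == "RIGHT-ARC":             # top is dependent of second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

arcs = parse(["I", "ate", "fish"],
             ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
# arcs == [("ate", "I"), ("ate", "fish"), ("ROOT", "ate")]
```

Choosing a different transition sequence would yield a different set of arcs, which is exactly the point made in the lecture: the choice of operations determines the structure.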
[00:59:43] Well, I could achieve this by making some different choices: I'd now be doing a right-arc operation, so that "ate" would become a dependent of "I" rather than the other way around. So the choices of which operations I take determine the syntactic structure, the set of dependencies that I have built, which are my set of dependencies down here. Now, the set of dependencies I built were exactly the right ones, because at each step I took the right operation.
[01:00:18] And so the essential idea of transition-based parsing, and where it came to the fore, was this. There was a particular guy, and I've got a photo of him somewhere in a bit, I thought: Joakim Nivre, a Swedish NLP person. In the early 2000s he came up with the idea that, rather than doing the kind of dynamic programming and chart parsing that people commonly used to do with parsers, these days we have machine learning, so maybe we could build a fast, efficient parser, and the way we're going to build it is by making this sequence of transitions, and it'll be the job of the machine learning to predict what is the right transition at each point in time. If you do that, then at each point you're dealing with one thing, and so the number of operations you're doing to parse a sentence is linear. This gives a linear-time parsing algorithm, whereas if you've seen context-free grammars and stuff like that in CS 103, and you want to do anything where you're fully considering the parses and structures of context-free grammars, you've then got a cubic-time algorithm, which is much less pleasant to be dealing with. So, for the simplest form of transition-based parsing, you do no search whatsoever: at each step you're just predicting the next transition.
[01:02:02] And so you're doing this sequence of transition predictions as machine-learning operations, and that sequence gives you the parse structure of the sentence. The essential result that Nivre was able to show is that machine learning is good enough that you can do this and get a very accurate parser, despite the fact that it does no search whatsoever; it's just doing predictions in this way.
[01:02:38] Okay, so how did he do it? When he did this, in 2005, that was before neural networks came to the fore, and so the way he was doing it was by using a sort of older-style, symbolic, feature-based machine learning system. He had a big classifier, which might have been a logistic regression classifier, or something else like a support vector machine, and to power that he was using indicator features. The kind of features you'd use are: the word on the top of the stack is the word "good" and its part of speech is adjective; or, the word on the top of the stack is "good" but the word that's second on the stack is the verb "had". Right, you'd get these sorts of combinations of matching functions, and they would be used as features in a machine learning system to predict the parse. But the problem is that once you started building these features that were conjunctions of multiple terms, you ended up with millions and millions of features, because you're putting particular words into features and then combining choices of multiple words. So you had to deal with millions and millions of features, and furthermore, individual features were exceedingly sparse, so that you barely ever saw them: you'd have a feature that only turned up, you know, ten times in a million sentences, because it matched a very precise configuration. So on the one hand, by making these feature conjunctions, parsing got more accurate, and indeed people produced pretty accurate parsers in those days, but the parsers had these sorts of unappealing characteristics.
[01:04:33] Yeah, so before going on further, I should just explain how we evaluate dependency parsers. To evaluate dependency parsers, we're basically assessing: are you getting the dependency arcs (the arrows) you're proposing right? So here is someone's dependency parse of "she saw the video lecture"... well, actually, sorry, that's the gold parse, okay, that's a correct parse. "She saw the video lecture": that's a correct parse, so you can write out what the different dependencies are. Word 1's head is 2, word 2's head is 0, word 3's head is 5, word 4's head is 5, word 5's head is 2.
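As a hedged sketch of how such head/label lists are scored (the relation labels below are hypothetical stand-ins, not read off the slide), attachment accuracy can be computed like this:

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Unlabeled and labeled attachment score over one sentence."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gl == pl
              for gh, gl, ph, pl in zip(gold_heads, gold_labels,
                                        pred_heads, pred_labels)) / n
    return uas, las

# "She saw the video lecture": head index 0 is the root symbol.
gold_heads  = [2, 0, 5, 5, 2]
gold_labels = ["nsubj", "root", "det", "compound", "obj"]  # hypothetical labels
pred_heads  = [2, 0, 4, 5, 2]                 # word 3 attached to the wrong head
pred_labels = ["nsubj", "root", "det", "amod", "obl"]      # two labels also wrong

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
# uas == 0.8 and las == 0.4, matching the 80% / 40% figures in the lecture
```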
[01:05:25] So these pairs of numbers represent our dependencies. Then, if someone proposes a parse of the sentence, you can literally say, okay, which of these did they get right? So they didn't get this one right, but they got the rest of them right, so their accuracy is 80%. Sometimes people just assess the arcs unlabeled, and that's referred to as unlabeled dependency accuracy. But sometimes people also want to label them, with subject, determiner, object, etc., and ask: are you also getting the labels right? In this case only two of the five labels are right, so the labeled accuracy of the dependency parse is 40%.
[01:06:17] Okay, so that was sort of what people did until the mid-2010s. And, as I sort of already started saying, the problems with indicator features were: they were sparse, so you didn't see them often; and they were incomplete, because there were some words and combinations you'd seen, and some you just didn't see in the training data, so you're missing features. But the final problem is that just computing all those symbolic features was expensive. It turns out that if you did a runtime analysis, most of the time in parsing wasn't spent making the machine-learning decisions; it was simply spent computing the features that you put into this dependency parser. So as neural nets started to show that they were successful for things, that suggested that maybe you could build a better dependency parser by using a neural, transition-based dependency parser, which would benefit from the kind of dense and compact feature-vector representations that we've already started to see. And so that's what started to be explored, and in particular, someone who was then a PhD student of mine, and was actually head TA of 224n twice in the earlier days, built a neural transition-based dependency parser and showed the success of this method.
[01:07:56] So on this slide there's Nivre's transition-based dependency parser; people had also explored other methods of dependency parsing, so these were two graph-based dependency parsers. Essentially, among the symbolic-feature machine learning methods, Nivre's parser was really fast, because it was using this linear-time, transition-based parsing idea, whereas the graph-based dependency parsers were way, way slower, about 50 times slower, but they were slightly more accurate; you can see here that they're getting a bit better numbers. So essentially what she was able to show was that you could build something that was basically as accurate as the best known graph-based dependency parsers, but fast like other transition-based parsers. Indeed, you might think, oh, now I've got real numbers and matrices and stuff, surely that should be slowing me down; the reality was that the symbolic models spent so much time on feature computation that you could actually make the parser faster at the same time by using a neural network.
[01:09:16] Okay, so how did that work? Well, we've already seen word embeddings, so it's going to exploit word embeddings. It can use word representations, and that has the advantage that even if you haven't seen particular words in particular configurations, you've seen similar words, so it can exploit what's likely in terms of word similarity. But it went a bit further than that, because why only have distributed representations of words? We also have parts of speech, and although I sort of said just noun, verb, adjective, most actual part-of-speech systems in NLP are much more fine-grained.
[01:10:01] They have different parts of speech for plural nouns versus singular nouns, so those are different symbols, but they're very similar to each other, so we might give them distributed representations too, so that they're also close to each other. And the same goes for the types of our labels for dependencies: some of them are pretty closely related as well. So all of these were being given distributed representations. And then, to represent the state of the dependency parser for predicting transitions, you had the same kind of stack and buffer, and you take the key elements of the stack and the buffer, which are essentially the first thing on the buffer (the word that you would be shifting if you're going to do a shift) and the two things at the top of the stack (the things that you're considering combining if you're doing either a left-arc or a right-arc).
[01:10:58] For those, you're going to take the distributed representations of the words and their parts of speech, and also, with a bit more complexity, of dependencies you've already constructed, if something on the stack is already involved in a dependency. We take each of those distributed representations and just concatenate them together to produce a big vector, in the same way we were concatenating together the five words in the last class for predicting whether something was a location or not, and then we feed that into our neural network. So our input layer is our concatenated distributed representations; we put that through a hidden layer, which is, like we were talking about last time, Wx + b put through a ReLU nonlinearity; and then above that we put another matrix multiplication, so we've got a second layer of the neural network, plus b2; and we take the output of that and put it through a softmax, which gives a probability distribution over whether to do a shift, a left-arc, or a right-arc operation. And the other way that this crucially gave us more power is that other people's dependency parsers were still using linear classifiers, things like support vector machines or logistic regression, whereas we had a deep neural network that gave us a nonlinear classifier, and that's why we could be more accurate than previous transition-based parsers. So this essentially showed that you could build a very accurate neural dependency parser, one that outperformed the symbolic, probabilistic parsers.
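The architecture just described (concatenated embeddings, a ReLU hidden layer, a second affine layer, and a softmax over the three transitions) can be sketched with NumPy. The dimensions and random weights below are illustrative guesses, not the parser's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 48 stack/buffer elements (words + POS tags + dependency
# labels), each with a 50-dim embedding, all concatenated into one big vector.
d_input, d_hidden, n_transitions = 48 * 50, 200, 3   # shift, left-arc, right-arc

W1, b1 = rng.normal(0, 0.01, (d_hidden, d_input)), np.zeros(d_hidden)
W2, b2 = rng.normal(0, 0.01, (n_transitions, d_hidden)), np.zeros(n_transitions)

def predict_transition(x):
    """x: concatenated distributed representations of the parser state."""
    h = np.maximum(0.0, W1 @ x + b1)       # hidden layer: ReLU(Wx + b)
    logits = W2 @ h + b2                   # second layer: W2 h + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over the 3 transitions
    return p

x = rng.normal(size=d_input)               # stand-in for real state features
p = predict_transition(x)                  # probability of shift/left-arc/right-arc
```

In a real parser, the transition with the highest probability would be executed, the stack and buffer updated, and the loop repeated until the finish condition.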
[01:13:05] It was basically as good as any other dependency parser that was known. So, back a decade or so ago, this was a big hit; people got very excited about it. The people at Google got very excited about it, because this gave a scalable way (remember, it's linear time) in which you could efficiently go off and parse the entire web. So they did some further work on taking that model and improving it: they made a deeper neural network version, with bigger vectors and better-tuned hyperparameters, and they added on beam search. I've just presented the greedy version, where you always just immediately make the best choice, but you can improve these parsers by doing some amount of search; that does help. And so they pushed the numbers up: rather than the kind of 92 UAS here, they got it to, you know, 94.6.
[01:14:11] And, I mean, you're probably all too young to remember this, but really, at the time, in 2016, Google did their kind of typical big PR splash for dependency parsing, which kind of blew my mind, since I didn't ever think that anyone was really going to be writing articles in Wired and VentureBeat and those kinds of tech blogs. But, you know, Google had it all over the place as the world's most accurate parser, and they gave it a silly name, Parsey McParseface, which really worked well for getting lots of media pickup. So that was then a very successful parser.
[01:15:01] I've still got a couple of minutes left, so let me just do the last three slides to show you another way of doing things, which is actually also a powerful parsing method that is commonly used. So, that was transition-based parsing, and that's what you'll use in the assignment.
way of doing things with dependencies and paing can be done neural is what's [01:15:26] and paing can be done neural is what's referred to as graph-based dependency [01:15:28] referred to as graph-based dependency paes and in graph-based dependency paes [01:15:32] paes and in graph-based dependency paes what you do is um for each word you sort [01:15:38] what you do is um for each word you sort of ask for each word what am I a [01:15:41] of ask for each word what am I a dependent of right so if the sentence is [01:15:45] dependent of right so if the sentence is the big cat sat each word for example [01:15:48] the big cat sat each word for example big has to be a dependent of one of the [01:15:51] big has to be a dependent of one of the other four words in this sentence [01:15:53] other four words in this sentence including this possibility of root so we [01:15:56] including this possibility of root so we ask am I dependent of that am I [01:15:58] ask am I dependent of that am I dependent of root am I dependent of cat [01:16:01] dependent of root am I dependent of cat am I dependent of sat and we want to [01:16:03] am I dependent of sat and we want to score each of those possibilities and so [01:16:06] score each of those possibilities and so hopefully we decide the most likely one [01:16:09] hopefully we decide the most likely one is the Bigg as a dependent of cat and [01:16:12] is the Bigg as a dependent of cat and then we're going to do the same for [01:16:14] then we're going to do the same for every other word so you know sat could [01:16:16] every other word so you know sat could be a dependent of any of these words and [01:16:19] be a dependent of any of these words and so we could start asking okay which of [01:16:22] so we could start asking okay which of these words is it most likely a [01:16:25] these words is it most likely a dependent [01:16:27] dependent of uh beat to sat cat to sat um sorry [01:16:31] of uh beat to sat cat to sat um sorry that's 
unreadable now but hopefully we [01:16:34] that's unreadable now but hopefully we decide um that sat most likely as the [01:16:37] decide um that sat most likely as the verb is a dependent of root so we sort [01:16:41] verb is a dependent of root so we sort of scoring the N squared possible you [01:16:44] of scoring the N squared possible you know dependencies of the sentence and [01:16:47] know dependencies of the sentence and each one is given a score and then once [01:16:50] each one is given a score and then once we've done that our job is let me go to [01:16:54] we've done that our job is let me go to this one cleaner okay we've decided the [01:16:56] this one cleaner okay we've decided the good one there and so we we're going to [01:16:58] good one there and so we we're going to do this using some of the same features [01:17:01] do this using some of the same features we talked about looking at the words at [01:17:03] we talked about looking at the words at each end looking at what occurs between [01:17:05] each end looking at what occurs between them looking at what occurs around them [01:17:08] them looking at what occurs around them um thinking about um things um and then [01:17:12] um thinking about um things um and then once we've done that the only other [01:17:14] once we've done that the only other thing that's a constraint is well we [01:17:16] thing that's a constraint is well we want the dependencies to form a tree um [01:17:20] want the dependencies to form a tree um so that we need to do um something like [01:17:23] so that we need to do um something like a minimum spanning tree algorithm to [01:17:26] a minimum spanning tree algorithm to sort of find the minimum cost tree [01:17:28] sort of find the minimum cost tree because we don't want to find a solution [01:17:31] because we don't want to find a solution where there are Cycles or the parts of [01:17:34] where there are Cycles or the parts of the sentence end up disconnected with [01:17:35] 
the sentence end up disconnected with each other um and so that's graph-based [01:17:38] each other um and so that's graph-based dependency paes and so just as in the [01:17:42] dependency paes and so just as in the older symbolic paing days where the [01:17:44] older symbolic paing days where the graph based dependency paes were more [01:17:47] graph based dependency paes were more accurate than the transition based paes [01:17:50] accurate than the transition based paes um that we then started doing some work [01:17:53] um that we then started doing some work on neural graph based depend dependency [01:17:55] on neural graph based depend dependency paring and so here's our neurog graph [01:17:57] paring and so here's our neurog graph based dependency paring um which was [01:18:00] based dependency paring um which was then a bit over a percent more accurate [01:18:04] then a bit over a percent more accurate than pzy mpaz face the world's best um [01:18:07] than pzy mpaz face the world's best um dependency paer um so um so that got us [01:18:11] dependency paer um so um so that got us to 2017 I mean obviously this is still a [01:18:14] to 2017 I mean obviously this is still a few years ago but to get further into [01:18:17] few years ago but to get further into the latest um paring Stories We then [01:18:20] the latest um paring Stories We then need to sort of get into the ER of large [01:18:22] need to sort of get into the ER of large language models which I'm not doing [01:18:24] language models which I'm not doing today um but it's this neural graph [01:18:26] today um but it's this neural graph based dependency paa um that's in um [01:18:30] based dependency paa um that's in um stanza our open- source um paing [01:18:33] stanza our open- source um paing software that's available and that you [01:18:35] software that's available and that you can see it's using this algorithm as the [01:18:37] can see it's using this algorithm as the more accurate one okay so now 
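As a concrete illustration of the scoring idea above, here is a toy NumPy sketch for "the big cat sat". The arc scores are hand-made for this example (a real graph-based parser computes all n² scores with a neural network), and the decoding shown is the naive greedy version; as the lecture notes, a real parser instead runs a maximum spanning tree algorithm such as Chu-Liu/Edmonds so the result is guaranteed to be a tree:

```python
import numpy as np

words = ["ROOT", "the", "big", "cat", "sat"]
n = len(words)

# scores[d, h] = score that word d is a dependent of head h.
# Hand-set here purely for illustration, to match the lecture's example.
scores = np.full((n, n), -np.inf)
for d in range(1, n):            # ROOT (index 0) is never a dependent
    for h in range(n):
        if h != d:
            scores[d, h] = 0.0   # baseline score for every candidate arc
scores[1, 3] = 5.0               # "the" <- "cat"
scores[2, 3] = 6.0               # "big" <- "cat"
scores[3, 4] = 7.0               # "cat" <- "sat"
scores[4, 0] = 8.0               # "sat" <- ROOT

# Greedy decoding: each word independently picks its best-scoring head.
# (A real parser runs an MST algorithm over these same n^2 scores, so the
# chosen arcs cannot contain cycles or leave the sentence disconnected.)
heads = {words[d]: words[int(np.argmax(scores[d]))] for d in range(1, n)}
print(heads)  # {'the': 'cat', 'big': 'cat', 'cat': 'sat', 'sat': 'ROOT'}
```

Greedy decoding happens to produce a tree on this toy example, but in general it can produce cycles, which is exactly why the minimum spanning tree step is needed.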
[01:18:40] Okay, so now you hopefully know everything about syntactic structure, constituency and dependency parsing, and are fully qualified to do assignment two, so good luck with that. Thanks.

================================================================================
LECTURE 005
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
Source: https://www.youtube.com/watch?v=fyc0Jzr74y4
---
Transcript

[00:00:06] Okay, let me get started for today. For today, first of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including a couple of the concepts that turn up in assignment two. Then the bulk of today is going to be moving on to introducing what language models are, and after introducing language models, we're going to introduce a new kind of neural network, which is one way to build language models: recurrent neural networks. They're an important thing to know about, and we use them in assignment three.
[00:00:52] But they're certainly not the only way to build language models; in fact, probably a lot of you already know that there's this other kind of neural network called Transformers, and we'll get on to those after we've done recurrent neural nets. Then I'll talk a bit about problems with recurrent neural networks, and, if I have time, I'll get onto the recap.

[00:01:15] Before getting into the content of the class, I thought I could just spend a minute giving you the stats of who is in CS224N. Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess. So the four big groups are the computer science undergrads, the computer science grads, the undeclared undergraduates, and the NDO grads — that's a large portion of the SCPD students, though some of them are under computer science grads. So that makes up about 60% of the audience, and if you're not in one of those four big groups, you're in the other 40%, and everybody is somewhere. There are lots of other interesting groups down here. The bright orange down here, that's where the math and physics PhDs are. And up here — interestingly, we now have more statistics grad students than there are Symbolic Systems undergrads; it didn't used to be that way around in NLP classes. And one of my favorite groups, the little magenta group down here — these are the humanities undergrads. Yay, humanities undergrads! In terms of years, it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave freshmen — are any brave freshmen here today? [Laughter] Yeah, okay, welcome.
[00:02:58] So, modern neural networks, especially language models, are enormous. This chart's sort of out of date because it only goes up to 2022, but it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate, they're clearly huge models, which have over 100 billion parameters. And so large — and then deep, in terms of very many layers — neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models, but to start off with something simpler, I did just want to key you in for a few minutes to a little bit of history.

[00:04:01] So the last time neural nets were popular was in the 80s and 90s, and that was when people worked out the backpropagation algorithm — Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at — and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer: you had the input, the hidden layer, and the output, and that's all there was. And the reason for that was that for a very, very long time, people couldn't really get things to work with more hidden layers. That only started to change in the resurgence of what often got called deep learning — but anyway, back to neural nets — which started around 2006. And this was one of the influential papers at the time, "Greedy Layer-Wise Training of Deep Networks" by Yoshua Bengio and colleagues, and right at the beginning of that paper they observed the problem:
[00:05:13] "However, until recently it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers" — Gerry Tesauro, cited there, actually worked very early on applying neural networks — "as this is a negative result, it has not been much reported in the machine learning literature." So really, although people had neural networks and backpropagation, and recurrent neural networks, which we're going to talk about today, for a very long period of time — 15 years or so — things seemed completely stuck: although in theory it seemed like deep neural networks should be promising, in practice they didn't work. And so it really then took some new developments that happened in the late 2000s decade, and then more profoundly in the 2010s decade, to actually figure out how we could have deep neural networks that actually worked — working far better than the shallow neural networks, and leading into the networks that we have today. And we're going to be starting to talk about some of those things in this class and in coming classes.

[00:06:46] And I think the tendency, when you see the things that got neural networks to work much better — the natural reaction is to sort of shrug and be underwhelmed and think, oh, is this all there is to it? This doesn't exactly seem like difficult science. And in some sense that's true: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks turned things around from a field that was sort of stuck for 15 years.
[00:07:32] The field had been going nowhere for 15 years, and nearly everyone had abandoned it because of that; then things suddenly turned around, and there was the ability to train these deeper neural networks, which behaved amazingly better as machine learning systems than the things that had preceded them and dominated in the intervening time. So that took a lot of time. So what are these things? One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. Regularization is the idea that, beyond just having a loss that we want to minimize in terms of describing the data, we want to in some other ways manipulate what parameters we learn, so that our models work better. And so normally we have some more complex loss function that does some regularization. The most common way of doing this is what's called L2 loss, where you add on a parameter-squared term at the end, and this regularization says it would be kind of good to find a model with small parameter weights — you should be finding the smallest parameter weights that will explain your data well.

[00:09:01] There's a lot you can say about regularization; these kinds of losses get talked about a lot more in other classes, like CS229 Machine Learning and the machine learning theory class, so I'm not going to say very much about it. But I do just want to put in one note that's very relevant to what's happened in recent neural networks work. So the classic view of regularization was that we needed this kind of regularization to prevent our networks from overfitting — meaning that they would do a very good job of modeling the training data, but then they would generalize badly to new data they were shown.
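To make the L2 idea concrete, here is a minimal sketch, assuming a mean-squared-error data loss as a stand-in; the function name and `lam` (the regularization strength λ) are our own labels, not anything from the lecture:

```python
import numpy as np

def l2_regularized_loss(theta, X, y, lam):
    """Data loss (mean squared error here) plus lam * sum(theta_k^2)."""
    data_loss = np.mean((X @ theta - y) ** 2)
    reg = lam * np.sum(theta ** 2)   # the added parameter-squared term
    return data_loss + reg

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
theta = np.array([0.1, -0.2, 0.3])

# The penalty depends only on the weights, so turning lambda on adds
# exactly lam * sum(theta^2) on top of the plain data loss:
plain = l2_regularized_loss(theta, X, y, lam=0.0)
penalised = l2_regularized_loss(theta, X, y, lam=0.01)
print(penalised - plain)  # ~0.0014 = 0.01 * (0.01 + 0.04 + 0.09)
```

Because the penalty grows with the squared weights, gradient descent on this combined loss is pulled toward the smallest weights that still explain the data well, which is exactly the behavior described above.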
[00:09:54] So the picture that you got shown was this: as you train on some training data, your error necessarily goes down. However, after some point, you start learning specific properties of things that happen to turn up in those training examples — you're learning things that are only good for the training examples, so they won't generalize well to the different pieces of data you see at test time. So if you have a separate validation set, or a final test set, and you traced out the error or loss on that validation or test set, after some point it would start to go up again — there's a quirk in my bad PowerPoint; it's just meant to go up. And the fact that it goes up means you have overfit your training data, and making the parameters numerically small is meant to lessen the extent to which you overfit on your training data.

[00:11:04] This is not a picture that modern neural network people believe at all. Instead, the picture has changed like this: we don't believe that overfitting exists anymore, but what we are concerned about is models that will generalize well to different data. In classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate those parameters well, and so you'd just have all of this noisy mess. But what's actually been found is that, yes, it's true you can't estimate the numbers well, but what you get is a kind of interesting averaging function from all these myriad numbers. And if you do it right, what happens is that as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only will your training loss continue to go down, very infinitesimally, but your validation loss will go down as well.

[00:12:28] And so, on huge networks these days, we train our models so that they overfit to the training data almost completely. If you train a huge network now on a training set, you can essentially train it to get zero loss — maybe it's 0.007 loss or something, but essentially zero — because you've got such rich models that you can perfectly fit, memorize, the entire training set. Now, classically, that would have been seen as a disaster, because you've overfit the training data; with modern large neural networks, it's not seen as a disaster, because, providing you've done regularization well, your model will also generalize well to different data.
[00:13:22] However, the flip side of that is that normally this kind of L2 regularization, or similar ones like L1 regularization, isn't strong enough to achieve that effect, and so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout. This is one of the things that's on the assignment, and at this point I should apologize or something, because the way dropout is presented here is sort of the original formulation, while the way dropout is presented on the assignment is the way it's now normally done in deep learning packages. So there are a couple of details that vary a bit; let me just present the main idea here and not worry too much about the details of the math.

[00:14:15] The idea of dropout is: at training time, every time you are doing a piece of training with an example, inside the middle layers of the neural network you're just going to throw away some of the inputs. Technically, the way you do this is that you have a random mask of zeros and ones that you sample each time; you take the Hadamard product of that with the data, so some of the data items go to zero, and you have a different mask each time — so for the next example, I've now masked out something different. So you're just randomly throwing away the inputs, and the effect of this is that you're training the model so that it has to be robust, work well, and make as much use of every input as it can. It can't decide to be extremely reliant on, say, component 17 of the vector, because sometimes that's just going to randomly disappear; if there are other features you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time you randomly delete things; at test time — sort of for efficiency, but also for quality of the answer — you don't delete anything. You keep all of your weights, but you rescale things to make up for the fact that you used to be dropping things.

[00:15:52] Okay, so there are several ways you can think of explaining this. One motivation that's often given is that this prevents feature co-adaptation: rather than the model being able to learn complex functions like "features 7, 8, and 11 together help me predict this", it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is that there's been a lot of work on model ensembles, where you can mix together different models and improve your results.
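Here is a minimal sketch of the mask-and-rescale mechanics just described. It implements the "inverted" variant that the lecture says modern packages (and the assignment) use — rescale by 1/(1−p) at training time so that test time is a no-op — rather than the original formulation, which rescales at test time instead; the function signature is our own:

```python
import numpy as np

def dropout(h, p_drop, train, rng):
    """Inverted dropout on a hidden-layer activation vector h."""
    if not train:
        return h                        # test time: keep everything
    # Fresh random 0/1 mask each call; the Hadamard product zeroes out
    # each unit independently with probability p_drop.
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)    # rescale so the expected value matches

rng = np.random.default_rng(42)
h = np.ones(10)
out = dropout(h, p_drop=0.5, train=True, rng=rng)
print(out)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Because a fresh mask is sampled on every call, each training example effectively trains a different thinned sub-network, which is the ensemble-of-sub-networks view discussed next.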
results if you're training with Dropout it's kind of like you're [00:16:33] with Dropout it's kind of like you're training with a huge model Ensemble [00:16:35] training with a huge model Ensemble because you're training with the [00:16:37] because you're training with the Ensemble of the power set the [00:16:39] Ensemble of the power set the exponential number of every possible [00:16:42] exponential number of every possible Dropout of features all at once and that [00:16:45] Dropout of features all at once and that gives you a a very good model um so [00:16:48] gives you a a very good model um so there are different ways of thinking [00:16:50] there are different ways of thinking about it I mean if you've seen na bays [00:16:53] about it I mean if you've seen na bays and logistic regression models before [00:16:56] and logistic regression models before you know I kind of think a nice way to [00:16:58] you know I kind of think a nice way to think of it is that it gives a sort of a [00:17:00] think of it is that it gives a sort of a middle ground between the two because [00:17:02] middle ground between the two because for naive based models you're waiting [00:17:04] for naive based models you're waiting each feature independently just based on [00:17:07] each feature independently just based on the data statistics doesn't matter what [00:17:09] the data statistics doesn't matter what other features are there in a logistic [00:17:11] other features are there in a logistic regression weights are set in the [00:17:13] regression weights are set in the context of all the other features and [00:17:17] context of all the other features and with Dropout you're somewhere in between [00:17:19] with Dropout you're somewhere in between you're seeing the weights in the context [00:17:20] you're seeing the weights in the context of some of the other features but [00:17:22] of some of the other features but different ones will disappear at [00:17:23] different ones will 
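The mask-and-rescale procedure described above can be sketched in NumPy (a minimal illustration, not the course's assignment code; `p_drop`, the drop probability, is an assumed name):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    # Training time: sample a fresh 0/1 mask and take the elementwise
    # (Hadamard) product with the activations, zeroing some of them.
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask

def dropout_test(h, p_drop=0.5):
    # Test time: keep every activation, but rescale by the keep
    # probability so the expected magnitude matches training.
    return h * (1.0 - p_drop)

h = np.ones(8)
print(dropout_train(h))  # some entries zeroed; different on each call
print(dropout_test(h))   # all entries kept, each scaled to 0.5
```

Modern libraries usually implement "inverted" dropout instead, scaling by 1/(1 − p) at training time so that test time needs no change, but the scheme above is the one described here.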
[00:17:26] But following work that was done at Stanford by Stefan Wager and others, these days people generally regard dropout as a form of feature-dependent regularization, and he shows some theoretical results as to why to think of it that way.

[00:17:43] Okay, I think we've implicitly seen this one, but vectorization is the idea: no for-loops; always use vectors, matrices, and tensors. The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices, and tensors. If you're writing for-loops in any language, but especially in Python, things run really slowly; if you can do things with vectors and matrices, even on CPU, things run at least an order of magnitude faster. And what everyone really wants to do in deep learning is run things on GPUs, or sometimes now neural processing units.
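The speed gap being described is easy to see for yourself; here is a rough, machine-dependent comparison (the timings are illustrative, not from the lecture):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# Slow path: one Python-level iteration per element.
t0 = time.perf_counter()
total = 0.0
for i in range(len(x)):
    total += x[i] * y[i]
t_loop = time.perf_counter() - t0

# Fast path: a single vectorized call into optimized native code.
t0 = time.perf_counter()
total_vec = x @ y
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.5f}s")
```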
Then you're getting two or three orders of magnitude of speedup. [00:18:38] So do always think: I should be doing things with vectors and matrices. If I'm writing a for-loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake, and I should be working out how to do things with vectors and matrices. It's the same for something like dropout: you don't want to write a for-loop that goes through all the positions and sets some of them to zero; you want to use a vector operation with your mask.

[00:19:14] Two more, I think. Parameter initialization: this one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to some random numbers, and the reason for this is
that if we just start with our matrices all zero, or some other constant, we normally have symmetry. It's sort of like starting at the saddle point in this picture: it's symmetric left and right, and forward and backward, so you don't know which way to go and you might just get stuck in one place. [00:20:15] Normally, the way to think about it is that the operations you're applying to all the elements in the matrix are the same, so rather than having a whole vector of features, if all of them start with the same value, it's as if you only have one feature and a lot of copies of it. So to get learning started and have things work well, we almost always want to set all the weights to very small random numbers.
[00:20:49] And when I say very small, we want them in a range where they don't disappear to zero if we make them a bit smaller, and they don't start blowing up into huge numbers when we multiply them by things. Doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods, with a basis in thinking through what happens once you do matrix multiplies, that people had worked out and often used. One of these was Xavier initialization, which works out what the variance of your uniform distribution should be based on the number of inputs and outputs of a layer, and things like that. I think we still use it to initialize things in assignment two.
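A minimal sketch of the Xavier-style scheme just mentioned, assuming the common uniform variant whose bound depends on a layer's fan-in and fan-out (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Bound chosen from fan-in and fan-out so that activations neither
    # shrink toward zero nor blow up across repeated matrix multiplies.
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))

W = xavier_uniform(512, 256)  # small random values, symmetric about 0
```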
But we'll see later that these details go away, because people have come up with clever methods, in particular layer normalization, which largely obviates the need to be so careful about the initialization; you still need to initialize things to something, though.

[00:21:59] Okay, then the final one, which also appears in the second assignment and which I just want to say a word about: optimizers. We talked in class about stochastic gradient descent and did the basic equations for it, and to a first approximation there's nothing wrong with stochastic gradient descent; if you fiddle around enough, you can usually get SGD to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, on having the right step size,
and often you have to have a learning-rate schedule with decreasing step sizes, and various other complications. [00:22:46] So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem necessary to get them to learn well; at any rate, they give you a lot of margin of safety, since they're much less dependent on your setting of the hyperparameters. The idea of all the most commonly used methods is that for each parameter they accumulate a measure of what the gradient has been in the past, so they have some idea of the scale of the gradient, the slope, for that particular parameter, and they use that to decide how big a step to take at each time step. The simplest such method is called AdaGrad; if you know John Duchi,
he was one of the co-inventors of it. It's simple and nice enough, but it tends to stall early, so people came up with different methods. Adam is the one that's on assignment two, and it's a really good, safe place to start. But in a way our word vectors have a special property because of their sparsity: you update them only sparsely, because particular words turn up only occasionally, so people have come up with optimizers that have special properties for things like word vectors, and the ones with a "W" at the end can sometimes be good to try. And then there's a whole family of extra ideas that people have used to improve optimizers; if you want to learn about those, you can go off and do an optimization class like convex optimization, where there are ideas like momentum and Nesterov acceleration.
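The per-parameter idea described here, accumulating a history of gradient magnitudes and scaling each parameter's step by it, is the core of AdaGrad; below is a toy sketch on a simple quadratic (this is not the assignment's Adam implementation, which adds momentum-style averaging on top):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients per parameter, then scale each
    # parameter's step by the inverse square root of its own history.
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Minimize f(w) = sum(w^2), whose gradient is 2w.
w = np.array([1.0, -2.0])
accum = np.zeros_like(w)
for _ in range(200):
    w, accum = adagrad_step(w, 2 * w, accum)
print(w)  # both entries have shrunk toward zero
```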
People also variously try all of those things, but Adam is a good name to remember if you remember nothing else.

[00:24:50] Okay, that took longer than I hoped, but I'll get on now to language models. In some sense "language model" is just two English words, but in NLP when we say language models we mean it as a technical term with a particular meaning. The idea of a language model is something that can predict what word is going to come next, or, more precisely, that puts a probability distribution over what words come next. So: "the students opened their..." What words are likely to come next?

[00:25:37] Bags, laptops, notebooks... yeah, I have some of those at least. Right, so these are kind of likely words, and if on top of those we put a probability on
each one, then we have a language model. [00:26:00] So formally, we've got a context of preceding items, and we're putting a probability distribution over the next item, which means that the sum of these estimates over the items in the vocabulary will be one. If we've defined a P like this, which predicts probabilities of next words, that is called a language model.

[00:26:26] As it says here, an alternative way you can think of a language model is as a system that assigns a probability to a piece of text: a language model can take any piece of text and give it a probability. The reason we can do that is the chain rule. Say I want to know the probability of some stretch of text; given my previous definition of a language model, easy, I can do that: the probability of x1 with a null preceding context, times the probability of x2 given x1, and so on.
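The chain-rule scoring just described can be sketched with toy numbers (the conditional probabilities here are made up purely for illustration):

```python
import math

# Hypothetical next-word probabilities p(word | context).
cond_p = {
    ((), "the"): 0.5,
    (("the",), "students"): 0.2,
    (("the", "students"), "opened"): 0.3,
    (("the", "students", "opened"), "their"): 0.4,
}

def sequence_prob(words):
    # Chain rule: P(x1..xT) = product over t of P(x_t | x_1..x_{t-1}).
    p = 1.0
    for t, w in enumerate(words):
        p *= cond_p[(tuple(words[:t]), w)]
    return p

print(sequence_prob(["the", "students", "opened", "their"]))
# 0.5 * 0.2 * 0.3 * 0.4 ≈ 0.012
```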
[00:27:10] I can do this chain-rule decomposition, and then the terms of that decomposition are precisely what the language model, as I defined it previously, provides.

[00:27:22] Okay, so language models are an essential technology for NLP: just about everywhere, from the simplest places forward, where people do things with human language and computers, people use language models. In particular, they weren't something that got invented in 2022 with ChatGPT; language models have been central to NLP at least since the 80s, and the idea of them goes back to at least the 50s. Any time you're typing on your phone and it's making suggestions of next words, regardless of whether you like those suggestions or not, those suggestions are being generated by a language model, traditionally a compact, not very good language model, so it
can run quickly and with very little memory in your keyboard application. If you go on Google and start typing some stuff, and it suggests things that could come after it to complete your query, well, again, that's being generated by a language model.

[00:28:34] So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling; this is roughly how language models were built from 1975 until, effectively, around about 2012. [00:28:58] We want to put probabilities on these sequences, and the way we're going to do it is to build what's called an n-gram language model. This means we're going to look at short word subsequences and use them to predict, where n is a variable describing how short the word
sequences are that we're going to use to predict. If we just look at probabilities of individual words, we have a unigram language model; probabilities of pairs of words, a bigram language model; probabilities of three words, trigram language models; and probabilities of more than three words get called 4-gram, 5-gram, 6-gram language models. [00:29:48] For people with a Classics education this is horrific, of course; in particular, not even the first ones are correct, because "gram" is a Greek root, so it should really have Greek numbers in front: you should have monograms and digrams. And actually, the first person who introduced the idea of n-gram models was Claude Shannon, when he was working out information theory, the same guy who did cross-entropy and all of that, and if you look at his 1951 paper, he uses
digrams. But the idea died about there, and this is what everyone says in practice; it's kind of cute, I like it, a nice practical notation. [00:30:36] So to build these models, the idea is: we're just going to count how often different n-grams appear in text and use those counts to build our probability estimates. In particular, our trick is that we make a Markov assumption, so that if we're predicting the next word based on a long context, we say: tell you what, we're not going to use all of it; we're only going to use the most recent n − 1 words. We have this big context and we throw most of it away. And if we're predicting word x_(t+1) based simply on the preceding n − 1 words, then we can make the prediction using n-grams: whatever it is, if we use n = 3, we'd have a trigram count up here, normalized by a bigram count down
here, and that would give us relative frequencies of the different continuations. [00:31:49] So we can do that simply by counting how often n-grams occur in a large amount of text and dividing through by the counts, and that gives us a relative-frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it.

[00:32:12] Okay, so suppose we're learning a 4-gram language model, and we've got a piece of text: "as the proctor started the clock, the students opened their ___". Well, to estimate things, we're going to throw away all but the preceding three words, so we're going to estimate based on "students opened their", and we're going to work out the probabilities by looking at counts of "students opened their w" and counts of "students opened their". So we might have in a corpus that "students opened their"
occurred a thousand times, and "students opened their books" occurred 400 times, so we'd say the probability estimate for "books" is simply 0.4; if "exams" occurred 100 times, the probability estimate for "exams" is 0.1.

[00:33:08] And you can sort of see that this is bad, though it's not terrible: if you're going to predict the next word in a simple way, the immediately prior words are the most helpful ones to look at. But it's clearly primitive, because if you'd known that the prior text was "as the proctor started the clock", that makes it sound likely that the word should have been "exams", whereas since you're estimating just based on "students opened their", you'd be more likely to choose "books", because it's more common. So it's a kind of crude estimate, but it's a decent enough place to start, and it's a crude estimate that could be problematic in other ways, too.
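The worked example above, as code; the counts are the hypothetical ones from the lecture:

```python
from collections import Counter

count4 = Counter({
    ("students", "opened", "their", "books"): 400,
    ("students", "opened", "their", "exams"): 100,
})
count3 = Counter({("students", "opened", "their"): 1000})

def p_next(context, word):
    # Relative-frequency estimate: count(context + word) / count(context).
    return count4[context + (word,)] / count3[context]

ctx = ("students", "opened", "their")
print(p_next(ctx, "books"))  # 0.4
print(p_next(ctx, "exams"))  # 0.1
print(p_next(ctx, "frogs"))  # 0.0 -- the unseen-continuation problem
```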
I mean, why else might we get into trouble using this probability estimate? [00:34:05] "There are a lot of n-grams." Yeah, there are a lot of words, and therefore there are a lot of n-grams; that's a problem we'll come to later. Anything else? Maybe up the back. "The word w might not even show up in the training data, so you might just have a count of zero for that." Yeah: if we're counting over any reasonable-size corpus, there are lots of words that we're just not going to have seen, that never happened to occur in the text we counted over. If you start thinking about "students opened their ___", there are lots of things you could put there: "students opened their accounts", or, if the students are doing dissections in a biology class, maybe "students opened their frogs", I don't know. There are lots of words
know that there are lots of words that in some context you know would [00:35:03] that in some context you know would actually be possible and lots of them [00:35:06] actually be possible and lots of them that we won't have seen and so it give [00:35:08] that we won't have seen and so it give them a probability estimate of zero and [00:35:11] them a probability estimate of zero and that tends to be an especially bad thing [00:35:13] that tends to be an especially bad thing to do with probabilities because once we [00:35:15] to do with probabilities because once we have a probability estimate of zero any [00:35:17] have a probability estimate of zero any computations that we do that involve [00:35:19] computations that we do that involve that will instantly go to zero so we [00:35:22] that will instantly go to zero so we have to deal with some of these problems [00:35:24] have to deal with some of these problems so for that sparity problem right yeah [00:35:28] so for that sparity problem right yeah that we could have the word never [00:35:31] that we could have the word never occurred in the numerator and so simply [00:35:35] occurred in the numerator and so simply done we get a probability estimate of [00:35:38] done we get a probability estimate of zero the way that was dealt with was [00:35:41] zero the way that was dealt with was that people just hacked the counts a [00:35:43] that people just hacked the counts a little to make it non zero so there are [00:35:45] little to make it non zero so there are lots of ways that are explored but the [00:35:47] lots of ways that are explored but the easiest way is you just sort of added a [00:35:50] easiest way is you just sort of added a little Delta like you know 0.25 to [00:35:54] little Delta like you know 0.25 to counts so things that you never saw got [00:35:56] counts so things that you never saw got a count of 0 .25 in total and things you [00:36:00] a count of 0 .25 in total and things you saw once got to count 
and things you saw once got a count of 1.25, and then there are no zeros anymore; everything is possible. [00:36:08] Then you could think there's a second problem: wait, you might never have seen "stupid students opened their" before, and so your denominator is just undefined, and you don't have any counts in the numerator either. So you need to do something different there, and the standard trick that was used was back-off: if you couldn't estimate words coming after "students opened their", you just worked out the estimates for words coming after "opened their", and if you couldn't estimate that, you just used the estimate of words coming after "their". So you used less and less context until you could get an estimate that you could use. [00:36:52] But something to note is that we've now got conflicting pressures.
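Both fixes can be sketched together. A minimal illustration with made-up toy counts, a tiny vocabulary, and delta = 0.25 (real systems used more careful schemes such as Katz back-off, which this simplifies):

```python
from collections import Counter

delta = 0.25                      # the "little delta" added to every count
vocab = ["books", "exams", "frogs", "accounts"]

# Toy counts of words observed after the context ("opened", "their")
tables = {("opened", "their"): Counter({"books": 4, "exams": 1})}

def smoothed_prob(word, counts):
    """Add-delta smoothing: every vocabulary word gets delta added to its count."""
    total = sum(counts.values()) + delta * len(vocab)
    return (counts[word] + delta) / total

def backoff_prob(word, context):
    """Back off to shorter and shorter contexts until one was actually seen."""
    while context and context not in tables:
        context = context[1:]     # drop the earliest conditioning word
    return smoothed_prob(word, tables.get(context, Counter()))

# "stupid students opened their" was never seen, so we back off to
# ("students", "opened", "their"), then to ("opened", "their"), which was seen.
p = backoff_prob("frogs", ("stupid", "students", "opened", "their"))
print(p)  # 0.25 / 6: small, but no longer zero
```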
[00:37:05] On the one hand, if you want to come up with a better estimate, you'd like to use more context, i.e., to have a larger n-gram. But on the other hand, as you use more and more conditioning words, the storage-size problem someone mentioned gets worse and worse, because the number of n-grams you have to know about goes up exponentially with the size of the context; and your sparsity problems also get way, way worse, and you're almost necessarily going to end up seeing zeros. Because of that, in practice things tended to max out at five: occasionally people used 6-grams and 7-grams, but most of the time, between the sparsity and the cost of storage, 5-grams were the largest thing people dealt with. [00:38:00] A famous resource from back in the 2000s decade that Google released was Google N-grams, which was built on a trillion-word web corpus, and it gave counts of n-grams up to n = 5, and that's where they stopped. [00:38:23] Okay, so we've sort of stated the storage problem: to do this, you need to store these counts, and the number of counts goes up exponentially in the context size. But you know what's good about n-gram language models? They're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend: all you have to do is start storing these counts for n-grams, and you can use them to predict things. At least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Well, you do have to write the software first; okay, a few minutes to write the
software, but building the model takes seconds, because there's no training of a neural network: all you do is count how often n-grams occur. [00:39:23] And once you've done that, you can run your n-gram language model to generate text; we could do text generation before ChatGPT. So if I have a trigram language model, I can start off with some words, "today the", and I can look at my stored n-grams and get a probability distribution over next words, and here they are. Note the strong patterning of these probabilities: remember, they're all derived from counts that are being normalized, so really these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context. So they're in some sense crude when you look at them more carefully. [00:40:14] But what we can do at this point is roll a die, get a random number between 0 and 1, and use it to sample from this distribution. So if we generate as our random number something like 0.35, and we go down from the top, we'd say okay, we've sampled the word "price": "today the price". Then we repeat: we condition on that, get a probability distribution for the next word, generate a random number, and use it to sample from the distribution; say we generate 0.2, and so we choose "of". We now condition on that, get a probability distribution, generate a random number, which is 0.5 or something, and so we get "gold" coming out, and we can say "today the price of gold". And we can keep on doing this and generate some text. [00:41:22] So here's some text generated from 2 million words of training data using a trigram language model.
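The whole pipeline, counting trigrams and then repeatedly rolling the die, fits in a few lines. A toy sketch with a made-up dozen-word corpus standing in for the lecture's 2 million words:

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for the lecture's 2-million-word news corpus.
corpus = ("today the price of gold rose while today the price of oil "
          "fell and today the bank intervened").split()

# Count trigrams: map each two-word context to counts of the word that follows.
trigrams = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    trigrams[(a, b)][c] += 1

def sample_next(context, rng):
    """Roll a die in [0, 1) and sample from the normalized counts."""
    counts = trigrams.get(context)
    if not counts:
        return None                        # context never seen: a dead end
    r = rng.random() * sum(counts.values())
    for word, count in counts.items():
        r -= count
        if r <= 0:
            return word
    return word                            # guard for floating-point edge cases

rng = random.Random(0)
text = ["today", "the"]
for _ in range(5):
    word = sample_next((text[-2], text[-1]), rng)
    if word is None:
        break
    text.append(word)
print(" ".join(text))
```

Every generated word continues an observed trigram, which is why the output is locally plausible but drifts globally, just like the lecture's sample.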
So, the generated text: "today the price of gold per ton while production of shoe lasts and shoe industry the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks september 3rd in primary 76 cents a share". [00:41:50] Now, okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad, right? It's sort of surprisingly grammatical. In particular, I lowercased everything, so this "imf" that should be capitalized is the IMF, the International Monetary Fund. There are big pieces of this that even make sense: "the bank intervened just after it considered and rejected an IMF demand", that's pretty much making sense as a piece of text. So it's mostly grammatical; it looks like English text. [00:42:37] But it makes no sense; it's really incoherent. So there's work to do. But you could already see that, even with these simple n-gram models, you could from a very low level kind of approach, from below, how text and human language work. And I could easily make this better, even with the n-gram language model: rather than two million words of text, if I trained on 10 million words of text, it would be better; if, rather than a trigram model, I went to a 4-gram model, it would get better; and you'd start getting better and better approximations of text. [00:43:20] And this is essentially what people did until about 2012. And really, the same story that people tell today, that scale will solve everything, is exactly the same story that people used to tell in the early 2010s.
[00:43:43] With these n-gram language models, if you weren't getting good enough results with your 10 million words of text and a trigram language model, the answer was that with 100 million words of text and a 4-gram language model you'd do better; and then with a trillion words of text and a 5-gram language model you'd do better; and gee, wouldn't it be good if we could collect 10 trillion words of text, so we could train an even better n-gram language model. Same strategy. [00:44:10] But it turns out that sometimes you can do better with better models as well as simply with scale, and so things got reinvented and started again with building neural language models. So how can we build a neural language model? Well, we've got the same task: we have a sequence of words, and we want to put a probability estimate over what word comes next.
[00:44:43] And the simplest way you could do that, which hopefully you'll all have thought of because it connects to what we did in earlier classes: look, we already had this idea that we could represent context by the concatenation of some word vectors, and we could put that into a neural network and use it to predict something. In the example I did in the last couple of classes, what we used it to predict was whether the center word is a location or not, just a binary choice. But that's not the only thing we could predict. We could have predicted lots of things with this neural network: whether the piece of text was positive or negative, whether it was written in English or Japanese. So one thing we could choose to predict is what word is going to come next after this window of text. [00:45:39] We'd have a model just like that one, except that up the top, instead of doing binary classification, we do a many-way classification over what the next word to appear in the piece of text is. And that would give us a neural language model; in particular, it gives us a fixed-window neural language model. We do the same Markov-assumption trick of throwing away the further-back context, and for the fixed window we use word embeddings, which we concatenate; we put that through a hidden layer; and then we take the output of that hidden layer, multiply it by another matrix, say, and put that through a softmax to get an output distribution. And so this gives us a fixed-window neural language model.
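As a minimal numpy sketch of that forward pass (the sizes and the untrained random weights are all made up for illustration; a real model would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, window = 50, 8, 16, 4    # vocab, embedding dim, hidden dim, window size

E = rng.normal(size=(V, d))           # word embedding matrix (one row per word)
W = rng.normal(size=(h, window * d))  # hidden-layer weights over the window
b1 = np.zeros(h)
U = rng.normal(size=(V, h))           # output-layer weights
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fixed_window_lm(word_ids):
    """Concatenate the window's embeddings, apply one hidden layer, then softmax."""
    x = np.concatenate([E[i] for i in word_ids])   # shape (window * d,)
    hidden = np.tanh(W @ x + b1)
    return softmax(U @ hidden + b2)                # distribution over the next word

p = fixed_window_lm([3, 14, 15, 9])   # e.g. ids for "the students opened their"
print(p.shape, round(float(p.sum()), 6))
```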
[00:46:49] Apart from the fact that we're now doing a classification over many, many classes, this is exactly like what we did last week, so it should look kind of familiar; it's also kind of like what you're doing for assignment two. And this is essentially the first kind of neural language model that was proposed. [00:47:06] In particular, Yoshua Bengio, really right at the beginning of the 21st century, suggested that you could do this: rather than using an n-gram language model, you could use a fixed-window neural language model. And even at that point, he and colleagues were able to get some positive results from this model. But at the time it wasn't widely noticed; it didn't really take off that much, for a combination of reasons. With only a fixed window, it was not that different from n-grams in some sense, even though it could be argued that the neural network gives better generalization than using counts. And in practice, neural nets were still hard to run without GPUs, and people felt, and I think in general this was the case, that you could get more oomph by doing the scale story and collecting your n-gram counts on hundreds of billions of words of text, rather than trying to make a neural network out of it. So it didn't especially take off at that time. [00:48:21] But in principle it seemed a nice thing: it got rid of the sparsity problem, and it got rid of the storage costs, since you no longer have to store all observed n-grams, you just have to store the parameters of your neural network. But it didn't solve all the problems we'd like to solve. In particular, we still have the problem of the Markov assumption: we're just using a small fixed context beforehand to predict from.
[00:48:53] And there are some disadvantages to enlarging that window; there's no fixed window that's ever big enough. There's another thing that, if you look technically at this model, might make you suspicious of it: when we have words in different positions, those words will be treated by completely different sub-parts of this matrix W. You might think that, for predicting that "books" comes next, the fact that this is a student is important, but it doesn't matter so much exactly where the word "student" occurs. The context could have been "the students slowly opened their", and it's still the same students; we've just got a bit different linguistic structure. Yet this W matrix would be using completely separate parameters to learn stuff about "student" here versus "student" in this position. So that seems kind of inefficient and wrong. [00:50:06] And so that suggested that we need a different kind of neural architecture, one that can process any length of input and can use the same parameters to say: hey, I saw the word "student"; that's evidence that things like "books", "exams", "homework" will be turning up, regardless of where it occurs. And so that then led to the exploration of a different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do, is everyone basically okay with what a language model is? Yeah? No questions? Okay. [00:50:51] Recurrent neural networks. So recurrent neural networks are a different family of neural networks. Effectively, in this class we see several neural network architectures. In some sense, the first architecture we saw was word2vec, which is a sort of very simple encoder-decoder architecture. The second family we saw was feed-forward networks, or fully-connected-layer classic neural networks. And the third family we're going to see is recurrent neural networks, which have different kinds, and then we'll go on to Transformer models. [00:51:41] Okay, so the idea of a recurrent neural network is that you've got one set of weights that is going to be applied through successive moments in time, i.e., successive positions in the text, and as you do that, you're going to update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it. We've got "the students opened their", and we want to predict with that. Okay, I've still got four words in my example, so I can put everything down the left side of the slide, but there could have
been 24 words: recurrent neural networks can deal with any length of context. [00:52:30] Okay, so as before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before. But now, to compute probabilities for the next word, we're going to do something different: our hidden layer is going to be recurrent. By recurrent, I mean we're going to change a hidden state at each time step as we proceed through the text from left to right. So we're going to start off with h0, the initial hidden state, which can actually just be all zeros. And then at each time step, what we're going to do is: we multiply the previous hidden state by a weight matrix, we take a word embedding and multiply it by a weight matrix, and then we sum the results of those two things, and that's going to give us a new hidden state. [00:53:33] That hidden state will then store a memory of everything that's been seen so far. So we'll do that, and then we'll continue along: we multiply the next word vector by the same weight matrix We, we multiply the previous hidden state by the same weight matrix Wh, and we add them together and get a new representation. [00:54:04] I've only said part of it; I've left out a bit: commonly there are two other things you're doing. You're adding on a bias term, because we usually separate out a bias term, and you're putting things through a nonlinearity, so I should make sure I mention that. And for recurrent neural networks, this nonlinearity has most commonly been the tanh function, so it's balanced on the positive and negative sides. And so you keep on doing that through each step.
idea [00:54:32] that through each step and so the idea is once we've got to here this H4 hidden [00:54:36] is once we've got to here this H4 hidden state is a hidden state that in some [00:54:38] state is a hidden state that in some sense has read the text up until now [00:54:41] sense has read the text up until now it's seen all of the students open there [00:54:44] it's seen all of the students open there and if the word students occurred in any [00:54:47] and if the word students occurred in any of these positions it will have been [00:54:49] of these positions it will have been multiplied by the same we Matrix and [00:54:53] multiplied by the same we Matrix and added into the hidden state so it's kind [00:54:55] added into the hidden state so it's kind of got a cleaner [00:54:56] of got a cleaner low parameter way of incorporating in [00:54:59] low parameter way of incorporating in the information that seen so now I want [00:55:02] the information that seen so now I want to predict the next word and to predict [00:55:05] to predict the next word and to predict the next word I'm then going to do based [00:55:08] the next word I'm then going to do based on the final hidden State the same thing [00:55:11] on the final hidden State the same thing I did kind of thing I did before so I'm [00:55:14] I did kind of thing I did before so I'm going to multiply that hidden state by [00:55:17] going to multiply that hidden state by matrix and add another bias and stick [00:55:19] matrix and add another bias and stick that through a soft Max and use that to [00:55:24] that through a soft Max and use that to um sample from that soft Max well the [00:55:26] um sample from that soft Max well the softmax will give me a language model of [00:55:28] softmax will give me a language model of probability over all next words and I [00:55:31] probability over all next words and I can sample from it to generate the next [00:55:36] word that make [00:55:38] word that make sense okay 
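The recurrent step and the softmax prediction just described can be sketched as follows. This is only a toy illustration, not the lecture's actual code: the vocabulary size, dimensions, and random weights are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): vocabulary of 10 words,
# 8-dimensional embeddings, 16-dimensional hidden state.
V, d, h = 10, 8, 16

E   = rng.normal(0, 0.1, (V, d))   # word embedding table
W_e = rng.normal(0, 0.1, (h, d))   # embedding -> hidden weights (We)
W_h = rng.normal(0, 0.1, (h, h))   # hidden -> hidden weights (Wh)
b1  = np.zeros(h)                  # hidden bias
U   = rng.normal(0, 0.1, (V, h))   # hidden -> vocabulary logits
b2  = np.zeros(V)                  # output bias

def rnn_step(h_prev, word_id):
    """One recurrent step: same Wh and We at every position, tanh nonlinearity."""
    e_t = E[word_id]                               # look up the word embedding
    return np.tanh(W_h @ h_prev + W_e @ e_t + b1)  # new hidden state

def next_word_distribution(h_t):
    """Softmax over the vocabulary from the final hidden state."""
    logits = U @ h_t + b2
    exp = np.exp(logits - logits.max())            # subtract max for stability
    return exp / exp.sum()

h_t = np.zeros(h)                   # h0: the initial hidden state, all zeros
for w in [3, 1, 4, 1]:              # toy ids standing in for "the students opened their"
    h_t = rnn_step(h_t, w)

p = next_word_distribution(h_t)     # probability over all V possible next words
```

The key point the code makes concrete is that the same `W_h` and `W_e` are reused at every time step; only the hidden state changes.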
[00:55:42] Okay, so for recurrent neural networks, we can now process any length of preceding context; we just put more and more stuff into our hidden state. So our computation can use information from many steps back. Our model size doesn't increase for having a long context: we have to do more computation for a long context, but our representation of that long context remains this fixed-size hidden vector h, of whatever dimension it is, so there's no exponential blowup anymore. And the same weights are applied at every time step, so there's a symmetry in how inputs are processed. But there are some catches. [00:56:42] The biggest catch in practice is that recurrent computation is slow. For the feed-forward layer, we just had our input vector; we multiply it by a matrix, multiply it by a matrix, however many times, and then at the end we're done. Whereas here we're stuck with this sequentiality: you have to compute one hidden vector at a time. In fact, this is going against what I said at the beginning of class, because essentially here you're doing a for-loop: you're going through for time equals 1 to T, generating each hidden vector in turn, and that's one of the big problems with RNNs that has led them to fall out of favor. [00:57:26] There's another problem that we'll look at more: in theory this is perfect, you're just incorporating all of the past context into your hidden vector. In practice it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words that you saw recently that dominate the hidden state.
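The sequentiality complaint above can be made concrete with a small sketch (toy sizes and random weights are my assumptions): a fully connected layer can process every position in one matrix multiply, while the recurrent layer is forced into an explicit for-loop because step t needs the hidden state from step t-1.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4                         # 6 time steps, 4-dim vectors (toy sizes)
X   = rng.normal(size=(T, d))       # an already-embedded input sequence
W_x = rng.normal(0, 0.5, (d, d))
W_h = rng.normal(0, 0.5, (d, d))

# Feed-forward / fully connected: one matmul covers all T positions at once,
# fully parallel over the sequence.
ff_out = np.tanh(X @ W_x.T)         # shape (T, d)

# Recurrent: an explicit for-loop over time; step t cannot start until
# h_{t-1} exists, so the T steps are inherently sequential.
h = np.zeros(d)
hs = []
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ X[t])
    hs.append(h)
hs = np.stack(hs)                   # shape (T, d)
```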
[00:58:00] Now, in some sense that's right, because the recent stuff is the most important stuff, freshest in your mind. It's the same with human beings: they tend to forget stuff from further back as well. But RNNs, especially in the simple form I've just explained, forget stuff from further back rather too quickly, and we'll come back to that again in Thursday's class. [00:58:27] Okay, so for training an RNN language model, the starting point is that we get a big corpus of text again, and then for each time step we compute a prediction of the probability of next words. Then there's an actual next word, and we use that as the basis of our loss. So our loss function is the cross-entropy between the predicted probability and the actual next word that we saw, which, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally you'd like to predict the actual next word with probability one, which means the negative log of one would be zero and there'd be no loss; but in practice, if you give it an estimate of 0.5, there's only a little bit of loss, and so on. So to get our overall objective function, we work out the average loss: the average negative log likelihood of predicting each word in turn. [00:59:44] Showing that as pictures: if our corpus is "the students opened their exams," we're first of all going to try to predict what comes after "the," and we will predict some words with different probabilities. Then we'll say, oh, the actual next word is "students"; okay, you gave that a probability of 0.05, say, because all you knew was that the first word was "the." There's a loss for that: the negative log probability given to "students." We then go on and generate the probability estimate over the next words, and then we say, well, the actual word is "opened," what probability estimate did you give to that? We get a negative log probability loss. We keep running this along, then we sum all of those losses and average them per word, and that's our average per-word loss, and we want to make that as small as possible. So that's our training mechanism. [01:00:53] It's important to note that for generating this loss we're not doing free generation; we're not saying to the model, go off and generate a sentence. What we're actually doing at each step is effectively saying: okay, the prefix is "the students opened," what probability distribution do you put on next words after that? Generate it with our recurrent neural network.
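The average per-word loss just described reduces to a couple of lines. The distributions and target indices below are made-up toy numbers, not anything from the lecture:

```python
import numpy as np

# Toy predicted distributions over a 5-word vocabulary at 4 time steps,
# and the index of the word that actually came next at each step.
probs = np.array([
    [0.05, 0.70, 0.10, 0.10, 0.05],   # p(next word | "the")
    [0.20, 0.10, 0.50, 0.10, 0.10],   # p(next word | "the students")
    [0.10, 0.10, 0.10, 0.60, 0.10],
    [0.25, 0.25, 0.25, 0.10, 0.15],
])
targets = np.array([1, 2, 3, 0])       # actual next words in the corpus

# Cross-entropy against a one-hot target is just the negative log
# probability assigned to the word that actually occurred; the objective
# is the average of these per-word losses.
per_word_loss = -np.log(probs[np.arange(len(targets)), targets])
loss = per_word_loss.mean()
```

Note how predicting the true word with probability 0.70 contributes only a small loss, while probability 1.0 would contribute exactly zero, just as described above.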
[01:01:21] Then we ask: for the actual next word, what probability estimate did you give to "their"? And that's our loss. But then what we do is feed "their," the right answer, into our recurrent neural network. So we always go back to the right answer, generate the probability distribution for next words, and then ask: okay, what probability did you give to the actual next word, "exams"? And then again we use the actual next word. So we do one step of generation, then we pull it back to what was actually in the text, then we ask it for guesses over the next word, and repeat forever. The fact that we don't do free generation, but pull it back to the actual piece of text each time, makes things simple, because we know what an actual author used for the next word. That process is called teacher forcing, and the most common way to train language models is using this kind of teacher forcing method. [01:02:30] I mean, it's not perfect in all respects, because we're not actually exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" from some human-generated piece of text. [01:02:51] Okay, so that's how we get losses, and after that, as before, we want to use these losses to update the parameters of the neural network. And how do we do that? Well, in principle, we just have all of the text that we've collected, which you could think of as one really long sequence: okay, we've got a billion words of text, here it is. So in theory you could just run your recurrent neural network over your billion words of text, updating the context as you go, but that would make it very difficult to train a model.
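Teacher forcing as described above can be sketched in a few lines. The `model_predict` function here is a hypothetical stand-in (it ignores its input and returns random probabilities); the point is purely the training loop's structure: score the actual next word, then feed the actual word back in, never the model's own sample.

```python
import numpy as np

rng = np.random.default_rng(2)
V = 5
corpus = [0, 3, 1, 4, 2]              # toy token ids for a training sentence

def model_predict(prefix_ids):
    """Stand-in for the RNN: returns some distribution over the next word.
    (This toy version ignores the prefix; a real model would condition on it.)"""
    logits = rng.normal(size=V)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Teacher forcing: at each step, predict from the ground-truth prefix,
# take the loss on the actual next word from the corpus, and move on
# using the actual word -- no free generation during training.
losses = []
for t in range(len(corpus) - 1):
    prefix = corpus[: t + 1]          # always the real text, never a sample
    p = model_predict(prefix)
    actual_next = corpus[t + 1]
    losses.append(-np.log(p[actual_next]))

loss = float(np.mean(losses))
```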
[01:03:42] You'd be accumulating these losses for a billion steps and you'd have to store them, and you'd have to store hidden states so you could update parameters, and it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, then run our recurrent neural network on those segments, compute a loss for each segment, and update the parameters of the recurrent neural network based on the losses that we found for that segment. [01:04:24] I describe it here as the segments being sentences or documents, which seems a linguistically nice thing; it turns out that in recent practice, when you want to scale most efficiently on GPUs, people don't bother with those linguistic niceties. They just say a segment is 100 words: just cut every 100 words.
[01:04:50] The reason that's really convenient is that you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, and do vectorized training more efficiently, and things go great for you. [01:05:06] But there are still a few more things we need to know to get things to work great for you; I'll try to get a bit more through this before today ends. We need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network, and the interesting case here is that these Wh parameters are being used everywhere through the neural network, at each stage, as are the We ones. They appear at many places in the network, so how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? [01:05:57] The answer is really simple: you can just pretend that those Wh's at each position are different, work out the partials with respect to them at each position, and then, to get the partials with respect to Wh, you sum whatever you found at the different positions. [01:06:26] So the gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears, and the reason why follows from what I talked about in lecture three. You can also think about it in terms of what you might remember from the multivariable chain rule, but the way I introduced it in lecture three is that gradients sum at outward branches. So what you can think of in a case like this is that you've got a Wh matrix which is being copied by identity to Wh1, Wh2, Wh3, Wh4, etc., at each time step. [01:07:19] Since those are identity copies, they have a partial derivative of one with respect to each other, and so we apply the multivariable chain rule to these copies: we've got an outward-branching node, and you just sum the gradients to get the total gradient for the matrix. [01:07:52] Okay, there's one other trick that's perhaps worth knowing. If you've got segments that are 100 long, a common speed-up is to say: maybe we don't actually have to run backpropagation for 100 time steps; maybe we could just run it for 20 time steps and stop, which is referred to as truncated backpropagation through time. In practice that tends to be sufficient. Note in particular that on the forward path you're still updating your hidden state using your full context, but in the backpropagation you're just cutting it short to speed up training.
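The "sum over the copies" rule can be checked numerically on a tiny scalar RNN, h_t = w·h_{t-1} + x_t with loss L = h_2 (my toy example, with the nonlinearity dropped to keep the algebra visible): treat the two uses of w as separate copies w_1, w_2, take the partial at each position, sum them, and compare against a finite-difference gradient of the shared w.

```python
# Scalar "RNN": h1 = w*h0 + x1, h2 = w*h1 + x2, loss L = h2.
w, h0 = 0.7, 0.5
x1, x2 = 1.0, -2.0

h1 = w * h0 + x1
h2 = w * h1 + x2          # L = h2

# Pretend the two uses of w are different copies and take a partial at each:
grad_pos2 = h1            # dL/dw_2: w_2 multiplies h1 directly
grad_pos1 = w * h0        # dL/dw_1: flows through h1 (dL/dh1 = w, dh1/dw_1 = h0)
grad_sum = grad_pos1 + grad_pos2

# Numerical check: perturb the *shared* w everywhere at once.
eps = 1e-6
def loss(w_):
    return w_ * (w_ * h0 + x1) + x2
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```

The two gradients agree, which is exactly the gradients-sum-at-outward-branches argument from lecture three in miniature.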
[01:08:34] Okay, so just as I did before with an n-gram language model, we can use an RNN language model to generate text, and it's pretty much the same idea, except that now, rather than just using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. So I can start with the initial hidden state, and I can use the start-of-sentence symbol. In the example I had before, I started immediately with "the," hoping that was less confusing the first time, but what you should have asked is: wait a minute, where did the "the" come from? [01:09:24] Normally what we actually do is use a special start-of-sequence symbol, this angle-bracketed <s>, and feed it in as a pseudo-word which has a word embedding. Then, based on this, we'll be generating the first words of the text: we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we do is take that word we generated and copy it down as the next input, then run the next stage of the neural network, sample from the probability distribution, get the next word, say "favorite," copy it down as the next word of the input, and keep on generating. This is referred to as a rollout: you keep rolling the dice and generating forward, producing a piece of text. [01:10:26] Normally you want to stop at some point, and the way we can do that is to have a second special symbol, the angle-bracketed </s>, which says "end of your sequence." So we can generate an end-of-sequence symbol.
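The rollout loop just described can be sketched as below. The vocabulary and the `next_word_probs` function are hypothetical stand-ins (a real RNN would compute the probabilities from its hidden state); what matters is the loop: start from `<s>`, sample, feed the sample back in, stop at `</s>`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy vocabulary with the special start and end symbols.
vocab = ["<s>", "</s>", "my", "favorite", "season", "is", "spring"]
V = len(vocab)

def next_word_probs(prefix_ids):
    """Stand-in for the trained RNN's softmax over the next word.
    (Ignores the prefix; a real model would condition on it.)"""
    logits = rng.normal(size=V)
    logits[0] = -1e9                 # never re-generate the start symbol
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Rollout: sample a word, copy it down as the next input, repeat
# until the end-of-sequence symbol (or a length cap) is produced.
ids = [vocab.index("<s>")]
for _ in range(20):                  # length cap so the demo always stops
    p = next_word_probs(ids)
    nxt = int(rng.choice(V, p=p))
    ids.append(nxt)
    if vocab[nxt] == "</s>":
        break

generated = [vocab[i] for i in ids[1:]]
```

Because the sampling is probabilistic, re-running the rollout gives different outputs, which is the point made below about getting different answers on repeated generations.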
And then we can stop. [01:10:52] Using this, we can generate pieces of text, and essentially this is exactly what's happening if you use something like ChatGPT. The model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process: generating a word at a time, treating it as input, and generating the next word, producing this sort of rollout. And it's done probabilistically, so if you do it multiple times, you can get different answers. We haven't yet gotten to ChatGPT, but we can have a little bit of fun. You can take this simple recurrent neural network that we've just built here, train it on any piece of text, and get it to generate stuff. For example, I can train it on Barack Obama's speeches. That's a small corpus; he didn't talk that much.
you know he didn't talk that much right I've only got a few hundred thousand [01:11:50] I've only got a few hundred thousand words of text it's not a huge Corpus [01:11:53] words of text it's not a huge Corpus I'll just show this and then I can [01:11:54] I'll just show this and then I can answer the question um but you know I [01:11:56] answer the question um but you know I can generate from it and I get something [01:11:59] can generate from it and I get something like the United States will step up to [01:12:01] like the United States will step up to the cost of a new challenges of the [01:12:03] the cost of a new challenges of the American people that will share the fact [01:12:05] American people that will share the fact that we created the problem they were [01:12:08] that we created the problem they were attacked and so that they have to say [01:12:09] attacked and so that they have to say that all the task of the final days of [01:12:11] that all the task of the final days of war that I will not be able to get this [01:12:14] war that I will not be able to get this done um yeah well maybe that's slightly [01:12:17] done um yeah well maybe that's slightly better than my engram language model [01:12:19] better than my engram language model still not perfect you might say but [01:12:21] still not perfect you might say but somewhat better maybe did you have a [01:12:24] somewhat better maybe did you have a question uh yeah so since we're like [01:12:28] question uh yeah so since we're like training the mod like truncated set of [01:12:30] training the mod like truncated set of the Corpus that impose some kind of like [01:12:33] the Corpus that impose some kind of like limitation on like how much we can like [01:12:36] limitation on like how much we can like produce and like still have some cency [01:12:38] produce and like still have some cency like meaning like [01:12:42] like meaning like foring um so yeah so I suggested we're [01:12:45] foring um so yeah so 
[01:12:45] So yes: I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context that we're going to use. I mean, that's a fair amount; 100 words is typically several sentences. But to the extent that you wanted to know even more about the further-back context, you wouldn't be able to, and certainly that's one of the ways in which modern large language models differ: they're using far bigger contexts than that, now thousands of words of prior context. So yes, absolutely, it's a limit on how much far-back context you can use. In some sense, even though in theory a recurrent neural network can feed in an arbitrary-length context, as soon as I say "practically, we cut it into segments," that actually means we are making a Markov assumption again: we're saying the further-back context doesn't matter.
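The chunking just described can be sketched in a few lines. This is a minimal illustration, not anything from the course materials; the 5-token segment length in the demo stands in for the 100-word chunks mentioned above:

```python
def chunk_corpus(tokens, segment_len=100):
    """Cut a token list into fixed-length training segments.

    No context crosses a segment boundary, so the model effectively
    makes a Markov assumption: anything more than segment_len tokens
    back cannot influence a prediction.
    """
    return [tokens[i:i + segment_len]
            for i in range(0, len(tokens), segment_len)]

# Tiny demo with a 5-token segment length (the lecture's setup uses 100):
corpus = ("the united states will step up to the cost "
          "of a new challenges of the american people").split()
segments = chunk_corpus(corpus, segment_len=5)
print(segments[0])   # ['the', 'united', 'states', 'will', 'step']
```

In training, each segment is fed to the RNN independently, with the hidden state reset at each boundary; that reset is exactly where the far-back context gets thrown away.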
[01:13:47] Okay, a couple more examples. Instead of Barack Obama I can feed in Harry Potter, which is actually a somewhat bigger corpus of text, and generate from that, and I get: "sorry Harry shouted panicking I'll leave those brooms in London are they no idea said nearly headless Nick casting low close by Cedric carrying the last bit of trial charms from Harry's shoulder and to answer him the common room perched upon it forearms held a shining knob from when the spider hadn't felt it seamed he reached the teams too." Well, there you are. You can do other things as well: you can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try to cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate Ranch Barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings."
[01:14:52] "Two tablespoons of Parmesan cheese, chopped; one cup of coconut milk; and three eggs, beaten. Place each pasture over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot and bodied fresh mustard orange and cheese. Combine the cheese and salt together the dough in a large skillet and the ingredients and stir in the chocolate and pepper." Yeah, it's not exactly a very consistent recipe when it comes down to it; it sort of has the language of a recipe, but that's about all. Maybe if I had scaled it up more and had a bigger corpus it would have done a bit better, but it's definitely not using the ingredients. Let's see, it's almost time today, so maybe about all I can do is one more fun example, and then after that... oh yeah, I should probably do that bit at the start next time. So, as a variant of building RNN language models: so far we've been building them over words.
[01:16:03] The token time steps over which we build it are words. But actually you can use the idea of recurrent neural networks over units of any other size, and people have used them for other things: people have used them in bioinformatics, for things like DNA, for gene sequencing or protein sequencing and anything like that. But even staying with language, instead of building them over words you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. That can sometimes be useful, because it allows us to generate things that look like words, and perhaps have the structure of English words. And similarly, there are other things that you can do.
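The character-level generation loop being described can be sketched as follows. This is a toy, with random (untrained) weights and an assumed 27-character vocabulary, so its output is gibberish; but the sampling loop, one letter at a time through a softmax over characters, is the shape of the real thing:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcdefghijklmnopqrstuvwxyz ")      # assumed toy character set
V, H = len(vocab), 32

# Untrained parameters; a real model would learn these by backpropagation.
Wh = rng.normal(0, 0.1, (H, H))
Wx = rng.normal(0, 0.1, (H, V))
Wo = rng.normal(0, 0.1, (V, H))

def step(h, char_id):
    """One RNN step over characters: new hidden state + next-char distribution."""
    x = np.zeros(V)
    x[char_id] = 1.0                             # one-hot current character
    h = np.tanh(Wh @ h + Wx @ x)
    logits = Wo @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                        # softmax over characters

def generate(n_chars=30):
    h = np.zeros(H)                              # plain zero initial hidden state
    c = vocab.index(" ")
    out = []
    for _ in range(n_chars):
        h, p = step(h, c)
        c = rng.choice(V, p=p)                   # sample the next character
        out.append(vocab[c])
    return "".join(out)

print(generate())                                # 30 characters of letter-soup
```

The hidden state starts as plain zeros here; the contextual variant discussed next replaces that zero vector with something meaningful.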
[01:17:04] So, before, when I initialized the hidden state, I said: oh, you just have an initial hidden state, and you can make it zeros if you want. Well, sometimes we're going to build a contextual RNN, where we initialize the hidden state with something else. In particular, I can initialize the hidden state with the RGB values of a color, and then generate, a character at a time, the names of paint colors. I can train a model on a paint company's catalog of color names and the RGB values of those colors, and then I can give it different paint colors and it'll come up with names for them. And it actually does an excellent job; this one worked really well. Look at this: Gasty Pink, Power Gray, Naval Tan, Bco White, Hble Gray, Home Star Brown. Now, couldn't you just imagine finding all of these in a paint catalog? I mean, some of them... there are some really good ones over here in the bottom right.
[01:18:20] This color here is "Dope," and then there's "Stoner Blue," "Stanky Bean," and "Turdly." Now, I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today, and we'll do more of the science of neural networks next time.

================================================================================
LECTURE 006
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 6 - Sequence to Sequence Models
Source: https://www.youtube.com/watch?v=Ba6Fn1-Jsfw
---
Transcript

[00:00:05] Okay, hi everyone, welcome back, all of CS224N. So for today, the plan is essentially a continuation of what we started on Tuesday. I'm going to say more about language models and more about RNNs, in particular introducing a more advanced form of recurrent neural network which was for a while very dominant, LSTMs; we'll talk about those. And then in the latter part, as something more to be done with recurrent neural networks, we'll start looking at neural machine translation.
[00:00:48] Okay, so on Tuesday, what we did was introduce language models, a system that predicts the next word, and then I introduced recurrent neural networks: a neural architecture that can take sequential input of any length, applies the same weights at each step, and can optionally produce output on each step. These are two distinct notions, though they tend to go together: a recurrent neural network can be used for other purposes, on any kind of sequence, and I'll mention a few of those later today; and language modeling is a traditional component of many NLP tasks, anything to do with generating text or estimating likelihoods of pieces of text. Indeed, in the modern instantiation of large language models, essentially everything we do in NLP is being done by language models.
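The recap above, the same weights reused at every step over an arbitrary-length sequence, with an optional output per step, comes down to one update rule: h_t = tanh(W_h h_{t-1} + W_x x_t + b). A minimal sketch (the names W_h, W_x, b are illustrative, and the matrices here are random, not trained):

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, b, h0):
    """Run an RNN over a list of input vectors xs.

    The same (Wh, Wx, b) are applied at every time step -- that is what
    lets the network handle sequential input of any length -- and one
    hidden state (from which an output could be read) is produced per step.
    """
    h, hs = h0, []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)   # h_t = tanh(Wh h_{t-1} + Wx x_t + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(1)
H, D, T = 4, 3, 5                          # hidden size, input size, sequence length
hs = rnn_forward([rng.normal(size=D) for _ in range(T)],
                 0.1 * rng.normal(size=(H, H)),
                 0.1 * rng.normal(size=(H, D)),
                 np.zeros(H), np.zeros(H))
print(len(hs), hs[0].shape)                # one hidden state per input: 5 (4,)
```

The same function runs unchanged on a sequence of any other length, which is the point of weight sharing across time.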
[00:01:51] So, a language model: one way to do it is with a recurrent neural network, but it's certainly not the only way. We also talked last time about n-gram language models, which were language models, and then starting next week we'll start to talk about Transformers, which are now the most widespread way of building language models. So, just to finish off a tiny bit that I didn't get to last time on evaluating language models: well, one way to evaluate language models is what I did in class last time, generate some text and say, "hey, doesn't this text look good?" But often we want something more rigorous than that, and the standard way to evaluate language models is to say: well, a language model scores a piece of text and says how likely it is, and our standard for text in the language is stuff produced by human beings. So we find a new piece of text, which wasn't text that the model was trained on.
[00:02:53] Right, we want some fresh evaluation data, and we show it to the language model. We can then ask the language model to predict the successive words of this text, and the better it is at doing that, the better a language model it is, because it's more accurately able to predict a human-written piece of text. The standard way that that's measured is with this measure called perplexity. For perplexity, we take the probability of a prediction from the language model and invert it, so instead of it being, you know, .002 or something, we turn it into 500 or something like that. Then we take those numbers, take the product of them at each position in the text, and find the geometric average.
[00:03:56] But in this class we've been tending to look at negative log likelihoods and the idea of cross-entropy, and what perplexity is, is just the exponential of the cross-entropy. So if you're already familiar with per-word negative log likelihoods, you just exponentiate that and you get the perplexity. Now, there's one other little trick, as to what base you use for your logarithms and exponentials. Traditionally, thinking of binary and bits, a lot of the time people used base 2 for measuring perplexity. That's kind of gone out now; a lot of the time people are now using natural logs. But if you're comparing numbers, they're going to be different depending on what base you're using for things, so you need to be aware of this.
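Concretely, the relationship just stated, perplexity as the exponentiated per-word negative log likelihood, looks like this (the per-word probabilities below are made up for illustration):

```python
import math

def perplexity(word_probs):
    """Exponential of the average per-word negative log likelihood,
    equivalently the inverse geometric mean of the model's probabilities
    for the observed words (natural logs here)."""
    avg_nll = sum(-math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_nll)

# Made-up probabilities a model assigned to the four words of some text:
print(round(perplexity([0.1, 0.02, 0.5, 0.05]), 2))   # 11.89

# A model that assigns a uniform 1/64 to every word has perplexity 64:
print(round(perplexity([1 / 64] * 10), 6))            # 64.0
```

The base caveat above bites when a log likelihood computed in base 2 is exponentiated in base e, or vice versa; done consistently, as here, the bases agree on the perplexity.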
[00:05:04] Now, from a sort of modern perspective, it kind of makes no sense why perplexity is used. The story of why perplexity was used is that, you know, in the bad old days of symbolic artificial intelligence, when all of those famous people like John McCarthy and Ed Feigenbaum were around doing logic-based systems, some people at IBM, including Fred Jelinek, started exploring probabilistic methods for speech recognition and other similar problems. And the story Fred Jelinek used to tell was that at that time, this was in the late '70s or early '80s, none of the AI people he was trying to talk to understood how to do any real math, and didn't understand any information-theory notions like cross-entropy or cross-entropy rate. So he had to come up with something simpler that they could understand, and what he came up with, by doing this exponentiation, was perplexity.
[00:06:13] You can think of a perplexity number as being equivalent to how many uniform choices you're choosing between. So if the perplexity of something is 64, that's like having a 64-sided die that you're rolling each time, and your chance of getting a one on that is your chance of guessing the right word. So that was why perplexity got introduced, but it's kind of stuck, and when you see scores for language models you generally still see perplexities; a lower perplexity is better. So here are the kinds of numbers, and where progress was made with neural language models. Before that, people used n-gram language models, and people used clever ways to smooth them, using methods I vaguely alluded to last time, like add-k smoothing and doing backoff. And people used clever methods: around the 2000s, the cleverest known method for smoothing n-gram language models was this thing called interpolated Kneser-Ney smoothing.
[00:07:22] And for a big language model using that, the perplexity was about 67, which in some sense means that you weren't very good at predicting the next word. But you know, that had actually been enormous progress: when I was a young person doing NLP, perplexities were three-figure numbers; you were commonly seeing perplexities of 150 or something like that. So progress was made. When RNNs were first introduced, people weren't really able to do better with a sort of pure RNN, but they could do better by combining an RNN with something else, such as a symbolic maximum entropy model, which I'm not going to explain, and those got numbers like that 51. But where progress really started to be made was when LSTMs started to be used as an improved RNN, which is what I'm going to come to next.
[00:08:25] So here are some LSTM models, and now you're getting numbers like 43 and 30. And for 30, you've sort of halved the perplexity, which in cross-entropy terms means you've reduced the cross-entropy by about one bit, and so you've made real progress in your language modeling. Now, by modern standards these numbers are still really high: for the best language models that we have now, you're getting perplexities in the single digits; you're getting models that are very often able to guess exactly the right word, though of course not always, because no one can predict what word is going to be said by someone next in a lot of circumstances. Okay, so to motivate LSTMs, I wanted to say a bit about how there are problems with RNNs, and why that motivated fixing things: these are the problems of vanishing and exploding gradients.
[00:09:30] So what we wanted to do was say: okay, we've tried to predict a word at position four, and often we're not going to predict it with 100% probability, so we have a loss, the negative log likelihood we give to that word, and we're going to want to backpropagate that loss through the sequence and work out our gradients, as we always do. Now, just one note about something someone asked after class last time: I sort of showed backpropagating through the whole sequence, but we're doing this at every time step, right? So we're going to backpropagate a loss from time step two, backpropagate a loss from time steps 3, 4, 5, 6, 7; we're doing it for each one, and in one of the slides last time we then discussed how we're going to sum all of those losses, or work out the average loss. But for doing this one, when we backpropagate this loss, what happens?
[00:10:30] Well, what happens is we're going to do the same kind of chain rule, where we're multiplying these partial derivatives at every time step. Here we've only got a few of them, but maybe we're going to have a sequence 30 long, and so we're going to be multiplying, each time, the partial of h_k with respect to h_{k-1}. So what kind of effect is that going to have? In particular, we might ask what happens if these are small, or what happens if these are large. Well, if they're small, the gradient will gradually get smaller and smaller and disappear as we backpropagate it along the sequence. "Yeah, so why are we taking the partial of J with respect to h? Shouldn't we take the partial of J with respect to W?" Sure, I mean, we're doing that as well, but in general we have to walk the partials along, and then, you know, we then have a W at the next step.
[00:11:49] I mean, if we're thinking of the computation graph, as we do the chain rule backwards along it, we're going through a W at each step and then arriving at another h. Right. So at this point you can do some math and think about things, and there's a couple of papers mentioned at the bottom here, which I'm actually rushing ahead and not going to go through very carefully. But the point is that if you're taking the partial of h_t with respect to h_{t-1}, and if you make a simplifying assumption and suppose there isn't a nonlinearity, suppose sigma is just the identity, then what the partial will be is the matrix W_h. And so if you keep on backpropagating along the recurrent neural network, what you end up with is powers of the matrix W_h.
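What repeated multiplication by W_h does to a gradient is easy to see numerically. A small sketch under the same simplifying assumption (identity nonlinearity), using made-up diagonal matrices whose eigenvalues sit just below and just above 1:

```python
import numpy as np

def backprop_norms(Wh, g, steps=30):
    """Norm of a gradient vector after repeated multiplication by Wh^T,
    as when backpropagating through `steps` time steps of a linear RNN."""
    norms = []
    for _ in range(steps):
        g = Wh.T @ g
        norms.append(np.linalg.norm(g))
    return norms

g = np.ones(2)
shrink = 0.9 * np.eye(2)   # all eigenvalues 0.9 -> vanishing gradient
grow   = 1.1 * np.eye(2)   # all eigenvalues 1.1 -> exploding gradient

print(backprop_norms(shrink, g)[-1])   # about 0.9**30 * ||g||: tiny
print(backprop_norms(grow, g)[-1])     # about 1.1**30 * ||g||: huge
```

Only when the relevant eigenvalues sit at almost exactly one does the gradient survive 30 steps roughly unchanged, which is the knife-edge described in the lecture.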
of what happens when you raise that matrix to higher and higher powers. Well, at that point you can represent the matrix in terms of its eigenvectors and eigenvalues, and then there are two possibilities: either all the eigenvalues are less than one, and that means the number will be getting smaller and smaller and smaller as you raise it to higher powers, or it can have eigenvalues that are larger than one, and then things will get bigger and bigger as you go further back. [00:13:40] So essentially, as you backpropagate the gradients backwards, unless things are precisely corresponding to a largest eigenvector with an eigenvalue of approximately one, you're either going to get a vanishing or an explosion, and both of those will be kind of bad. [00:14:02] So why is a vanishing gradient a problem? I mean, in a sense, you could think it's not a problem; it's
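The eigenvalue argument above can be made concrete with a tiny numerical sketch. This is my own toy illustration, not from the lecture: the matrices and the 0.9 / 1.1 eigenvalues are made-up values chosen to show the two regimes.

```python
import numpy as np

def jacobian_product_norm(W_h, k):
    """Norm of W_h^k: under the identity-nonlinearity assumption above,
    this is the size of the gradient factor after backpropagating k steps."""
    return np.linalg.norm(np.linalg.matrix_power(W_h, k))

# Eigenvalues below one: the repeated product shrinks (vanishing gradient).
W_small = 0.9 * np.eye(2)
# Eigenvalues above one: the repeated product grows (exploding gradient).
W_large = 1.1 * np.eye(2)

for k in (1, 10, 50):
    print(k, jacobian_product_norm(W_small, k), jacobian_product_norm(W_large, k))
```

With eigenvalues of 0.9 the gradient factor after 50 steps is tiny; with 1.1 it is enormous. Only eigenvalues of roughly one keep it stable, which is the dichotomy described here.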
what should be happening, because all else being equal, you know, the closest words are the most relevant ones, and so that's where you should be updating your parameters the most. And to some extent that's true, but nevertheless the vanishing gradient in this model happens much too severely, so that if you're looking at the loss from a later position and comparing it to the loss from an earlier position, and seeing how things are updating, the update is primarily being determined by the very nearby loss and not by the far-away loss; the gradient signal from far away is much, much smaller. [00:15:02] And well, that's bad, because overall, for language modeling, there are lots of cases where we want to be able to transmit signals a long distance. [00:15:13] So here's my piece of text: when she tried to print her ticket, she found that the
printer was out of toner. She went to the stationery store to buy more toner; it was very overpriced. After installing the toner into the printer, she finally printed her... [00:15:29] Yeah, so, you know, for a human being it's obvious: we can predict this with pretty much probability one, so a really low perplexity for making this decision. [00:15:37] But that depends on getting back to the tickets, which are about 20-odd words back, right? If you're just seeing "installing the toner into the printer she finally printed her", it could be anything: it could be her paper, her invitation, her novel; lots of things it could be. You're certainly not going to guess "tickets". [00:16:02] So we sort of want to have these really long-distance dependencies, but we're only going to be able to learn these long-distance dependencies if we're actually getting sufficient signal between that position and when the word
[00:16:19] "tickets" appears near the beginning, so that we can learn the fact that having "tickets" 20 words back is the good predictive thing for predicting "tickets" here. [00:16:30] And what we find is, you know, when the gradient becomes very small, the RNN doesn't learn these kinds of long-distance dependencies, and so it's unable to make these predictions well at test time. [00:16:50] I mean, this is a very rough back-of-the-envelope estimate, but what people actually found is that with the kind of simple RNN that we've introduced up until now, the amount of effective conditioning you could get was about seven tokens back; if things were further back than that, it just never [00:17:15] learned to condition on them. And so, you know, compared to when we were talking about n-grams, and I said usually the maximum people did was a 5-gram, occasionally
a bit bigger, because of the fact that there was this exponential blowout: although in theory we've now got a much better solution, in practice, because of vanishing gradients, we're only getting the equivalent of an 8-gram. So we haven't made that much progress, it feels like. [00:17:47] So there's a reverse problem which can also happen: exploding gradients. If the gradient becomes very large, because the eigenvalues of that matrix are large, well, what we're doing for the parameter update is, you know, we've got a learning rate, but essentially if the gradient is very large we're going to make a very, very large parameter update, and that can cause very bad updates. [00:18:15] Because we're sort of assuming that we're taking a step in the direction of the gradient, and, well, we might overshoot a little, but we'll be roughly in the right zone. But, you know, if we had an enormously
[00:18:27] exploded gradient, well, we could be sort of walking off anywhere, and, you know, we think we're heading to the Sierras and we end up in Iowa or something like that, right? We could just go arbitrarily far, and where we're ending up might not be making any progress whatsoever. [00:18:47] So exploding gradients are a problem; they can also cause infinities and NaNs, and they're always a problem when you're training models. [00:18:59] Now, for dealing with exploding gradients, this is the accepted wisdom; this unfortunately isn't high-falutin' math, really. What people use for exploding gradients is a crude hack: they clip gradients. But, you know, it works really well, and you really want to know about this, because clipping gradients is often essential to having neural networks not have problems. So what we
do for gradient clipping is: we work out the norm of the gradient, and if it seems too large (and that varies, but normally 5, 10, or 20, something like that, is seen as the limit of what's okay for the norm of a gradient), if the norm of your gradient is too large, you just scale it down in every direction and you apply a smaller gradient [00:19:53] update. Um, it works. [00:19:59] Yeah, so that problem is solvable. But fixing the vanishing gradient seemed a more difficult problem, right? This was the problem: that our RNNs effectively couldn't preserve information over many time steps. And, well, what seemed to be the problem there? The problem seems to be really that we've got an architecture that makes it very hard to preserve information. [00:20:28] So if we look at the hidden state from one time step to the next time step, it's completely being rewritten, right? So we're taking the previous
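The clipping rule just described is a one-liner in practice. Here is a minimal numpy sketch of clipping by global norm; the function name is my own, and the 5.0 threshold is one of the example values mentioned:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Gradient clipping as described above: if the gradient's norm exceeds
    the threshold, rescale it so its norm equals the threshold. The direction
    of the update is unchanged; only its length shrinks."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])    # norm 50: too large, gets rescaled to norm 5
g_ok = np.array([1.0, 2.0])   # norm ~2.24: small enough, passed through unchanged
```

Deep learning frameworks ship this as a utility (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), so in practice you rarely write it yourself.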
time step's hidden, um, hidden vector, we're multiplying it by a matrix, which completely changes it in general, adding in other stuff from the input. [00:20:51] So if we'd just like to say: we'd like you to carry forward information, there's useful stuff in h_{t-1}, can you just keep it around for a while: it's not actually very easy to do in this formulation, because trying to learn W matrices that mostly preserve what was there before isn't at all an obvious thing to do. [00:21:17] So the question was: could we design an RNN which had a sort of a memory where it is easy to preserve information? Yes? [00:21:25] [Student:] So, in one of the earlier slides you mentioned the exponentiation; in their analysis, they removed the nonlinearity. So does the nonlinearity prevent vanishing or exploding? [00:21:43] No, it actually doesn't. I mean, you can make an argument
that it should help, because you've got, effectively, if you've got something like tanh, you've got a flattening function, so it should help somewhat, but it doesn't solve it, even if you're using a tanh nonlinearity. [00:22:01] Well, so, I guess, sorry, it should help with exploding, though actually even that still happens; but it definitely doesn't help with the vanishing. [Student:] If you have a sigmoid, that is bounded, so you're always pushing the value between zero and one, so it's not going up or going down; it's staying between zero and one. [00:22:34] Well, I guess, would it go up? And so you have a really small value that becomes one minus a really small value, sigma times one minus sigma... [00:22:47] Okay. Um, yeah, so can we have a different architecture, so that we have a memory that you can add to? And so that led into this new kind of neural network: the LSTM. [00:23:06] So this is going back a few years, but at any rate, um, this was trying to
improve Siri suggestions, and the big breakthrough that was being described was: oh, we're now using an LSTM in the keyboard prediction, and the whole advantage of that was going to be being able to predict context further back, so you could differentiate between "the children are playing in the park" versus "the Orioles are playing in the playoffs". [00:23:34] Okay, so the big thing that was seen as very successful was these LSTMs: long short-term memories. Just to say a little bit of the history here, right, just on how to parse this name, which I think people often don't even understand. [00:24:00] So what you were wanting to do was model short-term memory, right? Because, for humans, people normally distinguish between the short-term memory of stuff that you heard recently versus things that you permanently stored away. Um, and the suggestion was:
well, in short-term memory, humans can remember stuff for quite a while, right? You know, if you're having a conversation, you can still remember the thing that the person said a few turns ago in the conversation, and bring it back up: "oh, didn't you say they took last weekend off?" or something, right. And, well, the problem was that the simple RNNs' short-term memory was only about seven tokens, and so we'd like to make it better than that: we wanted long short-term memory, and that's where this name came about. [00:24:53] And so this was a type of recurrent neural network that was proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem. I mean, there's actually a second relevant piece of work that came a few years later: that first paper is the one that everybody cites, but there's then a second paper by Gers and Schmidhuber in 2000, which actually
[00:25:19] introduces a crucial part of the LSTM as we've used it in the 21st century that wasn't in the original paper. [00:25:28] And, you know, it's sort of an interesting story, all of this. So Jürgen Schmidhuber and his students did a lot of really crucial foundational work in neural networks in these years, the late years of the '90s, when just about everybody else had given up on neural networks. [00:26:00] So unlike these days, where doing pioneering work in neural networks is a really good way to get yourself hugely compensated jobs at Google, Meta, or OpenAI, it really wasn't back in those days. So, you know, if you ask what happened to these students, Hochreiter and Gers: both of them are still in academia, but Gers seems to have given up on AI and neural networks altogether and does stuff in the area of multimedia, um,
and SE Haw riter um is [00:26:39] multimedia um and SE Haw riter um is still in machine learning but you know [00:26:42] still in machine learning but you know for quite a long time he sort of [00:26:44] for quite a long time he sort of basically gave up on doing more General [00:26:46] basically gave up on doing more General neural network stuff and went into [00:26:48] neural network stuff and went into bioinformatics so if you look at his [00:26:50] bioinformatics so if you look at his Publications from about 2 [00:26:52] Publications from about 2 2015 um they were all in bioinformatics [00:26:55] 2015 um they were all in bioinformatics and most of them weren't using neural [00:26:57] and most of them weren't using neural networks at all um though um kind of [00:27:00] networks at all um though um kind of nicely I mean he's actually gone back [00:27:02] nicely I mean he's actually gone back into new networks more recently and is [00:27:05] into new networks more recently and is publishing a new networks again um yeah [00:27:08] publishing a new networks again um yeah so um really not much attention was paid [00:27:11] so um really not much attention was paid to this work at the time and so it only [00:27:14] to this work at the time and so it only sort of really kind of gradually seeped [00:27:17] sort of really kind of gradually seeped out further um so um Schmid hu had a [00:27:21] out further um so um Schmid hu had a later student in the mid 2000s decade [00:27:24] later student in the mid 2000s decade Alex Graves um and Alex Graves um did [00:27:29] Alex Graves um and Alex Graves um did more stuff with lstms and for people [00:27:32] more stuff with lstms and for people who've seen um speech recognition where [00:27:35] who've seen um speech recognition where people commonly do CTC loss and decoding [00:27:38] people commonly do CTC loss and decoding Alex Graves invented that but um most [00:27:42] Alex Graves invented that but um most crucially um Alex 
Graves then went to Toronto to be a postdoc for Geoff Hinton, and that brought more attention to the fact that LSTMs were a good model. And then Geoff Hinton went to Google in 2013, and that was then, sort of, the use of LSTMs at Google in the 2014-to-2016 period was when they really hit the world and became, for a while, the completely dominant framework people used for neural [00:28:17] networks. In the world of, I guess, startups, this is what you call being too early, for the first people. Um, yeah. [00:28:29] Okay, long short-term memories: back to the science. So, let's see, there's a slide here that talks about long short-term memories, but maybe I'll just skip straight ahead and start to show the pictures. [00:28:46] So we've still got a sequence of inputs x_t, and the difference now is, inside our neural network, we're going
to have two hidden things: one that's still called the hidden state, and the other one that's referred to as the cell state. [00:29:03] And so what we're going to do is, we're going to modulate how these things get updated by introducing the idea of gates. And gates are calculated things: vectors whose values are probabilities between zero and one, things that we're going to use to turn things on or shut them off in a probabilistic way. So we're going to control the movement of information by having gating. [00:29:37] And so we're going to calculate three gating vectors. These vectors are the same length as our hidden states, and the way we calculate these gating vectors is with an equation that looks basically exactly the same as what we were using for recurrent neural networks, apart from that the sigma there is definitely going to be the logistic function, which goes between zero and one, so we get
[00:30:02] probabilities. And the three gates we're going to calculate: there's a forget gate, which is going to say how much do we remember of the previous time's hidden state. I think the forget gate was actually wrongly named; I think it makes more sense to think of it as a remember gate, because it's actually calculating how much you're remembering. [00:30:25] Okay, then we've got an input gate, and the input gate is going to say how much are you going to pay attention to the next input, the next x_t, and put it into your hidden state. And then you have an output gate, and the output gate is going to control how much of what's in the cell, which is your primary memory, are you going to transfer over to the hidden state of the network. [00:30:54] Okay, so once we have those gates, what we're then going to do is have these equations, which are how we're going to update things. So the
[00:31:07] first thing we're going to do is work out a potential new cell content. So the new cell content is going to be calculated using exactly the same kind of equation we saw last time for recurrent networks: we're going to have these two matrices, the cell's W and the cell's U, and we're going to multiply one by the last time's hidden state and the other by the new input, add on a bias, and that's a potential update to the cell. [00:31:47] But then, how we're actually going to update the cell is by making use of our gates. So we're going to say: the new cell's content is going to be the old cell's content, Hadamard-producted with the forget gate, so that's how much to remember of the previous cell's content, plus this calculated update, Hadamard-producted with the input gate: how much to pay attention to this new potential update that we've, um, invented.
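Putting the gates and the cell update together, one full LSTM step can be sketched in numpy. This is a toy illustration under my own assumptions: the parameter names, sizes, and random initialization are made up, and the last line also computes the hidden state from the output gate and a tanh of the cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: three logistic gates, a candidate cell content,
    a gated cell update, and a hidden state read out through tanh."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])        # forget ("remember") gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])  # candidate cell content
    c_t = f * c_prev + i * c_tilde  # Hadamard products: keep some old cell, write some new
    h_t = o * np.tanh(c_t)          # expose part of the cell as the new hidden state
    return h_t, c_t

# Toy sizes and random parameters, just to run one step.
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
p = {}
for g in "fioc":
    p["W" + g] = rng.normal(size=(d_h, d_h))
    p["U" + g] = rng.normal(size=(d_h, d_x))
    p["b" + g] = np.zeros(d_h)
h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), p)
```

Because every gate is a sigmoid output in (0, 1) and the readout goes through tanh, each component of h_t stays strictly inside (-1, 1), while the cell c_t can accumulate additively over time.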
And then for calculating the new hidden state, that's going to be the Hadamard product between the output gate and our c_t having been put through a tanh. And one idea here: we're thinking about how much to keep on remembering what we've had in the past, but for thinking about only sending some information to the hidden state, a way to start thinking about that is that the hidden state of a recurrent neural network is doing multiple duty. On the one hand, we're going to feed it into the output to predict the next token; but another thing it does is store information about the past that might come in useful later, that we'd like to have carried through the sequence. And so really only some of what's in the hidden state do we want to be
using to predict the current word. Some of it isn't relevant to predicting the current word, but would be good stuff to know for the future. So if the previous words were "sat in", for predicting the next word we basically just need to know we're in a "sat in" context, where "the" or "a" will come next. But if earlier on the sentence had been saying "the King of Prussia", somewhere in the hidden state we want to be keeping the information that there's a King of Prussia, because that might be relevant for predicting future words. And so it makes sense that we only want some of what's in our memory being used to predict the next word in the current context. So the cell is our long short-term memory, and then, as we move over to the hidden state, those are the things that are going to be relevant for generation. Yeah, I've sort of said that. Okay: all these are vectors of the same
[00:34:29] length n. Yeah, so all of these things, both the gates and the new values for the cell and hidden state, are all vectors of length n. And part of what actually makes things convenient when you're running one of these is that, up to this point, all of these quantities have exactly the same shape, so you can put them all together into one big matrix and do the computations for all four of them as a single big matrix multiply if you want. Question: if you did not have the tanh activation in the hidden state update, couldn't the output gate have been expressed by the other gates? If this bit wasn't here, then you would not need an output gate, because f_t would have been able to account for it in some sense. My question is how much does having it... Right, well, no: to the extent that you want to mask out part of what's
in the cell, so that it's not visible when you're generating the next token, isn't it still useful to have an output gate? You don't want h_t equal to c_t; you want some of the contents of c_t to be masked out, so that you're not seeing it when generating the output. Wouldn't that masking have been accounted for by f_t? No, because you want to keep it in c_t: there's information you want to keep in c_t for the future, but that you don't want visible when generating the current next word. Yeah. In some sense, the bit I have the hardest time explaining is why it's necessarily better to have a tanh here. You can sort of argue that the cell can just stay unbounded real numbers, and then this gets it back into a shape that stays between minus one and one, which is good for the hidden state. But it's a little bit... I guess they did it
that way; it seemed to work well. Okay, here's another way of looking at it, which may or may not be more helpful: as a picture. So at each time step we've got, as before, an input and a hidden state, and then we're going to calculate an output from that hidden state, but we've got this more complex computational unit. And these pictures of the more complex computational unit were diagrams made by Chris Olah, who's someone who now works at Anthropic. And if you blow that up, this is showing the computation: you're feeding along recurrently the c cell as the primary recurrent unit, but you've also got h carried along, because h is being used to calculate stuff for the next time step, and then a new h is being generated. And so you're computing the forget gate; you're forgetting some of the cell
content. You're computing an input gate; you're using that to compute a potential new cell content; you write some of that into the cell, depending on the input gate. Then you compute an output gate, and some of the cell will go into the computation of h, depending on the output gate. And then, just like for the previous recurrent neural network, for working out what the predicted next word is, you work out an output layer by taking the h, doing another matrix multiply plus b_2, and then using a softmax on that to actually predict the next word. Okay, so this all seems very complex... do you have a question? Yeah: how are we deciding the threshold? I imagine just some sort of threshold around the probability of what we're remembering and what we're forgetting. Well, you know, what we're getting is more than a threshold,
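The answer here can be made concrete: there is no single scalar threshold, because each gate is a whole vector with one value per cell dimension. A toy sketch (the 22 dimensions and the hard 0/1 gate values are purely illustrative; real gates are soft sigmoid outputs):

```python
# No global threshold: the forget gate is a whole vector, one value per
# dimension of the cell, so the network can keep some dimensions intact
# and drop others. Hard 0/1 values here are for illustration only.
cell = [float(d) for d in range(1, 23)]       # a toy 22-dimensional cell state
forget = [1.0] * 17 + [0.0] * 5               # "keep dims 1-17, drop 18-22"
kept = [f * c for f, c in zip(forget, cell)]  # Hadamard product with the gate
```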
right, because we're actually calculating a whole vector of forgetting and remembering. So it can choose to say: okay, dimensions 1 to 17, keep all of that, and throw away dimensions 18 to 22; or, really, probabilistically to different extents. And so it's unspecified; it's up to the model what it learns. But we're hoping it will learn that certain kinds of information are useful to keep carrying forward for at least a while. But then we can use both the contents of the hidden state and the cell... sorry, the next input, to decide to throw away certain information. So we might think there are certain cues: for example, if it sees the word "next", it might think, okay, change of topic, now would be a good time to forget more stuff and reset. But it's learning which dimensions of this vector to hold on to, in an
unconstrained way, whatever's useful to do a better job at language modeling. Okay. So this all looks like a very complex and convoluted design, and quite honestly, when teaching this around 2016 and 2017, when this was the best kind of neural network we had for language modeling, we literally spent hours of class time going through LSTMs and variants of LSTMs with different properties, because there are different ways you can do the gating: you can have fewer gates or more gates and do different things. And it seemed the most important thing to know. In 2024 it's probably not the most important thing to know, but LSTMs are a thing to be aware of, and we are going to use them for assignment three. But you can just ask PyTorch for an LSTM and it'll give you one that does all of this stuff.
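As the lecture says, in practice you can just ask PyTorch for an LSTM rather than wiring the gates yourself. A minimal usage sketch (the sizes here are arbitrary; note that PyTorch stores the gate weights stacked into one matrix with 4 × hidden_size rows, which is exactly the "one big matrix" trick mentioned earlier):

```python
import torch

# nn.LSTM implements all the gate equations internally.
lstm = torch.nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)        # batch of 2 sequences, 5 time steps, dim 4
out, (h_n, c_n) = lstm(x)       # out holds the hidden state at every step

print(out.shape)                # torch.Size([2, 5, 3])
print(h_n.shape, c_n.shape)     # final hidden and cell states: [1, 2, 3] each
# The four gates' input weights live in one stacked matrix (i, f, g, o chunks):
print(lstm.weight_ih_l0.shape)  # torch.Size([12, 4]) == (4 * hidden, input)
```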
But there is one thing that I really want to focus on: what is the good thing that an LSTM achieves? And really, the secret of why you get this fundamentally different behavior in an LSTM is that you have that plus sign right there. For the simple recurrent neural network, at each time step the next hidden state was the result of multiplicative stuff, and therefore it was very hard just to preserve information. Whereas the essence of the LSTM is to say: well, look, you've got this past memory of stuff you've already seen, and what we want to do is add some new information to it, which fundamentally seems kind of right for human memories, that they're basically additive. And, as I said earlier, it was actually the second paper that introduced a crucial part of the LSTM: the first version of the LSTM didn't have the
forget gate, so it was a purely additive mechanism: you were deciding what to add to your memory as you went along. But that proved to be not quite perfect, because if you keep adding more and more stuff over a long sequence, that tends to become dysfunctional after a certain point, and so the big improvement was to add this forget gate, so that some of it went away. But nevertheless, having things basically additive fixes the problem of gradient flow: you no longer have vanishing gradients, and it makes it something that seems much more memory-like; you're adding to the things that you know. Okay. So the LSTM architecture allows you to preserve information over many time steps in the cell: if you set the forget gate to one and the input gate to zero, you're just linearly passing along in the cell, indefinitely, the same information.
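That limiting case is easy to check numerically. In the sketch below, the additive cell update with forget gate 1 and input gate 0 carries a value unchanged for 100 steps, while a purely multiplicative recurrence (a recurrent weight of 0.5, chosen just for illustration) drives the same value toward zero:

```python
# If the forget gate is 1 and the input gate is 0, the additive update
# c_t = f * c_{t-1} + i * c_tilde_t passes the cell along unchanged, while
# a purely multiplicative recurrence with weight < 1 vanishes.
c = 3.0                        # a value stored in the LSTM cell
h = 3.0                        # the same value in a simple-RNN hidden state
for _ in range(100):
    c = 1.0 * c + 0.0 * 0.7    # f = 1, i = 0: the candidate (0.7) is ignored
    h = 0.5 * h                # repeated multiplication by the recurrent weight
```

The surviving `c` versus the vanished `h` is the whole argument for the additive cell path in one loop.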
Okay, it's not the only way that you can do long-distance information flow, and we're going to look increasingly, in future lectures, at other ways you can do long-distance information flow. And just to give a bit of a peek at those now, and to think about other architectures... but is there a question? Yes: since you're mentioning that those plus signs help with vanishing gradients, does it help with exploding gradients at all, does it make it worse, or is there no difference? No, it also helps with exploding gradients, because you're not doing this sequence of multiplies all the time; you have this addition operator. So, one thing you could wonder is whether vanishing and exploding gradients are just a recurrent neural network problem. And they're not. I mean, it occurs earlier and worse when you've got long sequences, but if you start building a
[00:44:45] very deep neural network, surely the same thing is happening. The parameters aren't the same, so it's not quite just raising one matrix to a power, but surely, depending on your matrices, you tend to have the same problem: either your gradients are disappearing or else they're exploding. And that's what people found, and that was part of the reason why, in the early days, people weren't very successful at building deep neural networks: they suffered from problems of this sort. If you had basically vanishing gradients in a deep neural network, you got very little gradient signal in the lower layers; therefore the parameters didn't really update; therefore your model didn't learn anything in the lower layers; therefore the network didn't work well. And that was part of why things were stuck in the days around the early 2000s.
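A toy numerical illustration of this vertical version of the problem: stack 50 one-unit tanh layers with a modest weight (0.8 here, chosen only for illustration) and multiply the local derivatives the way backpropagation would. Almost no gradient signal reaches the bottom:

```python
import math

# Backprop through a deep stack multiplies a local derivative at every layer.
# With tanh units and modest weights, each factor is below 1, so the gradient
# reaching the lowest layers shrinks geometrically (toy one-unit layers).
w = 0.8
x = 0.5
acts = []
for _ in range(50):               # forward pass through 50 stacked tanh layers
    x = math.tanh(w * x)
    acts.append(x)
grad = 1.0
for a in reversed(acts):          # chain rule: d tanh(z)/dz = 1 - tanh(z)^2
    grad *= w * (1.0 - a * a)
```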
Deep networks didn't work. And so there are other ways you can think about fixing that. One common way is to add more direct connections. The problem, when we went through our recurrent step, was that we had this in-between stuff of doing a matrix multiply and blah blah blah, and that caused indirectness and the possibility for things to either explode or vanish. Now, this network is drawn sort of upside down; I stole the picture from the paper, so we'll just have to deal with that: we're going downwards from here to the next layer. So rather than going through weight layers and weight layers, which will start to produce the same kinds of problems, what you can do is apply the same trick in a vertical network and say: well, look, I can also just carry the input around with an
[00:46:44] identity function and add it on here, and so then I've got this direct carrying of information. And that led to the residual network, which was what completely transformed computer vision models and made them much more learnable than pure networks that lack these residual connections. If you start heading down that path, you can think: well, why only provide these residual loops that take you one step? Maybe I could directly connect each layer to all the successive layers. And people played with that idea, and that led to the so-called DenseNet, where you have these kinds of skip connections linking to every other layer. A variant of the residual network (the ResNet), which was actually again introduced by Schmidhuber and students, was to say: well, rather than just directly adding in the input
[00:47:54] summed with the output of the neural network layer, maybe we'd be better off having gating, so that you're deciding, via gates, how much of the input to have skip around. And so that led to a variant, the Highway Net, where you've got gated residual networks. So, various ideas for doing that. I'm not going to say more about that right now; I want to skip ahead, do the rest of neural nets, and get on to machine translation. Okay. So once you have RNNs (where "RNN" includes LSTMs; normally, in practice, LSTMs), you can use them for anything else where you're doing sequences, and so there are lots of places they're used in NLP. If you want to assign words parts of speech, like nouns and verbs, that would commonly be done with a part-of-speech tagging LSTM. If you want to be assigning named entity labels, like location... right, I did this toy version where we were assigning a label to the
[00:49:03] middle of a window, but if you want to assign a label at each position, you can use an LSTM for named entity recognition. You can use an RNN as an encoder model for a whole sentence. So if we want to do sentiment classification, to see whether a piece of text is positive or negative, we can run an LSTM over it and then use this as a representation of the sentence, to work out whether it's a positive or negative piece of text. And the simplest way of doing that is to use the final hidden state, because, after all, that final hidden state is the hidden state you've gotten from having seen the entire sentence; use that, and then have a classification layer, a logistic regression, on top of it to give you positive or negative. In practice, though, people have found it's often better to use every hidden state and take some kind of mean or
element-wise max, and feed that in as the sentence encoding. You can also use RNNs for lots of other purposes where you're using them to generate text based on other information. So if you want to do speech recognition, or summarization, or machine translation, which we'll come to later, you can have an input source which you'll use to condition your network, and then you'll generate the speech recognition output or the machine translation, as we'll see later. And so we refer to those as conditional language models, because rather than just generating text starting from nothing, from a start token, we're generating it conditioned on some source of information. One other idea on what normally happens when people use these: I suggested that we could do this averaging at each position. If you think about these hidden state
[00:51:24] If you think about these hidden-state representations, the representation at "terribly" isn't only about the word "terribly": it has some information about what came before it, "the movie was terribly", but it has no information about what comes after it. You might think you'd like a representation of "terribly" that knows what came before it but also what came after it, and so people came up with the next obvious idea to deal with that, which was to build a bidirectional LSTM. You ran a forward LSTM, and then you started another LSTM, shown in that sort of greenish teal, and ran it backwards. Then you had a forwards and a backwards vector at each position, and you just concatenated them both, and then you had a two-sided context for a representation of word meaning. These networks were pretty widely used.
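A bidirectional run can be sketched with a toy recurrence. Everything here is illustrative: scalar hidden states, made-up fixed weights, and a plain tanh recurrence; a real bidirectional LSTM would use vectors, gates, and learned parameters.

```python
import math

def run_rnn(xs, w=0.5, u=0.3):
    """Toy RNN: scalar hidden state h_t = tanh(w*x_t + u*h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(xs):
    fwd = run_rnn(xs)                                  # left-to-right pass
    bwd = list(reversed(run_rnn(list(reversed(xs)))))  # right-to-left pass
    # Concatenate the forward and backward states at each position,
    # giving every position a two-sided context.
    return [(f, b) for f, b in zip(fwd, bwd)]

reps = bidirectional([1.0, -1.0, 0.5])  # one (fwd, bwd) pair per word
```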
[00:52:34] So we were running a forward RNN and a backward RNN and concatenating the states together, and those were commonly written like this, to suggest in a compact way that you're running a bidirectional RNN. These were very popular for language analysis; they weren't usable if you wanted to generate text, but people were using them in a lot of places as a representation. More recently, though, Transformer models have largely taken over from that. One more idea, which we'll see for machine translation: RNNs are sort of deep in the sense that they unroll over many time steps, but up until now they've only been shallow RNNs in the sense that we just had one hidden state. You can also make them deep by having multiple layers of hidden states, which is commonly called stacked RNNs, so you'd have several layers of RNNs built above each other.
[00:53:44] And you might wonder, does this really do anything, or are they just big vectors above the words? But precisely because you have this extra neural network layer between here and here, you get exactly the same power advantage you get elsewhere with neural networks: you can do successive layers of feature extraction, and so you get more power out of your neural network, to some extent. What people found with RNNs in those days is that having multiple layers definitely helps, but unlike what was happening in those days with other kinds of neural networks, for vision etc., people still used relatively shallow RNNs: you always got a lot of gains by having two layers rather than one, but it was commonly more iffy whether you got extra value from three or four layers.
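The stacking idea can be sketched with the same kind of toy recurrence: each layer reads the hidden states of the layer below as its inputs. The scalar states and fixed weights here are purely illustrative, not a real stacked LSTM.

```python
import math

def run_layer(inputs, w=0.5, u=0.3):
    """One toy RNN layer: h_t = tanh(w * input_t + u * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def stacked_rnn(xs, num_layers=2):
    """Stacked RNN: layer k consumes the hidden states of layer k-1."""
    seq = xs
    for _ in range(num_layers):
        seq = run_layer(seq)  # each pass adds one more layer of features
    return seq  # hidden states of the top layer, one per time step

top = stacked_rnn([1.0, -1.0, 0.5], num_layers=2)
```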
[00:54:51] So commonly people were running two- or three-layer LSTMs, and that's what people were using. But that's completely changed around in the world of Transformers, where nowadays people are building very deep Transformer networks for doing language understanding. Okay, but I should skip ahead and say a few words, before time runs out, about machine translation. Machine translation is one of the key natural language processing tasks, where we're translating sentences in one language into sentences in another language. We're starting off with a sentence in some language, here French, and what we want to do is output it in a different language, here English. Machine translation was actually where NLP started: in the early 50s there wasn't artificial intelligence yet, there wasn't a field of NLP yet, but people started to work on machine
[00:56:04] translation. The story of why people started to work on machine translation was essentially this: computers were first developed during the Second World War, and during the war computers were used for two things. One of them was calculating artillery tables, to work out what angle to put your gun at to get the shell to land in the right place; not very relevant to what we're doing. But the other thing computers were used for was code breaking. After the Second World War things moved very quickly into the Cold War, and there were concerns on both sides about keeping up with the science that was being developed on the other side, and people had the idea of, gee, maybe we could think of translation between languages as like code breaking. That thought occurred
[00:57:08] to important, relevant people and science funding agencies, and actually lots and lots of funding was poured into this idea of: can we use computers to do machine translation between languages? At the time, in the 50s, after some initial very impressive-looking cooked demos, it was basically a complete flop. There are lots of reasons why it was a complete flop. One was that people knew almost nothing about the structure of human languages; in particular, remember the Chomsky hierarchy I was mentioning the other day, and knowing about context-free languages: the Chomsky hierarchy hadn't even been invented yet, so the formal properties of languages hadn't been explored. But also, the computers that people had in the 1950s: the amount of computing power or memory or
[00:58:10] anything like that that those computers had in those days was laughable. These days the little power brick for your laptop has more computing power inside it than the big mainframe computers they were using back then. So basically people were only able to build very simple lexicons and rule-based substitution rules, nothing like the complexity of human languages, which people only gradually began to understand. But machine translation started to come alive in the 1990s and 2000s, once people started to build empirical models over lots of data, and the approach then was called statistical machine translation. When Google Translate was first introduced, it was sort of the big unveiling to the world of statistical phrase-based machine translation systems, where what you were doing was collecting a
[00:59:15] large amount of parallel data: text that has been translated from one language to another. Not for all languages, but for quite a few languages there are quite a few sources of parallel data: the European Union generates a huge amount of parallel data among European languages; there are places like Hong Kong where you get English–Chinese (well, a certain dialect of Chinese) parallel data; the UN generates a lot of parallel data. So people were getting sources of parallel data and trying to build models, and the way it was done was: based on that data, we're going to try and learn a probability model for translation, the probability of a translation given a source sentence. The way it was done at that time was breaking it down, using Bayes' rule, into two subproblems: the probability of the translation given the source is
[01:00:22] going to be the inverted probability of the source given the translation, times the probability of the translation. You could think that this makes it no simpler, because you've just reversed the order of x and y, but the reason it made things simpler, and people were able to make progress, was that the translation model was treated as a very simple model of how words tended to get translated into words in the other language. It didn't need to know anything about word order or the grammar structure of the other language; all of that was handled by this probability of y, which was a pure language model, as we've talked about before. So you could have a simple translation model which just said: if you see the word "homme" in French, you might want to translate it as "man" or "person", and put some probabilities on that.
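That division of labor can be shown with a tiny toy ranking. All numbers and vocabulary here are invented for illustration (the "homme" example is borrowed from above); note that the word-level translation model deliberately scores "man the" and "the man" the same, so only the language model supplies the preference for the fluent order.

```python
# Toy noisy-channel ranking: pick the translation y maximizing
# P(x | y) * P(y). All probabilities are made up for illustration.

# P(source | translation): a crude word-level translation model that
# knows nothing about word order, so both orders score the same.
translation_model = {
    ("l'homme", "the man"): 0.6,
    ("l'homme", "man the"): 0.6,
}

# P(translation): the language model supplies the fluency preference.
language_model = {"the man": 0.05, "man the": 0.0001}

def decode(source, candidates):
    """Return the candidate translation with the highest P(x|y) * P(y)."""
    return max(candidates,
               key=lambda y: translation_model[(source, y)] * language_model[y])

best = decode("l'homme", ["the man", "man the"])
```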
[01:01:22] And then most of the cleverness was in the language model, which was telling you what would be a good sentence in the target language. Okay, and that was important because translations get pretty complicated: you not only have to know how to translate words, and those translations of words vary in context, but you also get a lot of reordering of words in sentences. I'm not going to be able to spend a lot of time on this, but for a while here was my favorite example machine translation sentence. It's actually a translated sentence: the original comes from the book Guns, Germs, and Steel, if you're familiar with that, the book by Jared Diamond. This book was translated into Chinese, so here's a sentence from the book in Chinese. I guess in the 2000s decade I
[01:02:34] was involved in building statistical machine translation systems, and there was an MT evaluation that we did where our system did terribly on this sentence; I tried it out on Google Translate, and it also did terribly on this sentence. What the sentence should say is: "In 1519, 600 Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two-thirds of their soldiers in the initial clash." Here's what Google Translate said in 2009: "1519 600 Spaniards landed in Mexico millions of people to conquer the Aztec empire the first two-thirds of soldiers against their loss." Now, it's partly bad because the word choices in the translation aren't very good, but it's especially bad because it's just not able to capture and use the modification relationships of the sentence. So you
[01:03:43] know, here's the part of the Chinese that's saying "the Aztec Empire", and over there in orange is the "few million people", and in Chinese there's this explicit little character, 的 (de), which is saying that the stuff in orange modifies the stuff in green, which is what it should be in the correct translation, "Aztec Empire with a population of a few million". But Google Translate completely fails on that, and suddenly it's the millions of people who are going to be conquering the Aztec Empire. That's in some ways the worst thing happening here, though the "1519 600" isn't exactly a very good translation, and "the first two-thirds of soldiers against their loss" isn't very good either. So for a while I used to update this and see what happened: in 2013 it almost seemed like progress had been made, but
[01:04:49] by 2015 it had gone downhill, back to how it was before, so it just seemed like they got lucky in 2013 rather than the systems working any better. And indeed this seemed to be the problem: although some kind of progress had been made in machine translation, these systems just never really worked all that great. That led to this amazing breakthrough in 2014, where we moved to neural machine translation, and neural machine translation was much better. So what did we do in neural machine translation? We built a neural machine translation system as a single end-to-end neural network, and that's been a powerful idea in neural network systems in general, including in NLP: if we can just have a single big system and put a loss function at the end of it, then we can backpropagate errors right back down
[01:05:58] through the system. That means we're aligning all of our learning with the final task we want to do, and that's been very effective, whereas earlier models couldn't do that. We built it with a sequence-to-sequence model. That sounds like our LSTMs, but it means we're going to have two of them: one to encode the source sentence and one to produce the target sentence. That's what we're building. For the source sentence, here it says RNN, but let's just think LSTM, because that's what we're going to use in practice; it's much better. We're going to chunk through the source, encoding what we've read using an RNN. This RNN isn't going to output anything; we're just building up a hidden state that knows what's in the source sentence. So again, an encoding of the source sentence, and we're going to use that
[01:06:59] final hidden state to condition the decoder RNN, which is then going to generate the translation. The decoder RNN is also an LSTM, but it's an LSTM with different parameters: we're learning one LSTM with source-encoding parameters, and then for the other language we're learning a different LSTM whose parameters all know about the target language. So we give it a start token and say: feed in what you've encoded from the encoder RNN as your starting point; that'll count as the previous hidden state you're feeding into your LSTM. Then we generate the first word of the translation, and we copy that translated word down, using this as a generative model as I did last time, and we translate through: "he hit me with a pie". Okay, so does that model sort of make sense? Yeah? Okay.
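The runtime loop just walked through can be sketched as control flow. The encoder and decoder here are stubs (a canned next-word table standing in for a trained LSTM), and the French source words are only placeholders; the point is the shape: encode once, then repeatedly feed each generated word back in until an end token.

```python
def encode(source_words):
    """Stub encoder: pretend the final hidden state is the source itself."""
    return tuple(source_words)

NEXT_WORD = {  # canned decoder behavior standing in for a trained LSTM
    "<s>": "he", "he": "hit", "hit": "me",
    "me": "with", "with": "a", "a": "pie", "pie": "</s>",
}

def decoder_step(hidden, prev_word):
    """Stub decoder step: returns (new hidden state, next word)."""
    return hidden, NEXT_WORD[prev_word]

def translate(source_words, max_len=20):
    hidden = encode(source_words)   # condition on the encoded source
    word, output = "<s>", []
    for _ in range(max_len):
        hidden, word = decoder_step(hidden, word)
        if word == "</s>":          # stop at the end token
            break
        output.append(word)         # feed this word back in next step
    return output

translation = translate(["il", "a", "m'entarté"])  # placeholder source
```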
[01:08:09] So, there's a note, sorry; what I was going to say, yeah, the little pink note here: what I was showing you is the picture of using it at runtime. At runtime we're going to encode the source and then generate the words of the translation. At training time we're going to have parallel text, sentences with their translations. We run the same architecture, but as before, for the decoder network we try to predict each word and then ask: what probability did you assign to the actual next word? That gives us a loss, and we'll be calculating the losses at each position, working out the average loss, working out the gradients, backpropagating them through the entire network, both the decoder RNN and the encoder RNN, and updating all the parameters of our model.
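The per-position losses described here can be sketched numerically. The per-step probabilities below are invented; in a real system they would come from the decoder's softmax over the vocabulary, and the gradients of this average loss would be backpropagated through both networks.

```python
import math

def sequence_loss(step_probs):
    """Average negative log-probability the model assigned to each
    actual next word, one probability per target position."""
    return sum(-math.log(p) for p in step_probs) / len(step_probs)

# e.g. the decoder was fairly confident at most target positions:
loss = sequence_loss([0.9, 0.7, 0.8, 0.95])
```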
[01:09:13] That's the sense in which it's being trained end to end. Okay, so this is the general notion of an encoder-decoder model, which is a very general thing we use in all kinds of places: we have one network that encodes something, producing a representation which will then feed into another network that we'll use to decode something. Even when we go on to do other things, like use Transformers rather than LSTMs, we're still commonly going to use these kinds of encoder-decoder models, because if we want to do not only machine translation but other tasks, like summarization or text-to-speech or other things like that, we're going to be in this space of using encoder-decoder networks. Yeah? "What is the difference between this encoder-decoder model and just using a deeper neural network with more layers?" Um, well, a lot is sequenced, right, so it
has never been very Su you're meaning [01:10:26] it has never been very Su you're meaning like why don't you just build on top of [01:10:28] like why don't you just build on top of the source right um people have tried [01:10:32] the source right um people have tried that occasionally it's never been very [01:10:35] that occasionally it's never been very successful and I think part of the [01:10:37] successful and I think part of the reason is all of what I was trying to [01:10:39] reason is all of what I was trying to show before about all all of the word [01:10:41] show before about all all of the word order changes around a lot between [01:10:44] order changes around a lot between languages and if you're sort of um just [01:10:46] languages and if you're sort of um just trying to build stuff on top of the [01:10:49] trying to build stuff on top of the source sentence it's very hard to cope [01:10:52] source sentence it's very hard to cope with that in particular it's not even [01:10:55] with that in particular it's not even the case that the length stays the same [01:10:57] the case that the length stays the same right one of the big ways um in which [01:11:00] right one of the big ways um in which languages vary is what little words that [01:11:03] languages vary is what little words that they have right so that in English [01:11:05] they have right so that in English you're putting in a lot of these [01:11:06] you're putting in a lot of these auxiliary verbs and articles whereas [01:11:09] auxiliary verbs and articles whereas it's in Chinese you don't have any of [01:11:11] it's in Chinese you don't have any of those and so you're neither needing to [01:11:14] those and so you're neither needing to depending on Direction add a lot of [01:11:16] depending on Direction add a lot of words or subtract a lot of words which [01:11:18] words or subtract a lot of words which is very hard to do if you're sort of [01:11:20] is very hard to do if you're sort of building 
on top of the source of [01:11:22] building on top of the source of it ah is it quick uh yeah so left side [01:11:27] it ah is it quick uh yeah so left side is that b directional or just like a [01:11:29] is that b directional or just like a like the encoder um yeah so you you [01:11:33] like the encoder um yeah so you you totally think and it could be that the [01:11:37] totally think and it could be that the encoder is bidirectional and that might [01:11:40] encoder is bidirectional and that might be better um for the for the famous [01:11:43] be better um for the for the famous original instantiation of this that was [01:11:45] original instantiation of this that was done at Google they actually didn't make [01:11:47] done at Google they actually didn't make it bir directional so it was simply [01:11:49] it bir directional so it was simply taking the final hidden state but that's [01:11:52] taking the final hidden state but that's absolutely an alternative that you could [01:11:54] absolutely an alternative that you could do okay okay [01:11:58] um yeah so I sort of said it was um okay [01:12:03] um yeah so I sort of said it was um okay usable for lots of things okay um yeah [01:12:06] usable for lots of things okay um yeah so this is our um conditional language [01:12:09] so this is our um conditional language model um this so we're now kind of [01:12:12] model um this so we're now kind of directly calculating the probability of [01:12:15] directly calculating the probability of Y given X right that the decoder model [01:12:18] Y given X right that the decoder model is generating um uh language expression [01:12:22] is generating um uh language expression as a language model directly conditioned [01:12:25] as a language model directly conditioned on X um and so we train it with a big [01:12:29] on X um and so we train it with a big parallel Corpus um and that's the only [01:12:32] parallel Corpus um and that's the only case I'm going to talk about today 
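The encoder-decoder pattern described above can be sketched as a tiny schematic. The stand-in "networks" here are arbitrary toy functions, not LSTMs; the point is only the shape: encode the input down to a representation, then decode everything from that representation.

```python
class EncoderDecoder:
    """Generic encoder-decoder: one network produces a representation,
    a second network consumes it to produce the output."""
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def __call__(self, source):
        representation = self.encoder(source)   # e.g. a final LSTM hidden state
        return self.decoder(representation)     # e.g. generate target words

# Toy stand-ins: the encoder summarizes the input as a single number,
# the decoder expands that summary back into a sequence.
encode = lambda tokens: sum(tokens)
decode = lambda rep: [rep, rep + 1, rep + 2]

model = EncoderDecoder(encode, decode)
result = model([1, 2, 3])   # the decoder only ever sees the representation, 6
```

The same shape covers machine translation, summarization, and text-to-speech: only what the encoder reads and what the decoder emits change.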
[01:12:36] Recently there's been some interesting work on unsupervised machine translation, meaning that you've got only a little bit of information about how the languages relate and don't really have a lot of parallel text, but I'm not going to cover that today. So for training, we have paired sentences, we work out our losses on the predictions at each position, and then we work out our average loss and backpropagate it through a single system end to end, as described.

[01:13:15] In practice, when people built big machine translation systems, this was one of the places where it absolutely gave value to have multi-layer stacked LSTMs, and so typically people were building a model something like this: a multi-layer LSTM that's being used to encode and decode.
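The stacking itself can be sketched independently of the cell: each layer's output sequence becomes the next layer's input sequence. In this toy sketch the recurrent cell is a plain tanh RNN standing in for an LSTM cell, with made-up random weights; a real MT system would use LSTM cells and learned parameters.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_layer(xs, W_x, W_h):
    """Run one recurrent layer over a sequence; return the hidden state
    at every step. (Stand-in cell: tanh RNN, not a real LSTM.)"""
    h = [0.0] * len(W_h)
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]   # new state from input + old state
        outputs.append(h)
    return outputs

def stacked_rnn(xs, layers):
    """Multi-layer (stacked) RNN: layer k's outputs feed layer k+1."""
    seq = xs
    for W_x, W_h in layers:
        seq = rnn_layer(seq, W_x, W_h)
    return seq   # hidden states of the top layer at each position

random.seed(0)
d = 4
rand_mat = lambda: [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
layers = [(rand_mat(), rand_mat()) for _ in range(3)]        # three stacked layers
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(5)]  # 5-step input
top = stacked_rnn(xs, layers)   # top[-1] is what would seed the decoder
```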
[01:13:45] In my two minutes remaining, I just want to quickly say that building these neural machine translation systems was really the first big success of deep learning for natural language processing. In a sense, it depends on how you define what counts as language: if you look at the history of the renaissance of deep learning, the first place where deep learning was highly successful was speech recognition systems, the second place was object recognition in vision, and the third was building machine translation systems.

[01:14:32] So, Google had a big statistical machine translation system, and it was only in 2014 that people first built this sort of LSTM deep learning machine translation system, [01:14:53] but it was just so obviously good that in only two years it was deployed as the live system being used at Google. And it wasn't only Google: the new neural machine translation was just so much better than what had come before that within a couple of years absolutely everybody, both US companies and Chinese companies, Microsoft, Facebook, Tencent, Baidu, was using neural machine translation, because the systems were just much better. This was an amazing success, because statistical machine translation systems like the Google one were something that had been worked on for about a decade: hundreds of people had worked on them, there were millions of lines of code, and lots of hacks built in for particular languages and language pairs. Yet a simple, small neural machine translation system was able to work much better.

[01:16:00] There was an article published in the New York Times when it went live, which you can find at that link. It's a somewhat praising piece, where you could be a little more critical, but basically it talks about how the difference in quality was so obvious that everyone immediately noticed, even before Google had announced it: wow, machine translation's gotten so much better.

[01:16:31] Okay, so that's basically today. For today, we've learned that LSTMs are powerful: if you're doing something with a recurrent neural network, you probably want to use an LSTM. You should know about the idea of clipping your gradients. Bidirectional LSTMs are good when you've got an encoder, but you can't use them to generate new text. And encoder-decoder neural machine translation systems were great new technology that advanced the field.
Thank you.

================================================================================
LECTURE 007
================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 7 - Attention, Final Projects and LLM Intro
Source: https://www.youtube.com/watch?v=J7ruSOIzhrE

---

Transcript

[00:00:05] Okay, welcome everyone; we're into week four now. For today, what I want to do is, first of all, a couple more bits on machine translation, especially talking a little bit about evaluating machine translation. Then I want to spend a while on attention. Attention is a very fundamental concept of neural networks, which was originally developed in the context of machine translation; it's also then a very central concept when we're talking about Transformers, which we start talking about on Thursday.

[00:00:49] Okay, so getting straight into it: this is the picture that we saw towards the end of last time.
[00:00:59] This is how we were building a machine translation system: we're feeding a source sentence into a multi-layer LSTM and then flipping to turn the model into a decoder, with different parameters, which generates one word at a time to produce the translated sentence. So here I've got a German sentence, and it's produced an English translation that looks a pretty good one. But we're going to want a way of deciding whether we're producing good translations or not, and so we need some way to evaluate machine translation. Now, this is a complex area, because if you start poking around in the literature, people have proposed literally hundreds of different measures that could be used to evaluate machine translation systems; I've even written a couple of papers on it myself, so I've contributed to the problem.
[00:02:09] But by far the most common measure that you see to this day was essentially the first measure proposed for automatically evaluating machine translation, which is the BLEU measure. BLEU was said to stand for "bilingual evaluation understudy", though that went along with the fact that it was proposed by IBM ("Big Blue"), probably not a coincidence. Up until that point, the only way people had really used for evaluating translations was getting human beings to look at them and say how good a translation is, and that's still a gold-standard measure that is widely used, because many of the automatic measures have various kinds of biases and problems that keep human evaluation useful. But on the other hand, a lot of the time we'd like to iterate quickly on evaluations.
[00:03:13] We'd like to use evaluations in training loops and things like that. And the IBM people, with the BLEU paper, suggested: well, maybe we can come up with a halfway decent automatic method of evaluating translations. The idea they proposed was this: we're going to have one or more reference translations for a piece of text, which are human-written translations, and then we can score any automatic translation mainly on how often it has overlapping one-, two-, three-, and four-grams with one of the reference translations. The number four isn't special; you could have gone up only to three, or to five, but four was seen as a reasonable length. The more overlap you have, the better. We have a discussion of this evaluation in the assignment, so you can think about it a bit more, and I won't go through all the formulas right now, but that's most of it.
[00:04:19] And so here's a picture of how that looks. The original idea was that we should have several reference translations; then we'd get a machine translation, look at it, and try to find pieces of it in the reference translations. So we can certainly find the unigram "the"; we can't find "American" at all, but we can find "International Airport and its" in the second reference translation, so we're going to get a 4-gram match for that. Some stretches of this not-very-good translation miss entirely, but then you start to find other pieces that do overlap, and you use those to work out a score. The original idea was that you should always have multiple reference translations, so that you can sample the space of possible translations
[00:05:21] you can sample the space of possible translations and have reasonable [00:05:23] translations and have reasonable coverage in practice for what's been [00:05:25] coverage in practice for what's been done more recently it's not so uncommon [00:05:28] done more recently it's not so uncommon that people do this with one reference [00:05:30] that people do this with one reference translation and the argument then is [00:05:33] translation and the argument then is still on a kind of a probabilistic basis [00:05:36] still on a kind of a probabilistic basis the more often you have a good [00:05:37] the more often you have a good translation the more often you'll get [00:05:39] translation the more often you'll get matches and therefore your score will be [00:05:43] matches and therefore your score will be better yeah so um [00:05:47] better yeah so um why you know why did people come up with [00:05:51] why you know why did people come up with this and why is it still imperfect well [00:05:55] this and why is it still imperfect well the problem with translation is that [00:05:58] the problem with translation is that there isn't one right answer it's not [00:06:01] there isn't one right answer it's not like the kind of classification things [00:06:02] like the kind of classification things you see in machine learning where you [00:06:04] you see in machine learning where you show people a picture and the right [00:06:06] show people a picture and the right answer is to say this the class of this [00:06:10] answer is to say this the class of this object um is whatever labut or right dog [00:06:14] object um is whatever labut or right dog breeds or something right that for any [00:06:17] breeds or something right that for any sentence there are many different ways [00:06:19] sentence there are many different ways to translate it and you know translators [00:06:22] to translate it and you know translators can sit around and argue that oh this [00:06:24] can sit 
around and argue that oh this phrasing is a little bit nicer than this [00:06:25] phrasing is a little bit nicer than this phrasing blah blah blah but to a first [00:06:28] phrasing blah blah blah but to a first approximation you can translate sentence [00:06:30] approximation you can translate sentence in lots of ways um and those different [00:06:32] in lots of ways um and those different ways of translation can involve [00:06:34] ways of translation can involve different word orders so you can't [00:06:36] different word orders so you can't really sort of check the words off as [00:06:38] really sort of check the words off as you come down in the sentence and that's [00:06:41] you come down in the sentence and that's what motivated this idea of sort of [00:06:43] what motivated this idea of sort of matching engrams anywhere so you can get [00:06:46] matching engrams anywhere so you can get reasonable credit um for having the [00:06:49] reasonable credit um for having the right matches but you know nevertheless [00:06:52] right matches but you know nevertheless it's a pretty crude version of it right [00:06:57] it's a pretty crude version of it right um you know you can still get a poor [00:06:59] um you know you can still get a poor blue score for good translation just [00:07:01] blue score for good translation just because the words you chose didn't [00:07:03] because the words you chose didn't happen to match a reference translation [00:07:06] happen to match a reference translation and also you can get points for things [00:07:09] and also you can get points for things without really having a good translation [00:07:11] without really having a good translation at all right if you just have words that [00:07:13] at all right if you just have words that match even if they're having completely [00:07:16] match even if they're having completely the wrong role in the sentence you will [00:07:19] the wrong role in the sentence you will get some points but it's 
[00:07:21] But it's harder to get n-gram matches for larger n unless you're using words the right way. There's one other trick in the BLEU measure: there's a penalty for too-short system translations, because otherwise you could leave out everything difficult, only translate the easy part of the sentence, and then, for the bits you have translated, get a high score for the precision of those pieces. Okay, so you'll use BLEU when you're developing your MT systems for assignment three. So now that we have an evaluation measure, we can start looking at how well systems do on a BLEU score. BLEU scores are theoretically between 0 and 100, but you're never going to get to 100, because of the variation in how you can translate things.
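Putting the pieces just described together, a simplified single-reference BLEU can be sketched as follows. This is an illustration rather than the exact formula from the paper: it uses clipped n-gram counts, a geometric mean of 1- to 4-gram precisions, and the brevity penalty for too-short outputs, but skips the smoothing and corpus-level aggregation a real implementation would have. The example sentence is made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    1..max_n-gram precisions, times a brevity penalty for short outputs."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        p = overlap / max(sum(cand.values()), 1)
        if p == 0:
            return 0.0   # (real BLEU smooths zero counts; we just bail out)
        log_precisions.append(math.log(p))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the airport security was tightened after the incident".split()
perfect = bleu(ref, ref)                                   # identical output
short = bleu("the airport security was tightened".split(), ref)  # correct but short
```

A perfect match scores 1.0, while the truncated candidate has perfect n-gram precisions but is pulled down by the brevity penalty, which is exactly the "leave out everything difficult" loophole the penalty closes.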
translations [00:08:28] start to get to the 20s uh translations uh you can sort of understand what the [00:08:30] uh you can sort of understand what the source document was about um once you [00:08:33] source document was about um once you get into the 30s and 40s the [00:08:35] get into the 30s and 40s the translations are getting much much [00:08:37] translations are getting much much better [00:08:39] better um yeah so um statistical phrase-based [00:08:42] um yeah so um statistical phrase-based translation was pioneered um by IBM in [00:08:47] translation was pioneered um by IBM in the late 90s actually and was sort of [00:08:49] the late 90s actually and was sort of redeveloped in the 2000's decade and was [00:08:52] redeveloped in the 2000's decade and was what Google launched as Google Translate [00:08:54] what Google launched as Google Translate in the 2000th decade and it continued to [00:08:57] in the 2000th decade and it continued to be worked on for sort of the following [00:09:00] be worked on for sort of the following decade but there was basically a strong [00:09:03] decade but there was basically a strong sense um that progress in Translation [00:09:07] sense um that progress in Translation had doing statistical phrase based [00:09:09] had doing statistical phrase based systems had basically stalled that it [00:09:12] systems had basically stalled that it got a little bit better each year as [00:09:15] got a little bit better each year as people could build traditional Ingram [00:09:17] people could build traditional Ingram language models with more data every [00:09:19] language models with more data every year and things like that but the [00:09:21] year and things like that but the numbers were barely going upwards um so [00:09:25] numbers were barely going upwards um so in the years from about 2005 [00:09:29] in the years from about 2005 to 15 or maybe 14 the dominant idea in [00:09:34] to 15 or maybe 14 the dominant idea in the machine 
[00:09:37] in the machine translation community was that the way we were going to get better machine translation was doing syntax-based machine translation: if we actually knew the structure of sentences, and we'd parsed them, then we'd know what the role of words was in sentences, and then we'd be able to translate much better. This was particularly invoked by looking at languages where translation worked terribly. In those days, translation worked sort of okay for languages like French to English or Spanish to English, which are reasonably similar European languages, but the results were way worse for Chinese to English or German to English; even though English is a Germanic language, German has a very different word order to English, with verbs commonly at the end of a clause and different elements being fronted. And so people tried to work on
[00:10:41] grammar-based, syntax-based methods of statistical machine translation, and I was one of those who worked on them in the late 2000s. But the truth is, it sort of didn't really work: if the rate of progress in syntax-based machine translation had slightly more slope than phrase-based machine translation over those years, the amount of slope wasn't very much. Things were then completely thrown on their head when neural machine translation got invented, because, as I explained, the first attempts were in 2014; the first cases in which it was evaluated in bake-off evaluations were in 2015, and in 2015 it wasn't as good as the best other machine translation methods, but by 2016 it was, and it was on this much, much steeper slope of getting way, way better. This graph only goes up to 2019,
[00:11:47] but it's continued to go up, and so it's not that uncommon these days that you see BLEU numbers in the 50s and 60s for neural machine translation systems. So that's a good news story. So after this I want to go on and introduce this idea of attention, which is now a very fundamental, important idea in neural systems. It's also interesting because it's actually something novel that was invented kind of recently. Everything that we've done in neural networks up until now really had all been invented before the turn of the millennium, right? So basic feed-forward neural networks, recurrent neural networks, LSTMs, other things that we haven't yet talked about like convolutional neural networks: they were all invented last millennium. It was really a waiting game at that point until there was sufficient data and computational power for them really to
[00:12:54] show how good they were. But attention was something that actually got invented in 2014, in the origins of neural machine translation, and it proved to be a very transformative idea for making neural networks more powerful. So what motivated attention was looking at exactly this kind of machine translation problem: we're running our LSTM over the source sentence, and then we're using its hidden state as the previous hidden state that we're feeding into the generator LSTM for the target sentence. And what that means is that everything useful about this sentence has to be stuffed into that one vector. Well, that's maybe not so hard if you've got a four-word sentence, but maybe you've got a 40-word sentence out here, and it seems kind of implausible that it'd be a good idea to try to fit everything about that sentence into this one hidden state. And well,
[00:13:56] obviously there are crude solutions to this: you make the hidden state bigger and then you've got more representational space; you use a multi-layer LSTM and you've got more representational space. But it still seems a very questionable thing to do, and it's certainly not like what a human being does, right? If a human being is translating a sentence, they read the sentence and they've got some idea of its meaning, but as they start to translate, they look back at the earlier parts of the sentence and make use of that in their translation. So this doesn't seem like a very plausible model. The idea, then, should be that our neural network should be able to attend to different things in the source, so that it can get information as needed, looking back in the sentence. And so this is the idea of attention: on each step of the decoder we're going to
[00:14:57] insert direct connections to the encoder, so we can look at particular words in the sentence. So I've got a bunch of diagrams that go through what we do, and then after that I'll present the equations that go along with this. Okay, so once we're starting to translate, we've got a hidden state at the start of our generator, and then we're going to use this hidden state as our key to look back into the encoder to try and find useful stuff. So we're going to compare, in a way I'll make precise later, the hidden state with the hidden state at every position in the source sentence, and based on our comparisons we're going to work out an attention score: where should we be looking in the source sentence while generating, here, the first word of the translation? And so, based on these attention scores, we'll stick them into a softmax, as we commonly do, and we'll
[00:16:08] then get a probability distribution, or a weighting, over the different positions in the sentence. And then we will use this weighting to compute a representation based on the encoder, which is going to be a weighted average of the encoder hidden states. So in this particular case it'll be nearly entirely the representation above the first word, 'il', which means 'he' in French. So then we'll take that attention output and we'll combine it with the hidden state of our decoder, and we'll use both of them together to generate an output vector, which we stick through our softmax to generate a word as the first word of the translation, y1. And so then at that point we just repeat this over. So we then go on to generating the second word: we copy down the first word generated, start to generate the second word, and we work out attention at every
[00:17:23] position... oh, sorry, there's a little note there, a little fine point which maybe I won't deal with, but it points out that sometimes you also do things like stick the previous time step's attention output into the next step as an extra input, and we actually do that in (it should say) assignment three; that's buggy there. So there are other ways to use things, but I'll gloss over that. So we generate another word and we repeat over, and at each time step we're looking at different words in the source, and they will help us to translate the sentence. [Student question about the green vector.] Okay, so the green vector, the hidden vector of the decoder, is going to be used together with the hidden states, the hidden vectors of the encoder, one at a time, to calculate the attention scores. So the
[00:18:43] attention score at a position is going to be a function of the hidden state of the encoder at that position and the current hidden state of the decoder, and I'll explain exactly how in a moment. Thank you. Any other questions? Okay, well, so here it is in math. So we have encoder hidden states, which we're going to call h, and we have decoder hidden states, which we're going to call s, so they're something different. And at each point we're going to be at some particular time step t, so we'll be dealing with s_t. So to calculate the attention scores for generating the word for time step t, we're going to calculate an attention score for each position in the encoder. Okay, I'll discuss alternatives for this in a moment, but the very easiest way to calculate an attention score, which is shown here, is to take a dot product between the hidden state of the encoder and the current hidden
[00:20:11] state of the decoder, and so that's what we're showing here. So that will give us some dot-product score, which is just any number at all. Then the next thing we do is we stick those e_t scores into our softmax, and that gives us our probability distribution as to how much weight to put on each position in the encoder. And so then we calculate the weighted average of the encoder hidden states, which we're just doing with the obvious equation: we're taking the weighted sum of the hidden states of the encoder based on the attention weights. And then what we want to do is concatenate our attention output and the hidden state of the decoder, which gives us a double-length vector, and then we're going to feed that into producing the next word from the decoder. So typically that means we're
[00:21:19] multiplying that vector by another matrix and then putting it through a softmax to get a probability distribution over words to output, and choosing the highest-probability word. Okay, that makes sense, I hope. Yeah. Okay, so attention is great; inventing this idea was completely transformative. So the very first modern neural machine translation system was done at Google in 2014, and they used a pure but very large, very deep LSTM: it was an eight-layer-deep LSTM with a very large hidden state for the time, and they were able to get good results. But very shortly thereafter, people at the University of Montreal, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, did a second version of machine translation using attention, and with a much more modest compute budget, of the kind that you can afford in universities, they were able to get better results, because attention was their
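The decoder step just walked through can be sketched in NumPy. This is my own toy illustration, not code from the lecture or its assignments; the function name `attention_step` and all shapes are invented for the example:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(enc_states, s_t, W_out):
    """One decoder step of dot-product attention (hypothetical toy version).

    enc_states: (n, d)  encoder hidden states h_1..h_n
    s_t:        (d,)    current decoder hidden state
    W_out:      (2d, V) output projection onto a vocabulary of size V
    """
    e = enc_states @ s_t                   # scores: e_i = s_t . h_i, one per position
    alpha = softmax(e)                     # attention distribution over positions
    a_t = alpha @ enc_states               # attention output: weighted average of the h_i
    combined = np.concatenate([a_t, s_t])  # double-length vector [a_t; s_t]
    p_vocab = softmax(combined @ W_out)    # distribution over output words
    return alpha, p_vocab

# Tiny example: 3 source positions, hidden size 4, vocabulary of 5 words.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
s = rng.normal(size=4)
W = rng.normal(size=(8, 5))
alpha, p_vocab = attention_step(H, s, W)
```

Both `alpha` and `p_vocab` sum to one; a real decoder would pick the next word from `p_vocab` (greedily or with beam search) and repeat.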
[00:22:43] secret thing. So attention significantly improved NMT performance, and essentially every neural machine translation system since has used attention, like we've just seen. You know, it's more human-like, as I was indicating, because it's sort of what a human would do: you'd look back in the sentence to see what you need to translate. And it solves this bottleneck problem: you now no longer have to stuff all the information about the source sentence into one hidden state; you can have the whole of your representational space from your entire encoding and use it as you need it. It also helps with the vanishing gradient problem; this is connected to what I was saying last time when talking about residual connections. One way out of the vanishing gradient problem is to directly connect things, and this provides shortcut connections to all of the
[00:23:39] hidden states of the encoder. And another nice thing that attention does is it gives you some interpretability: by looking at where the model is attending, you can basically see what it's translating at different time steps, and that can be really useful. And so it's kind of like we can see what we're translating where, without explicitly having trained a system that does that. So for my little toy sentence here, 'he hit me with a pie': at the first position it was looking at the first word, 'il', 'he', which it translates. Then in French there's this verb 'entarter', to sort of pie somebody; I guess in English as well you can use 'pie' as a verb, right? So the 'a' is a sort of perfect-past auxiliary, so it's sort of like 'he has me pied' is what the French words are, one at a time. And so the 'hit' is
[00:24:47] already looking at the 'entarté'; then the 'me' is attending to the 'm'', which means 'me'; and then all of 'with a pie' is attending still to 'entarté', which is basically the right kind of alignment that you want for the words of a sentence. So that's pretty cool too. Okay, so up until this point I just said, oh, we could do a dot product, but in general there's more to it than that. So what we have is some values h1 to hn, and we have a query vector, and we want to work out how to do attention based on these things. So attention always involves computing some attention scores, taking the softmax to get an attention distribution, and then getting an attention output. But the part where there's variation is: how do you compute these attention scores? A number of different ways have been done for that, and I just want to go through that a little bit.
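Since the scoring step is the only part that varies, the general recipe can be pictured as one function with a pluggable score function. A hypothetical NumPy sketch (the names `attention`, `score_fn`, and the toy inputs are my own, not from the lecture):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention(values, query, score_fn):
    """Generic attention: scores -> softmax -> weighted average of the values.

    Only score_fn differs between the variants (dot-product,
    multiplicative, additive) discussed in the lecture.
    """
    scores = np.array([score_fn(query, h) for h in values])
    alpha = softmax(scores)   # attention distribution
    return alpha @ values     # attention output

# Dot-product scoring is the simplest choice of score_fn.
def dot_score(q, h):
    return q @ h

# With orthogonal values and a query strongly aligned with the first one,
# almost all of the weight lands on that first value.
H = np.eye(3)
q = np.array([10.0, 0.0, 0.0])
out = attention(H, q, dot_score)
```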
[00:26:00] So the simplest way, which I just presented, is this dot-product attention: we just take the hidden states and dot product the whole of them. And that sort of works, but it doesn't actually work great, and I discussed this a bit when talking about LSTMs last time, right? The hidden state of an LSTM is its complete memory, so it has to variously store lots of things in that memory. It's got to be storing information that'll help it output the right word; it has to be storing information about the future, about other things that you'll want to say given the kind of sentence context, grammar, and previous words you've said. You've sort of got all kinds of memory, and so it makes sense that some of it would be useful for linking up, for looking back, and some of it would be less useful. You sort of want to find the parts that
[00:27:06] are related to what you want to say immediately, not all the parts that do all of the rest of the future. So that suggested maybe you could do a more general form of attention, and so Thang Luong and me in 2015 suggested maybe we could introduce what we called bilinear attention, which I still think is a better name, but the rest of the world came to call it multiplicative attention. What we're doing is, between these two vectors, we're sticking a matrix, and we're then learning the parameters of this matrix just like everything else in our neural network. And so effectively this matrix can learn which parts of the generator hidden state you should be using to look for things in the hidden states of the encoder. In particular, it no longer requires that things have to match up dimension by dimension. You know, it could be the case that the encoder is
[00:28:19] storing information about word meaning here, and the decoder is storing information about word meaning over here, and by learning appropriate parameters in this matrix we can sort of match those together and work out the right place to pay attention. So that seemed kind of a cool approach to us. Yeah? [Student: why not go all in and even build a little neural network that takes them as input and outputs a score?] You can do that; I was going to get to that on the next slide. Actually that's in a way sort of going backwards, but I will get to it on the next slide. Before I do that, though, I will show you these other versions. So the one thing you might wonder about doing it this way is that there are a lot of parameters that you have to learn in the matrix W. There aren't that many in my example, because there are only 36, but that's because my hidden states are only
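A minimal sketch of multiplicative (bilinear) scoring, with toy shapes of my own choosing. Note that the encoder and decoder hidden sizes no longer need to match, since the learned matrix W bridges them:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Toy sizes (mine, not the lecture's): 5 source positions; encoder and
# decoder hidden sizes differ, which plain dot-product scoring cannot handle.
rng = np.random.default_rng(1)
n, d_enc, d_dec = 5, 6, 4
H = rng.normal(size=(n, d_enc))      # encoder hidden states
s = rng.normal(size=d_dec)           # current decoder hidden state
W = rng.normal(size=(d_dec, d_enc))  # learned bilinear ("multiplicative") matrix

# Bilinear scores e_i = s^T W h_i, computed for all positions at once.
e = H @ W.T @ s
alpha = softmax(e)   # attention distribution over the 5 positions
a = alpha @ H        # attention output
```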
[00:29:25] that's because my hidden states are only of length six right and if your hidden [00:29:28] of length six right and if your hidden states are of length a thousand say then [00:29:31] states are of length a thousand say then you've got a million parameters in that [00:29:34] you've got a million parameters in that um W Matrix and that seems like it might [00:29:37] um W Matrix and that seems like it might be kind of problematic and so the way to [00:29:43] be kind of problematic and so the way to get beyond that which was fairly quickly [00:29:46] get beyond that which was fairly quickly suggested thereafter is well maybe [00:29:48] suggested thereafter is well maybe rather than having that whole big Matrix [00:29:50] rather than having that whole big Matrix in the middle instead what we could do [00:29:53] in the middle instead what we could do is form it as a low rank Matrix and the [00:29:57] is form it as a low rank Matrix and the easy way to make a low rank Matrix is [00:29:59] easy way to make a low rank Matrix is you take two skinny matrices like this [00:30:02] you take two skinny matrices like this where this is the rank of these of the [00:30:04] where this is the rank of these of the pieces and multiply them together which [00:30:07] pieces and multiply them together which would give us the big Matrix that I [00:30:09] would give us the big Matrix that I showed on the last slide and so this [00:30:12] showed on the last slide and so this gives you a low parameter um version of [00:30:16] gives you a low parameter um version of um the um bilinear attention Matrix from [00:30:19] um the um bilinear attention Matrix from the last slide but at that point if you [00:30:23] the last slide but at that point if you just do a teeny bit of linear algebra [00:30:27] just do a teeny bit of linear algebra this computation [00:30:28] this computation is exactly the same as saying well what [00:30:31] is exactly the same as saying well what I'm going to do is 
[00:30:33] take each of these two vectors and project them to a lower-dimensional space using this low-rank transformation matrix, and then take the dot product in this low-dimensional space. And on Thursday, when you get to Transformers, what you will see is that this is what Transformers do: they're taking the big vector and projecting it to a low-dimensional space, and then taking dot-product attention in that low-dimensional space. Okay, back to the question. Yeah, you're totally right, and at this point I'm going in a sort of ahistorical manner, because actually the first form of attention that was proposed, in the Bahdanau et al. paper, was: hey, let's just stick a little neural net there to calculate attention scores. So we take the s and the h, we multiply them both by a
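Stepping back to the low-rank point for a second, the "teeny bit of linear algebra" can be checked numerically. In this toy sketch (sizes are my own, echoing the hidden size of a thousand), a bilinear score through W = U^T V equals a dot product of the two projected vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 1000, 64              # hidden size 1000, low rank 64
s = rng.normal(size=d)       # decoder hidden state
h = rng.normal(size=d)       # encoder hidden state
U = rng.normal(size=(k, d))  # two "skinny" matrices: W = U^T V has
V = rng.normal(size=(k, d))  # 2*k*d = 128,000 parameters, not d*d = 1,000,000

# Score through the full (but low-rank) matrix W = U^T V ...
score_full = s @ (U.T @ V) @ h
# ... equals projecting both vectors down to k dimensions and
# taking the dot product in that low-dimensional space.
score_projected = (U @ s) @ (V @ h)
```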
That goes through a tanh; we multiply the result by a vector, and we get a number. This looks just like the kind of computations we've used everywhere else in an LSTM. So there's a little neural net that's calculating the attention scores, and then they go into a softmax, as usual. In most of the literature this is called additive attention, which also seems to be a really weird name; I think saying you've got a little neural net makes more sense for that one. But anyway, this is what they proposed and used, and at this point it's a little bit complex, to be honest. So when we wrote our paper the next year, we had found that the bilinear attention worked better for us, but there was subsequent work, especially a massive exploration of neural machine translation architectures.
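A hedged sketch of that additive (Bahdanau-style) scoring, with made-up names and sizes: multiply s and each encoder state by a matrix, add, take a tanh, dot with a vector to get one score per position, then softmax as usual.

```python
import numpy as np

def additive_attention_weights(s, H, W1, W2, v):
    """Additive attention: score_i = v . tanh(W1 s + W2 h_i), softmaxed over i."""
    pre = np.tanh(s @ W1.T + H @ W2.T)   # (T, k): a little neural net per position
    scores = pre @ v                     # (T,): one number per encoder state
    e = np.exp(scores - scores.max())    # softmax, as usual
    return e / e.sum()

rng = np.random.default_rng(1)
d, k, T = 8, 5, 4                        # illustrative sizes
weights = additive_attention_weights(
    rng.standard_normal(d),              # decoder state s
    rng.standard_normal((T, d)),         # encoder states H
    rng.standard_normal((k, d)),         # W1
    rng.standard_normal((k, d)),         # W2
    rng.standard_normal(k),              # v
)
assert np.isclose(weights.sum(), 1.0) and (weights > 0).all()
```

Compared with a projected dot product, each score here costs an extra matrix multiply plus a tanh, which is part of why it is slower.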
That work argued that actually, with the right kinds of good hyperparameter optimization, this is the best kind: this is better than the bilinear attention. But this is a lot more complex and a lot slower than doing what you're doing in the upper part of the chart, so regardless of whether it's better or not, in practice what's completely won is doing this, and this is what Transformers use, and just about all other neural nets that are used these days. [00:33:30] Okay, questions on attention will be found in assignment three, so I won't say much more about this now, and we'll see more of it just next lecture. But attention is a very general technique, right? It was a great way to improve machine translation, and that was how it was first invented, but you can stick it into all kinds of neural architectures, for all kinds of purposes.
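The general recipe, a set of values plus a query giving a softmax-weighted average, can be sketched as follows. This is an illustrative toy, not code from the course, and for simplicity the values here double as the keys they are scored by.

```python
import numpy as np

def attention(query, values):
    """Dot-product attention: softmax over scores gives a weighted average of values."""
    scores = values @ query              # (T,): relevance of each value to the query
    w = np.exp(scores - scores.max())    # softmax (shifted for numerical stability)
    w /= w.sum()
    return w @ values                    # (d,): weighted average of the values

rng = np.random.default_rng(2)
T, d = 5, 6
out = attention(rng.standard_normal(d), rng.standard_normal((T, d)))
assert out.shape == (d,)
```

In general you can score against separate key vectors instead of the values themselves; that query/key/value split is exactly what Transformers do.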
And the general finding was that it always improved results. So in general, anywhere where you have a vector of values and a query vector, you can use attention to get a weighted average of the values, which finds relevant information that you can use to improve your performance. [00:34:28] Maybe I won't even try to give examples of that now, but you'll see another example of attention immediately when we do things on Thursday, where we start doing self-attention inside Transformers. [00:34:46] Yes? "Did you also try a nonlinearity?" No, we did not. I mean, it didn't seem especially necessary, I don't know, but no, we did not. [00:35:07] Okay, well, this is the end of the part on attention. Are there any other questions? Yes? "For the RNN attention stuff, is there a need for positional information, or is that not required?"
[00:35:30] So there was none, and it seemed like it wasn't very required. I mean, you could make some argument that maybe position information might have been useful, but there's also a good argument that it wasn't necessary. The recent everywhere-usage of positional information only becomes necessary when you get to a Transformer, and the reason for that is, going back to the pictures: these encoder states are being calculated with respect to the previous encoder state, right, because it's a recurrent neural network, and therefore the representation here knows something about the past. So it kind of knows what position it's in, basically, and that's giving a lot of that information. Or another way to think about it is that this final representation will give a
certain overall sense of the semantics of the sentence, and so, to the extent that you're looking backwards, the more associative matching of similar semantic content that's needed seems sufficient: you don't really need additional positional information. [00:37:02] Okay, I will go on. So that's the neural networks content for today, and for the remaining 39 minutes I want to talk about final projects, but also a bit about data, experiments, and things like that. [00:37:24] Okay, so this is a reminder on the class. We've got the four assignments, which are 48%, and then the big other part of what you need to do is the final project, which is 49%, almost completing things out except for the participation. And let me just give one note back on collaboration and the honor code. For final projects, it's quite usual that
people use all sorts of stuff that was written by other people. That's completely fine, we don't expect you to implement everything from scratch, but you must document what you're using: give references or URLs if you're using other people's code rather than writing your own. We do want to know what code you wrote yourself and what things you downloaded from PyPI. And in particular, in thinking about final projects, the question of interest for us is: what value-add did you provide? So you haven't done something great if you've downloaded a really good neural network and run it on some data and it produces really good results; that's not much value-add. If you want to have value-add in that context, you at least want to be doing something interesting about understanding why it works so well, what kind of examples it doesn't work well on, doing some thorough experimental
analysis. [00:39:00] Yeah, a couple of other points there. Okay, so for the final project for this class, there's a binary choice: you can either do our default final project, which I'll talk about more a bit later, or you can come up with your own final project, and I'll talk about that a bit too. We allow team sizes of one to three. The complicated thing that comes up... oh, actually, sorry, I should say the other point first. We generally encourage people to form teams: it means that you can do something more interesting, it's more motivational, you can make friends, whatever, so teams are good. On expectations for teams: our expectation is that a bigger team should be able to do proportionately more work, and so when we're grading things we expect to see more work from larger teams. Now, how this works out is kind
of, I will admit, a little bit complicated, because there's sort of a quality issue that's separate from the amount of work. The reality is that it's just always the case that several of the very best projects are one-person efforts, because they're just somebody who has a good idea and knows what they want to do and does it by themselves, and it is great. But there are also great multi-person projects as well. The point I'm making is: it kind of doesn't work if you're a one-person project and you attempt a huge amount of stuff and you can only get one-third of the way through it. That's not a good recipe for doing well in the final project. For any project, you really need to be completing something and showing something. But nevertheless, if you're one person and you can show
something kind of interesting, even if our reaction is "oh, this would have been much better if they'd shown it was better than this other kind of model" or "it would have been really nice if they'd run ablations to work things out", well, if you're one person we'll give you a bye and say, oh, but it was only one person. Whereas if you're a three-person team, and it seems like you obviously should have compared it to some other models and you obviously could have run it on some other datasets, then we'll feel like, well, as a three-person team they obviously should have done that, and therefore we should give them a less good score. That's how that is worked out. [00:41:57] The complication comes with other things people are doing at the same time. We allow people to do final projects that are shared with multiple classes, but the expectation is again that
you'll do more work. So if there are two of you who are using one project for both this class and CS231N, say, then it's sort of like it's a four-person project and you should be doing a lot of work for it. There are other cases: sometimes people have RAships, or they're PhD rotation students, or other things. If you're doing it for other things, we'd like you to tell us, and we expect you to be doing more work for it. [00:42:44] Okay. I'm very happy to talk to people about final projects, and I have been talking to people about final projects, but unfortunately there's only one of me, so I definitely can't talk to 500 people about final projects. So I do also encourage you to talk to all of the TAs about final projects. On the office hours page, under all of the TAs, there's some information about things that they know about, so if you know what your project is
about, you could at least try and find one of the most useful TAs, or just find a TA with a friendly face; whatever mechanism you use, talk to TAs about final projects. [00:43:25] Yeah, so the default final project. What it's going to be is this: BERT was a famous early Transformer, and we're going to be building and experimenting with a minimal BERT implementation. So if you do this, there's part of an implementation of BERT, and you're meant to finish it off, fine-tune it, and get some results on data for doing sentiment analysis. And then, basically, we want even the default final project to be an open-ended project where people can do different things, and so then there's lots of other ideas, or you can come up with your own, of ways you could extend this system and make it better, which might be with paraphrasing, contrastive learning, low-rank adaptation, something; you can do something, and that is your final project. [00:44:27] So why choose the default final project? If you haven't had much experience with research, you don't have any real idea of what you want to do for a final project, or you'd like something with clear guidance and a goal and a leaderboard (we provide a leaderboard for people doing the default final project, showing how good your performance is on the tasks we provide), then you can do the default final project. And honestly, I think for many people the best option is to do the default final project. From past performance, typically about half the students do the default final project, including some people who start off thinking "I'll do a custom final project" and then after a couple of weeks decide, huh, this makes no sense, what I was suggesting, it's not working
at all, I'm just going to abandon it and flip to the default final project. [00:45:27] Okay, but we also allow custom final projects, and there are good reasons to do custom final projects. If you have some topic or research idea that you're excited about (maybe you're even already working on it), or you want to try something different on your own, or you'd just like to have more of the experience of trying to come up with a research goal, finding the necessary data and tools, and starting from scratch, which is actually very educational if considerably harder, well then the custom final project is fine for you. [00:46:05] A restriction on topics: I think we'd already sort of signaled this on Ed. We insist for CS224N final projects that they have to substantively involve both human language and neural networks, because this is the NLP class, so we'd like people to know and learn something
about human language. I'm totally aware of the fact that you can use these same models for bioinformatics sequences, or music, or radar, whatever, but we'd like you to do something with human language for this class. That doesn't mean it has to be only about human language: people have done things like visual language models, or music and language, so it can have a combination of modalities, but it has to substantively, not completely trivially, involve human language. If you've got any questions about that, ask. And it also has to substantively involve neural networks, though again it doesn't have to be wholly about neural networks. If you've got some idea, thinking "oh, I think I could show using kernel machines that they work just as well as having multi-layer neural networks" or something like that, that's of course fine to do as
as well um [00:47:31] well um gamesmanship um yeah the default final [00:47:34] gamesmanship um yeah the default final project is more guided but it's not [00:47:37] project is more guided but it's not meant to be a complete Slackers ride [00:47:40] meant to be a complete Slackers ride we're um hoping that people do the same [00:47:42] we're um hoping that people do the same amount of work for either kind of [00:47:44] amount of work for either kind of project but on the other hand it does [00:47:46] project but on the other hand it does kind of give you sort of a clearer focus [00:47:49] kind of give you sort of a clearer focus and course of things to do but it is [00:47:52] and course of things to do but it is still an open-ended project um so you [00:47:57] still an open-ended project um so you know for both um default final projects [00:48:00] know for both um default final projects and custom final projects there are [00:48:02] and custom final projects there are great projects and there are not so [00:48:05] great projects and there are not so great projects um you know if anything [00:48:08] great projects um you know if anything there's a bit more variance in the [00:48:10] there's a bit more variance in the custom final project so you know the [00:48:12] custom final project so you know the path of success is not to do some try [00:48:16] path of success is not to do some try and do something for a custom final [00:48:18] and do something for a custom final project um that just looks really weak [00:48:21] project um that just looks really weak compared to people's default final [00:48:24] compared to people's default final projects um okay um you can get good [00:48:28] projects um okay um you can get good grades either way um we get give best [00:48:31] grades either way um we get give best project Awards um to both kinds of [00:48:34] project Awards um to both kinds of projects so yeah it's really not that [00:48:36] projects so yeah it's really not 
that there's some secret one you have to pick. [00:48:38] Computing. Yeah, so, to be honest, with the confessions right at the beginning: we're actually in a less good position for computing than we've been in recent years, and it's all OpenAI's fault... no, that's part of it. Up until and including last year, we had invariably managed to get very generous cloud computing giveaways from one or other cloud computing provider, which really provided a lot of computing support. But there's the great GPU shortage on at the moment, due to the great success of large language models, and it turns out that cloud compute providers just aren't being as generous as they used to be. And gee, I guess the AWS rep was pointing out that my course was their single largest grant of free GPUs last year, so it's getting harder to do. So really
people will have to patch things together more in many cases, and so we'll be relying on the ingenuity of students to be able to find free and cheap stuff. [00:50:01] So, Google is giving $50 of credit per person on GCP, which can be used for assignments 3 and 4 and the final project. On all the clouds, if you haven't used a cloud with an account before, you can usually get some free starter credits, which can be a useful thing. Then there are the Jupyter notebooks in the cloud. The most used one is Google Colab, which allows limited GPU use; it often tends to get tighter later in the quarter, so you might find it a good investment to not have a couple of lattes and pay 10 bucks a month to get Colab Pro, which gives you much better access to GPUs. But there are alternatives to that which you might also want to look at. So AWS provides a
notebook environment, SageMaker Studio Lab. [00:51:01] And Kaggle, also owned by Google, separately provides Kaggle notebooks, which actually commonly give you better GPU access than Google Colab provides, even though they're otherwise not as nice: Kaggle notebooks are just bare-bones Jupyter notebooks, whereas Colab has some fancier UI stuff grafted on. [00:51:26] Other possibilities: Modal is a low-priced GPU provider and allows a certain amount of free GPU usage a month, so that could be handy, and there are other lower-cost GPU providers, like Vast AI, which could be of relevance. [00:51:47] And then the other thing that I'll say more about in a minute is that, with the way things have changed with large language models, there are lots of projects you might want to do where you're not actually building models at all yourself, but you're wanting to
do experiments on [00:52:04] large language models, or you're wanting to do in-context learning with large language models, or other things of that sort. And then what you want is to have access to large language models, and in particular you probably want API access so you can automate things. [00:52:25] So another thing that we have been able to get, through the generosity of Together AI, is that Together AI is providing $50 of API access to large language models, which can actually be a lot. How much of a lot it is depends on how big a model you're using, [00:52:45] so something you should think about is how big a model you really need to use to show something. Because if you can run a 7 billion parameter language model on Together, you can put a huge number of tokens through it for 50 bucks, whereas if you want to run a much bigger model, then [00:53:03]
the number of tokens you can get [00:53:05] through it goes down by orders of magnitude. So that's good, and I mentioned some other ones; we've already put a whole bunch of documents up on Ed that talk about these different GPU options, so do look at those. [00:53:23] Okay, jumping ahead. The first thing you have to do is a project proposal. It's one per team, so I guess the first step is to work out who your team is. Part of the project proposal is giving us the details of your project, but there's another major part of it, which is writing a review of a key research paper for your topic. [00:53:53] For the default final project we provide some suggestions, but you can find something else if you've got another idea for how to extend the project; for your custom project you're finding your own. But what we want you to do is get some practice
at looking at a research paper: [00:54:09] understanding what it's doing, understanding what's convincing, what it didn't consider, what it failed to do. So we want you to write a two-page summary of a research paper, and the goal is for you to be thinking critically about this research paper: [00:54:28] what did it do that was exciting, versus what did it claim was exciting but was really obvious or perhaps even wrong, etc. [00:54:42] Okay, and then after that we want you to say what you're planning to do. That may be very straightforward for a default final project, but it's really important for a custom final project. [00:54:58] In particular, tell us about the literature you're going to use, if any, and the kind of models you're going to explore. But it turns out that when we're unhappy with custom final projects, the
two commonest complaints about [00:55:17] what you tell us about custom final projects are, first, that you don't make clear what data you're going to use (we're worried already if you haven't worked out, by the project proposal deadline, what data you can use for your final project), and second, that you don't tell us how you're going to evaluate your system: we want to know how you're going to measure whether you're getting any success. [00:55:44] As a new thing this year, we'd like you to include an ethical considerations paragraph outlining potential ethical challenges of your work if it were deployed in the real world, and how those might be mitigated. [00:55:57] This is something that a lot of conferences are now requiring, and a lot of grants are requiring, so we want to give you a little bit of practice on that by having you write a paragraph. [00:56:09] How much there is to talk about varies
somewhat on what you're trying to do, and [00:56:14] whether it has a lot of ethical problems or whether it's a fairly straightforward question answering system, but in all cases you might think about what the possible ethical considerations of this piece of work are. [00:56:27] Okay. The whole thing is maximum four pages. [00:56:31] Okay, so for the research paper summary, do think critically. I mean, the worst summaries are essentially people just paraphrasing what's in the abstract and introduction of the paper, and we want you to think a bit harder about this: [00:56:59] what were the novel contributions of the paper? Is it something that you could use for different kinds of problems in different ways, or was it really exploiting a trick of one data set? [00:57:12] Are there things that it seemed like they missed, or could have done differently, or that you weren't convinced were
done properly? [00:57:22] Is it similar or distinctive to other papers that are dealing with the same topic? Does it suggest perhaps something that you could try that extends beyond the paper? [00:57:32] Okay, and for grading these final project proposals, most of the points are on that paper review, so do pay attention to it. There are some points on the project plan, but really we're mainly wanting to give you formative feedback on the project plan, and comments as to how we think it's realistic or unrealistic. [00:57:59] But nevertheless we're expecting you to have an idea, to have thought through how you can investigate it, and thought through how you can evaluate it: data sets, baselines, things like that. [00:58:13] Oh yeah, I should emphasize this: do have an appropriate baseline. For anything that you're doing, you should have something you can compare it against. [00:58:24] So sometimes it's a
previous system that [00:58:26] did exactly the same thing, but if you're doing something more novel and interesting, you should be thinking of some seat-of-the-pants obvious way to do things, and proving that you can do better than it. What that is depends a lot on what your project is, but if you're building some complex new net that's going to be used to work out textual similarity between two pieces of text, [00:58:54] well, a simple way of working out textual similarity between two pieces of text is to look up the word vectors for every word in each text, average them together, and work out the dot product between those average vectors. [00:59:06] And unless your complex neural network is significantly better than that, it doesn't seem like it's a very good system. So you should always attempt to have some baselines.
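That word-vector averaging baseline can be sketched in a few lines. This is a minimal illustrative sketch, not course starter code: the four-dimensional vectors below are made up for the example, and in practice you would load pretrained vectors such as GloVe.

```python
import numpy as np

# Toy word-vector table standing in for real pretrained vectors
# (the 4-d vectors are invented for this sketch; real ones are e.g. 100-300d GloVe).
VECTORS = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.1, 0.0, 0.3]),
    "dog": np.array([0.8, 0.2, 0.1, 0.3]),
    "sat": np.array([0.0, 0.7, 0.1, 0.0]),
    "ran": np.array([0.1, 0.8, 0.2, 0.0]),
}

def avg_vector(text):
    """Look up the vector for every known word in the text and average them."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    return np.mean(vecs, axis=0)

def similarity(text_a, text_b):
    """Dot product of the averaged vectors, normalized (i.e. cosine similarity)."""
    a, b = avg_vector(text_a), avg_vector(text_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar sentences should score higher than dissimilar ones:
print(similarity("the cat sat", "the dog ran"))
print(similarity("the cat sat", "sat"))
```

Normalizing the dot product into a cosine, as here, just removes the effect of vector length; either version works as the sanity-check baseline described in the lecture.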
[00:59:23] After the project proposal we also have a project milestone, stuck in the middle, to make sure everybody is making some progress. This is just to help make sure people do get through things and keep working, so we'll have good final projects. [00:59:38] For most final projects (I'll say more about this in a minute), the crucial thing we expect for the milestone is that you've got set up and you can run something. It might just be your baseline of looking up the word vectors, but it means you've got the data and the framework and something that you can run and produce a number from. [01:00:01] And then there's the final project. We have people submit their code for the final projects, but final projects are evaluated almost entirely, unless there are some major worries or concerns, based on your project report. [01:00:20] So make sure you put time into the project report, which is essentially
a research paper, like a [01:00:31] conference paper. They can be up to eight pages, and it varies with what you're doing, but this is typically the kind of picture of what papers will look like: it'll have an abstract and introduction, it'll talk about related work, it'll present the model you're using, the data you're using, and your experiments and their results, and it'll have some insightful comments in its analysis and conclusion at the end. [01:00:57] Okay, finding research topics. For custom projects there are all kinds of things you can do. Basic philosophy of science: you're normally either starting off with "here's some problem I want to make some progress on", or "here's this cool idea for a theoretical technique, or a change in something, and I want to show it's better than other ways of doing it", and you're working from that. [01:01:26] We allow different kinds of projects. One common type of project
is [01:01:32] that you've got some task of interest and you're going to try and solve it, or make progress on it somehow: say you want to get information out of State Department documents, and you're going to see how well you can do it with neural NLP. [01:01:48] A second kind is that you've got some ideas for doing something different with neural networks, and then you're going to see how well it works. [01:02:00] Or maybe, given that there are large language models these days, you're going to see how, using large language models, you can do something interesting by in-context learning or by building a larger language model program. [01:02:12] So nearly all 224n projects are in those first three types, where at the end of the day you've got some kind of system and you've got some kind of data and you're going to evaluate it. But that's not a 100% requirement; there are different kinds of projects
you can do, and a few people do. [01:02:37] So you can do an analysis or interpretability project. You could be interested in something like: how could these Transformer models possibly understand what I say to them and give the right answers to my statements? Let me try and look inside the neural networks and see what they're computing. [01:03:00] Recently there's been a lot of work on this topic, often under titles like mechanistic interpretability, circuits, and things like that. So you can do some kind of analysis or interpretability project, or you could even do it just looking at the behavior of models on some task. [01:03:20] So you could take some linguistic task, like metaphor interpretation, and see which neural networks can interpret metaphors correctly and which can't, or which kinds they can interpret correctly or not, and do things like that. [01:03:38] Another kind is a
theoretical project. [01:03:42] Occasionally people have done things looking at the behavior of, well, what's a good example, somewhere that's in the math. So an example that was actually done a few years ago and turned into a conference paper was looking, in the estimation of word vectors, at the stability of the word vectors that were computed by different algorithms, word2vec versus GloVe, [01:04:16] and deriving results with proofs about the stability of the vectors that were calculated. So that's allowed, but we don't see many of those. [01:04:32] Here, very quickly, are sort of just some random things. A lot of past projects you can find on the 224n web page; you can just find different past years' reports, and you can look at them to get ideas as you wish. [01:04:49] So Deep Poetry was a gated LSTM, where the idea was that as well as being a language
model that generated a succession [01:05:01] of words, it had extra stuff in it to make it rhyme in a poetry-like pattern. That was kind of fun. [01:05:08] You can do a reimplementation of a paper that was done previously. This is actually kind of an old one, but I remember it well: back in the days before Transformers, DeepMind did these kinds of interesting papers on neural Turing machines and differentiable neural computers, but they didn't release implementations of them. [01:05:34] And so Carol set about writing her own implementation of a differentiable neural computer, which in a way was a little bit crazy, and a few days before the deadline she still hadn't gotten it working, so it could have been a complete disaster. But she did get it working before the deadline, and got it to run, producing some interesting results. So that was kind of cool. [01:05:59] So if it's
something interesting, it [01:06:02] doesn't have to be original; it can be a reimplementation of something interesting. [01:06:08] Okay. Sometimes such papers do get published later as interesting ones. This was a paper that was again from the early days, and it was fairly simple, but it was a novel thing that gave progress. The way we've presented these RNNs, you have word vectors at the bottom and then you compute the softmax at the top. [01:06:31] But if you think about multiplying by the output matrix and then putting that into the softmax, that output matrix is also like a set of word vectors, because you have a column for each word, you get a score for each output word, and then you're putting a softmax over that. [01:06:52] And so their idea was, well, maybe you could share those two sets of vectors, and you'd be able to get improvements from that.
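The vector-sharing idea described here (tying the input embedding matrix to the output softmax matrix) can be shown with plain numpy. This is a toy illustration with made-up dimensions and a stand-in "RNN" step, not the model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4

# One matrix E serves both roles: row lookup gives the input word vector,
# and the same matrix projects the hidden state to per-word output scores.
E = rng.normal(size=(vocab_size, d))

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(word_id, hidden_weights):
    h = np.tanh(hidden_weights @ E[word_id])  # toy hidden-state update from the input embedding
    logits = E @ h                            # tied output layer: scores against the same word vectors
    return softmax(logits)

W = rng.normal(size=(d, d))                   # toy recurrent/hidden weights
p = next_word_probs(3, W)
print(p.shape, p.sum())                       # one probability per vocabulary word
```

Because one matrix serves both roles, the model has half the embedding parameters of the untied version, and each output score is a dot product against the same word vector used at the input.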
[01:07:06] Okay, maybe I won't talk about that one. Sometimes people have worked on quantized models. That's more of a general neural network technique, but providing you show you can do useful things with it, like getting good language modeling results even with quantized vectors, we'll count that as using language. [01:07:28] So in recent times (these last two are from 2024), a lot of the time people are doing projects with pre-trained large language models, which we will be talking about in the next three lectures, and then doing things with them. So you can use lightweight parameter-efficient fine-tuning methods, you can use in-context learning methods, and things like this, [01:07:52] and I suspect that probably quite a few of you will do projects of this kind. So here's an example: lots of work has been done
on [01:08:05] producing code language models, and so these people decided to improve the generation of Fortran (maybe they're physicists, I don't know). [01:08:22] And they were able to show that they could use parameter-efficient fine-tuning to improve Code Llama for producing Fortran. Where's the natural language? Code has natural language comments in it, and the comments can be useful for explaining what you want the code to do, and so it was effectively doing translation from a human language explanation of what the code was meant to do into pieces of code. [01:08:59] Here was another one, which was doing AI-driven fashion cataloging, transforming images into textual descriptions; that again was starting off with an existing visual language model and looking at how to fine-tune it.
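Parameter-efficient fine-tuning of the kind used in projects like the Code Llama one can be illustrated with a LoRA-style low-rank update. This is a hedged sketch, not that project's code: the dimensions, rank, and scaling factor below are illustrative, and the point is only that the big pretrained weight stays frozen while two small factors are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))        # pretrained weight: kept frozen
A = rng.normal(size=(rank, d_in)) * 0.01  # small trainable factor
B = np.zeros((d_out, rank))               # starts at zero, so the adapter begins as a no-op
alpha = 16.0

def forward(x):
    # Effective weight is W + (alpha/rank) * B @ A, but we never materialize it:
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(forward(x), W @ x))     # True before any training, since B is zero

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tune {full} ({lora / full:.1%})")
```

Training then updates only A and B, so the number of trainable parameters is a few percent of a full fine-tune, which is what makes adapting a large code model feasible on modest hardware or a credit budget.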
can get kind of [01:09:25] stuff um so you know you can get kind of lots of ideas of areas and things people [01:09:27] lots of ideas of areas and things people do by looking at past papers they you're [01:09:30] do by looking at past papers they you're also welcome to have your own original [01:09:32] also welcome to have your own original ideas thinking about anything you know [01:09:34] ideas thinking about anything you know or work on in the world so for NLP [01:09:37] or work on in the world so for NLP papers there's a site called the ACL [01:09:39] papers there's a site called the ACL Anthology that's good for them for there [01:09:42] Anthology that's good for them for there are lots of papers on language that also [01:09:45] are lots of papers on language that also appear in machine learning conferences [01:09:47] appear in machine learning conferences so you can look at the neurs or I clear [01:09:50] so you can look at the neurs or I clear proceedings you can look at past 2 24n [01:09:53] proceedings you can look at past 2 24n projects and then um the archive [01:09:56] projects and then um the archive pre-print servers got tons of papers on [01:09:59] pre-print servers got tons of papers on everything including NLP and you can [01:10:01] everything including NLP and you can look there but I do actually think it's [01:10:04] look there but I do actually think it's you know some of the funnest best [01:10:06] you know some of the funnest best projects are actually people that find [01:10:07] projects are actually people that find their own problem which is an [01:10:10] their own problem which is an interesting problem in their world you [01:10:12] interesting problem in their world you know if there's anything about a cool [01:10:14] know if there's anything about a cool website that has text on it and you [01:10:17] website that has text on it and you think you could kind of get information [01:10:18] think you could kind of get information out of 
automatically by using a language [01:10:21] out of automatically by using a language model or something there's probably [01:10:22] model or something there's probably something interesting and different you [01:10:24] something interesting and different you can do there um another place to look is [01:10:27] can do there um another place to look is that there are various leaderboards for [01:10:29] that there are various leaderboards for the state-ofthe-art on different [01:10:31] the state-ofthe-art on different problems and you can start looking [01:10:33] problems and you can start looking through leaderboards for stuff and see [01:10:36] through leaderboards for stuff and see what you find there um but you know on [01:10:39] what you find there um but you know on the other hand the disadvantage of [01:10:41] the other hand the disadvantage of looking at things like leaderboards and [01:10:43] looking at things like leaderboards and past conferences is you sort of uh tend [01:10:46] past conferences is you sort of uh tend to be trying to do a bit better on a [01:10:49] to be trying to do a bit better on a problem someone else has done and that's [01:10:51] problem someone else has done and that's part of um why you know really often in [01:10:54] part of um why you know really often in research it's a clever thing to think of [01:10:57] research it's a clever thing to think of something different perhaps not too far [01:10:59] something different perhaps not too far from things that other people have done [01:11:01] from things that other people have done but somehow different so you'll be able [01:11:04] but somehow different so you'll be able to do something a bit more original and [01:11:06] to do something a bit more original and different for what you're doing [01:11:09] different for what you're doing um yeah I mean I do just want to go [01:11:13] um yeah I mean I do just want to go through this a bit quickly um [01:11:17] through this a bit quickly um 
that you know um for sort of decade that [01:11:22] that you know um for sort of decade that I've been doing natural language [01:11:24] I've been doing natural language processing with deep learning there's [01:11:26] processing with deep learning there's sort of been a sea change in what's [01:11:30] sort of been a sea change in what's possible um so in the early days of the [01:11:32] possible um so in the early days of the deep learning um Revival you know most [01:11:37] deep learning um Revival you know most of the work in people's papers were [01:11:39] of the work in people's papers were trying to find better deep learning [01:11:42] trying to find better deep learning architectures so that would be here is [01:11:44] architectures so that would be here is some question answering system I've got [01:11:47] some question answering system I've got an idea of how I could add attention in [01:11:49] an idea of how I could add attention in some new place or I could add a new um [01:11:53] some new place or I could add a new um layer into the new network and the the [01:11:55] layer into the new network and the the numbers will go up um and um there were [01:11:58] numbers will go up um and um there were lots of papers like that and it was a [01:12:00] lots of papers like that and it was a lot of fun and that's what a lot of good [01:12:04] lot of fun and that's what a lot of good CS2 224n projects did too and people [01:12:08] CS2 224n projects did too and people were often able to build systems from [01:12:10] were often able to build systems from scratch that were close to the [01:12:12] scratch that were close to the state-of-the-art um but you know in the [01:12:15] state-of-the-art um but you know in the last five years your chances of doing [01:12:18] last five years your chances of doing this have been become pretty slim [01:12:22] this have been become pretty slim frankly um you know you can if you [01:12:24] frankly um you know you can if you really got 
a good idea and it's [01:12:26] really got a good idea and it's something different than original by all [01:12:28] something different than original by all means but it's kind of hard so most work [01:12:33] means but it's kind of hard so most work these days even for people who are [01:12:35] these days even for people who are professional [01:12:36] professional researchers that you know they're making [01:12:39] researchers that you know they're making use of existing large pre-trained models [01:12:44] use of existing large pre-trained models in some way and then once you're doing [01:12:47] in some way and then once you're doing that that actually sort of fixes a lot [01:12:49] that that actually sort of fixes a lot of your architectural choices because [01:12:52] of your architectural choices because your large pre- chain neural network has [01:12:54] your large pre- chain neural network has a certain AR architecture and you kind [01:12:56] a certain AR architecture and you kind of have to live with it you know you [01:12:58] of have to live with it you know you might be able to do interesting things [01:13:00] might be able to do interesting things by adapting it with something like low [01:13:02] by adapting it with something like low rank adaptation around the side or [01:13:04] rank adaptation around the side or something but nevertheless there's sort [01:13:06] something but nevertheless there's sort of constraints on what you want to do so [01:13:09] of constraints on what you want to do so you know for just about any practical [01:13:12] you know for just about any practical project like you've got some data set [01:13:14] project like you've got some data set and you want to understand it and get [01:13:17] and you want to understand it and get facts out of it or something like that [01:13:19] facts out of it or something like that essentially the only sensible choice is [01:13:22] essentially the only sensible choice is to say I am going to use 
hugging face um [01:13:25] to say I am going to use hugging face um Transformers um which we have a tutorial [01:13:28] Transformers um which we have a tutorial on coming up ahead and I will load some [01:13:31] on coming up ahead and I will load some pre-trained model and I will be running [01:13:34] pre-trained model and I will be running it over the text and then I'll be [01:13:35] it over the text and then I'll be working out some other stuff I can do on [01:13:38] working out some other stuff I can do on a top and around that so you know [01:13:40] a top and around that so you know building your own architecture is really [01:13:42] building your own architecture is really only a sensible choice if you can do [01:13:45] only a sensible choice if you can do something in the small which is more a [01:13:48] something in the small which is more a sort of exploring architectures project [01:13:51] sort of exploring architectures project if you've kind of got an idea of hey [01:13:54] if you've kind of got an idea of hey I've got an idea for different [01:13:55] I've got an idea for different nonlinearity that I think will work [01:13:57] nonlinearity that I think will work better than using a re let me [01:13:59] better than using a re let me investigate kind of thing because then [01:14:01] investigate kind of thing because then you can do small [01:14:03] you can do small experiments um yeah maybe I won't read [01:14:07] experiments um yeah maybe I won't read out all of this list but um there are [01:14:10] out all of this list but um there are lists of sort of some of the ideas of [01:14:12] lists of sort of some of the ideas of what's more interesting um now um but [01:14:16] what's more interesting um now um but you know do be cognizant of the world [01:14:20] you know do be cognizant of the world we're in in terms of scale I mean one of [01:14:24] we're in in terms of scale I mean one of the problems we now have is that people [01:14:28] the problems we now 
have is that people have seen the latest paper being pushed by DeepMind or whoever, doing some cool graph-structured reasoning search, and they turn up and say, I want to do this for my project. [01:14:42] But a lot of the time, if you read further into the paper, you'll find that they were doing it on 32 A100s for a month, and that's not the scale of computer that you're going to have available to you in almost all circumstances. Maybe there are one or two industry students who can do that; if so, go for it. But for the vast majority of people, not likely. [01:15:10] So you do have to do something that is practical. That practicality holds for the vast majority of people in the world, and if you look around in blogs and so on, you find lots of people doing stuff in lightweight ways and describing how to do it. That's why methods like parameter-efficient fine-tuning are really popular: you can do them in lightweight ways.

[01:15:36] A question related to that, and I'll end on this. I just want to mention again that you're welcome to use GPT-4 or Gemini Pro or Claude Opus or any of these models in your project, but it then has to be API usage; you can't possibly train your own big models. Even for the models that are available open source, you can't load the big ones into the kind of GPUs you have. You can probably load a Llama 7B model, but you can't just load a Llama 70B model into your GPU. So you have to be realistic about that size. [01:16:32] But there are actually now lots of interesting things you can do with API access, like in-context learning and prompting and exploring that, or building larger language model programs around these language model components, and you're certainly encouraged to do that. [01:16:52] There are lots of other things you can do, such as analysis projects, which look at: are these models still sexist and racist, do they have a good understanding of analogies, can they interpret love letters, or whatever your topic of interest is. Lots of things you can do, and that's totally allowed. [01:17:11] But again, remember that we'll be evaluating you on what interesting stuff you did. So your project shouldn't be: I ran this stuff through GPT-4 and it produced great summaries of the documents, I am done. The question is, what did you do in addition to that to have an interesting research project? Okay, I'll stop there. Thanks a lot.

================================================================================ LECTURE 008 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
Source: https://www.youtube.com/watch?v=LWMzyfvuehA
---
Transcript

[00:00:04] Hi everyone, welcome to CS224N. We're about two minutes in, so let's get started. [00:00:12] Today we've got what I think is quite an exciting lecture topic: we're going to talk about self-attention and Transformers. These are ideas that are the foundation of most of the modern advances in natural language processing, and actually of AI systems in a broad range of fields, so it's a very fun topic. [00:00:37] Before we get into that, a couple of reminders. There are brand-new lecture notes, and I'm very excited about them; they pretty much follow along with what I'll be talking about today, but go into considerably more detail. Assignment four is due a week from today.
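Since the headline topic announced above is self-attention, here is a minimal preview sketch of single-head scaled dot-product self-attention. This is a hedged illustration, not code from the course; the use of NumPy and all names and shapes are my own assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) pairwise scores
    A = softmax(scores, axis=-1)             # each row: weights over all positions
    return A @ V                             # weighted average of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 toy word vectors of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Note that every position attends to every other position in a single matrix multiply, with no step-by-step walk along the sequence; that contrast with recurrence is what the lecture develops.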
[00:01:11] four is due a week from today um yeah so the issues with Azure [00:01:13] um yeah so the issues with Azure continue [00:01:15] continue um thankfully thankfully [00:01:18] um thankfully thankfully um [00:01:19] um our uh uh Tas especially has tested that [00:01:23] our uh uh Tas especially has tested that this works on collab and the amount of [00:01:25] this works on collab and the amount of training is such that you know uh you [00:01:28] training is such that you know uh you know a collab session will allow you to [00:01:29] know a collab session will allow you to train uh your machine translation system [00:01:32] train uh your machine translation system so if you don't have a GPU use collab [00:01:34] so if you don't have a GPU use collab we're continuing to work on getting [00:01:36] we're continuing to work on getting access to more gpus for uh assignment [00:01:40] access to more gpus for uh assignment five in the final project uh we'll [00:01:42] five in the final project uh we'll continue to update you as we're able to [00:01:44] continue to update you as we're able to um but our you know are the usual [00:01:46] um but our you know are the usual systems this year uh are no longer [00:01:49] systems this year uh are no longer holding because companies are changing [00:01:51] holding because companies are changing their minds about things okay [00:01:53] their minds about things okay um so our final project proposal uh you [00:01:55] um so our final project proposal uh you have a proposal of what you want to work [00:01:58] have a proposal of what you want to work on for uh your final project we will [00:02:02] on for uh your final project we will give you feedback on whether we think [00:02:03] give you feedback on whether we think it's a feasible idea or how to change it [00:02:05] it's a feasible idea or how to change it so this is very important because we [00:02:07] so this is very important because we want you to work on something that we 
[00:02:08] want you to work on something that we think has a good chance of success for [00:02:09] think has a good chance of success for the rest of the quarter that's going to [00:02:11] the rest of the quarter that's going to be out tonight we'll have an ad [00:02:12] be out tonight we'll have an ad announcement when it is out [00:02:15] announcement when it is out um and we want to get you feedback on [00:02:16] um and we want to get you feedback on that pretty quickly uh because you know [00:02:19] that pretty quickly uh because you know you'll be working on this after [00:02:21] you'll be working on this after assignment five is done really the major [00:02:22] assignment five is done really the major core component of the course uh after [00:02:26] core component of the course uh after that is the um is the final project [00:02:29] that is the um is the final project okay any questions [00:02:32] cool okay [00:02:34] cool okay um [00:02:35] um okay so so let's let's kind of take a [00:02:38] okay so so let's let's kind of take a look back into what we've done so far in [00:02:41] look back into what we've done so far in this course and sort of see uh what you [00:02:45] this course and sort of see uh what you know what we were doing in natural [00:02:46] know what we were doing in natural language processing what was our [00:02:47] language processing what was our strategy if you had a natural language [00:02:48] strategy if you had a natural language processing problem and you wanted to say [00:02:50] processing problem and you wanted to say take like your best effort attempt at it [00:02:53] take like your best effort attempt at it without doing anything too fancy you [00:02:54] without doing anything too fancy you would have said okay I'm going to have [00:02:56] would have said okay I'm going to have you know a bi-directional lstm uh [00:02:59] you know a bi-directional lstm uh instead of a simple RNN right I'm going [00:03:01] instead of a simple RNN 
right I'm going to use an lstm uh to encode my sentences [00:03:04] to use an lstm uh to encode my sentences I get bi-directional context and um if I [00:03:07] I get bi-directional context and um if I have an output that I'm trying to [00:03:08] have an output that I'm trying to generate right I'll have like a [00:03:09] generate right I'll have like a unidirectional lstm you know that I was [00:03:13] unidirectional lstm you know that I was going to generate one by one so you have [00:03:14] going to generate one by one so you have a translation or a parse or whatever and [00:03:17] a translation or a parse or whatever and so maybe I've encoded in a [00:03:18] so maybe I've encoded in a bi-directional LCM The Source sentence [00:03:20] bi-directional LCM The Source sentence and I'm sort of you know one by one [00:03:22] and I'm sort of you know one by one decoding out the the target with my [00:03:24] decoding out the the target with my unidirectional LCM and then uh also [00:03:27] unidirectional LCM and then uh also right I was going to use something like [00:03:29] right I was going to use something like attention to give flexible access to [00:03:32] attention to give flexible access to memory uh if I you know felt like I [00:03:35] memory uh if I you know felt like I needed to do this sort of look back and [00:03:36] needed to do this sort of look back and see where I want to translate from okay [00:03:38] see where I want to translate from okay and this was just working uh [00:03:40] and this was just working uh exceptionally well and we we motivated [00:03:42] exceptionally well and we we motivated so you know attention through wanting to [00:03:45] so you know attention through wanting to do machine translation and you have this [00:03:46] do machine translation and you have this this bottleneck where you don't want to [00:03:48] this bottleneck where you don't want to have to encode the whole sentence Source [00:03:50] have to encode the whole sentence 
Source sentence in a single vector [00:03:52] sentence in a single vector okay and in this lecture we have the [00:03:54] okay and in this lecture we have the same goal so we're going to be looking [00:03:55] same goal so we're going to be looking at a lot of the same problems that we [00:03:57] at a lot of the same problems that we did previously but we're going to use [00:03:59] did previously but we're going to use different building blocks we're going to [00:04:01] different building blocks we're going to say [00:04:02] say um you know uh if if 2014 to 2017-ish I [00:04:06] um you know uh if if 2014 to 2017-ish I was using recurrence uh through lots of [00:04:08] was using recurrence uh through lots of trial and error years later uh it was we [00:04:11] trial and error years later uh it was we had these like brand new building blocks [00:04:12] had these like brand new building blocks that we could plug in sort of you know [00:04:14] that we could plug in sort of you know uh direct replacement for lstms and [00:04:17] uh direct replacement for lstms and they're going to allow for just a huge [00:04:20] they're going to allow for just a huge range of much more successful [00:04:22] range of much more successful applications and um and so what what are [00:04:25] applications and um and so what what are the what what are the issues with the [00:04:28] the what what are the issues with the recurrent neural networks we used to use [00:04:29] recurrent neural networks we used to use and what are the new systems that we're [00:04:31] and what are the new systems that we're going to use sort of from this point [00:04:32] going to use sort of from this point moving forward [00:04:34] moving forward okay so um so one of the issues with [00:04:36] okay so um so one of the issues with with a recurrent neural network uh is [00:04:39] with a recurrent neural network uh is what we're going to call linear [00:04:40] what we're going to call linear interaction distance so as 
we know uh [00:04:43] interaction distance so as we know uh you know [00:04:44] you know rnns are unrolled left to right or right [00:04:46] rnns are unrolled left to right or right to left depending on the language and [00:04:48] to left depending on the language and the direction okay but it encodes the [00:04:50] the direction okay but it encodes the sort of notion of linear locality which [00:04:52] sort of notion of linear locality which is useful because if two words occur [00:04:54] is useful because if two words occur right next to each other sometimes [00:04:56] right next to each other sometimes they're actually quite related so tasty [00:04:58] they're actually quite related so tasty Pizza they're nearby and in the [00:05:00] Pizza they're nearby and in the recurrent neural network right you sort [00:05:02] recurrent neural network right you sort of encode you know tasty and then you [00:05:04] of encode you know tasty and then you sort of walk one step and you encode [00:05:06] sort of walk one step and you encode Pizza [00:05:08] Pizza um so nearby words do often affect each [00:05:11] um so nearby words do often affect each other's meanings [00:05:12] other's meanings um but you know you have this this [00:05:14] um but you know you have this this problem where very long distance [00:05:15] problem where very long distance dependencies can take a very long time [00:05:18] dependencies can take a very long time to interact so if I have the sentence [00:05:20] to interact so if I have the sentence the chef [00:05:21] the chef so those are those are nearby those [00:05:22] so those are those are nearby those interact with each other [00:05:24] interact with each other and then uh who and then a bunch of [00:05:28] and then uh who and then a bunch of stuff like the chef who went to the [00:05:29] stuff like the chef who went to the stores and picked up the ingredients and [00:05:32] stores and picked up the ingredients and you know loves garlic [00:05:35] 
you know loves garlic um and then was right like I actually [00:05:37] um and then was right like I actually have an RNN step right this sort of [00:05:40] have an RNN step right this sort of application of the recurrent weight [00:05:42] application of the recurrent weight Matrix and some element-wise [00:05:44] Matrix and some element-wise non-linearities once twice three times [00:05:46] non-linearities once twice three times right sort of as many times as there is [00:05:48] right sort of as many times as there is potentially the the length of the [00:05:50] potentially the the length of the sequence between chef and was right and [00:05:54] sequence between chef and was right and it's the chef who was so this is a long [00:05:55] it's the chef who was so this is a long distance dependency should feel kind of [00:05:58] distance dependency should feel kind of you know related to the stuff that we [00:05:59] you know related to the stuff that we did in dependency syntax but you know [00:06:01] did in dependency syntax but you know it's quite difficult [00:06:03] it's quite difficult uh to learn potentially that these words [00:06:07] uh to learn potentially that these words should be related so if you have sort of [00:06:10] should be related so if you have sort of a lot of steps uh between [00:06:13] a lot of steps uh between uh between words [00:06:16] uh between words um [00:06:18] you know it can be difficult to learn [00:06:20] you know it can be difficult to learn the dependencies between them you know [00:06:22] the dependencies between them you know we talked about all these gradient [00:06:23] we talked about all these gradient problems lstms do a lot better at [00:06:25] problems lstms do a lot better at modeling the gradients uh across long [00:06:28] modeling the gradients uh across long distances than simple recurrent neural [00:06:31] distances than simple recurrent neural networks but it's not perfect [00:06:33] networks but it's not perfect um 
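To make "a lot of steps between chef and was" concrete, here is a hedged toy calculation (a scalar tanh RNN with made-up numbers, not anything from the lecture). By the chain rule, the gradient of a late hidden state with respect to an early one is a product of one factor per time step, so the learning signal between two words decays with the distance between them:

```python
# Toy scalar RNN: h_t = tanh(w * h_{t-1} + x_t).
# dh_T/dh_0 is a product of T Jacobian factors w * (1 - h_t^2),
# so the signal linking two distant words shrinks with their distance.
import math

def rnn_states(xs, w=0.5, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def grad_h_last_wrt_h0(xs, w=0.5):
    hs = rnn_states(xs, w)
    g = 1.0
    for h in hs[1:]:                 # chain rule through every time step
        g *= w * (1.0 - h * h)       # d tanh(z)/dz = 1 - tanh(z)^2
    return g

print(abs(grad_h_last_wrt_h0([0.1] * 3)))   # few words in between: noticeable
print(abs(grad_h_last_wrt_h0([0.1] * 30)))  # many words in between: vanishingly small
```

An LSTM's gating changes these per-step factors so the product decays far more slowly, which is why LSTMs help with long distances but, as the lecture says, don't fully fix the problem.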
[00:06:33] um and we already know sort of that this [00:06:35] and we already know sort of that this linear linear order isn't sort of the [00:06:37] linear linear order isn't sort of the right way to think about about sentences [00:06:40] right way to think about about sentences so if I wanted to learn that it's the [00:06:43] so if I wanted to learn that it's the chef who [00:06:45] chef who uh was then you know I might have a hard [00:06:49] uh was then you know I might have a hard time doing it because the gradients have [00:06:51] time doing it because the gradients have to propagate from west to Chef and you [00:06:54] to propagate from west to Chef and you know uh really I'd like more direct [00:06:55] know uh really I'd like more direct connection between words that might be [00:06:57] connection between words that might be related in the sentence or in a document [00:07:00] related in the sentence or in a document even right if these are going to get [00:07:01] even right if these are going to get much longer [00:07:03] um so so this is this linear interaction [00:07:05] um so so this is this linear interaction distance problem we would like words [00:07:07] distance problem we would like words that might be related to be able to [00:07:09] that might be related to be able to interact with each other in the neural [00:07:10] interact with each other in the neural networks computation sort of graph uh [00:07:13] networks computation sort of graph uh more easily than uh sort of being [00:07:16] more easily than uh sort of being linearly far away [00:07:19] linearly far away um yeah so that we can learn these long [00:07:21] um yeah so that we can learn these long distance dependencies better [00:07:22] distance dependencies better and there's a related problem too that [00:07:24] and there's a related problem too that again comes back to the recurrent neural [00:07:26] again comes back to the recurrent neural networks dependence on the index on the 
[00:07:29] network's dependence on the index into the sequence, often called a dependence on time. So in a recurrent neural network, the forward and backward passes have O(sequence length) many, that means in this case roughly sequence-length many, unparallelizable operations. We know GPUs are great: they can do a lot of operations at once, as long as there's no dependency between the operations in terms of time, where you have to compute one and then compute the other. But in a recurrent neural network you can't actually compute the RNN hidden state for time step 5 before you compute the RNN hidden state for time step 4, or time step 3. [00:08:08] And so you get this graph that looks very similar: if I want to compute this hidden state, I've got some word, and I have zero operations I need to do before I can compute this state; I have one operation I can do before I can compute this state; and as my sequence length grows, here I've got three operations I need to do before I can compute the state with the number three, because I need to compute this and this and that. So there are sort of three unparallelizable operations, where I'm glomming all the matrix multiplies and so on into a single one. So: one, two, three, and of course this grows with the sequence length. [00:08:47] As the sequence length grows, I can't parallelize; I can't just have a big GPU do the matrix multiply to compute this state, because I need to compute all the previous states beforehand. So these are two related problems, both with the dependence on time.
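The sequential dependence just described can be seen directly in code: a minimal Elman-style RNN forward pass in NumPy (the weights and sizes here are made-up toy values for illustration, not anything from the lecture), where each hidden state has to wait for the previous one, so the time loop cannot be parallelized.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Minimal RNN forward pass. Each h[t] needs h[t-1],
    so the loop over time is O(n) sequential, unparallelizable steps."""
    n, d = x.shape
    h = np.zeros((n, d))
    h_prev = np.zeros(d)
    for t in range(n):                      # must run in order: t depends on t-1
        h_prev = np.tanh(W_h @ h_prev + W_x @ x[t] + b)
        h[t] = h_prev
    return h

rng = np.random.default_rng(0)
d, n = 4, 6
h = rnn_forward(rng.normal(size=(n, d)),
                rng.normal(size=(d, d)) * 0.1,   # toy recurrent weights
                rng.normal(size=(d, d)) * 0.1,   # toy input weights
                np.zeros(d))
print(h.shape)  # (6, 4): one hidden state per time step
```

By contrast, the attention computations introduced next have no such loop over time.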
A question from the audience: on the linear interaction issue, I thought that was the whole point of the attention network, and for the cells that depend more on each other during training, can't we do something like attention and sort of work our way along? So the question is: with the linear interaction distance, wasn't this sort of the point of attention, that it gets around that? Can't we use something with attention to help? It won't solve the parallelizability problem, and in fact everything we do in the rest of this lecture will be attention-based, but we'll get rid of the recurrence and just do attention, more or less. So yeah, it's a great intuition. Any other questions? [00:09:51] Okay, cool. So, if not recurrence, what about attention? [00:10:00] We're going to get deep into attention today, but just for a second: attention treats
each word's representation as a query to access and incorporate information from a set of values. Previously, we were in a decoder, decoding out a translation of a sentence, and we attended to the encoder so that we didn't have to store the entire representation of the source sentence in a single vector. Today we'll think about attention within a single sentence. [00:10:28] So I've got a sentence written out here, word 1 through word T, and in the integers in the boxes I'm writing the number of unparallelizable operations you need to do before you can compute each state. For each word, you can independently compute its embedding without doing anything else first, because the embedding just depends on the word identity. And then with attention, if I want to build an attention representation of this word by looking at all the other words in the sequence, that's sort of one big operation, and I can do them in parallel for all the words: for the attention for this word, I don't need to walk left to right like I did for an RNN. [00:11:09] Again, we'll get much deeper into this, but you should have the intuition that it solves the linear interaction problem and the non-parallelizability problem, because now, no matter how far away words are from each other, I am potentially interacting: I might just attend to you even if you're very, very far away, independent of how far away you are, and I also don't need to walk along the sequence linearly; I'm treating the whole sequence at once. So the intuition is that attention allows you to look very far away
at once, and it doesn't have this dependence on the sequence index that keeps us from parallelizing operations. The rest of the lecture will talk in great depth about attention, so maybe let's just move on. [00:11:56] So let's think more deeply about attention. One thing you might think of with attention is that it's performing a kind of fuzzy lookup in a key-value store: you have a bunch of keys, a bunch of values, and it's going to help you access them. In an actual lookup table, just like a dictionary in Python for example, it's very simple: you have a table of keys, each key maps to a value, you give it a query, the query matches exactly one of the keys, and you return the value. So I've got a bunch of keys here, my query matches this key, so I return the value. Simple, fair, easy, okay, good. [00:12:39] In attention, the query matches all keys softly; there's no exact match. You compute some similarity between the query and all of the keys, and then you weight the results. So you've got a query, you've got a bunch of keys, and the query is similar to each of the keys to a different extent. You measure that similarity, between zero and one, through a softmax, and then you average the values via those weights: you do a weighted sum of the values, weighted by the similarity between the query and the keys, and you get an output. So it really is quite a bit like a lookup table, but in this soft, mushy, vector-space sense.
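That soft lookup can be sketched in a few lines of NumPy; the keys, values, and query below are toy data chosen for illustration, not anything from the lecture.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Attention as a 'fuzzy' key-value lookup: instead of returning
    the value for the one exactly matching key, weight every value by
    the softmax of its key's similarity to the query."""
    scores = keys @ query                  # dot-product similarity to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: weights in (0,1), sum to 1
    return weights @ values                # weighted sum of all the values

keys = np.eye(3)                           # three orthogonal toy keys
values = np.array([[1., 0.], [0., 1.], [5., 5.]])
out = soft_lookup(np.array([10., 0., 0.]), keys, values)
# the query strongly matches key 0, so the output is close to values[0]
```

With a sharper (larger-magnitude) query, the softmax weights approach a hard, exact-match lookup; with a flatter query, the output blends all the values.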
So I'm really doing some kind of access into the information that's stored in the key-value store, but I'm softly looking at all of the results. [00:13:40] Okay, any questions there? Cool. So what might this look like? If I was trying to represent the sentence "I went to Stanford's CS224N and learned", and I'm trying to build a representation of "learned": I have a key for each word (this is the self-attention thing that we'll get into), a key for each word and a value for each word, I've got the query for "learned", and I've got these teal-ish bars up top, which might say how much you're going to try to access each of the words. So maybe "224N" is not that important; "CS", maybe that determines what I learned; "Stanford", right; and then "learned", maybe that's important to representing itself. So you look across at the whole sentence and build
up this soft accessing of information across the sentence in order to represent "learned" in context. [00:14:35] Okay, so that's just a toy diagram; let's get into the math. We're going to look at a sequence of words w_1:n, a sequence of words in a vocabulary, like "Zuko made his uncle tea": that's a good sequence. And for each word we're going to embed it with this embedding matrix, just like we've been doing in this class: I have this embedding matrix E that goes from the vocabulary size to the dimensionality d, so each word has a non-contextual, only dependent-on-itself word embedding x_i = E w_i. [00:15:06] Now I'm going to transform each word with one of three different weight matrices; this is often called key-query-value self-attention. So I have a matrix Q, which is in R^(d x d), so it maps x_i, a vector of dimensionality d, to another vector of dimensionality d, and that's going to be a query vector: it takes x_i and rotates it, shuffles it around, stretches it, squishes it, makes it different, and now it's a query. With a different learnable parameter K, another matrix, I'm going to come up with my keys, and with a different learnable parameter V, I'm going to come up with my values. So I'm taking each of the non-contextual word embeddings, each of these x_i's, and transforming each of them to come up with my query for that word, my key for that word, and my value for that word. Every word is doing each of these roles. [00:16:03] Next, I'm going to compute all pairs of similarities between the keys and queries. In the toy example we saw, I was computing the similarity between a single query, for the word "learned", and all of the keys for the entire sentence; in this context I'm computing all pairs of similarities between all queries and all keys, because I want to represent all of these sums. It's just a dot product between the two vectors: e_ij = q_i dotted with k_j, the query for word i dotted with the key for word j, and I get this score, which is a real value: it might be very large and negative, might be zero, might be very large and positive, and it's like: how much should I look at j in this lookup table? [00:16:49] Then I do the softmax: the actual weight with which I'm going to look at j from i is alpha_ij, the softmax of this score over all of the possible indices, so it's the affinity between i and j, normalized by the affinity between i and all of the possible j' in the sequence. [00:17:10] And then my output is just the weighted sum of values: the output for word i, so maybe i is 1 for "Zuko", and I'm representing it as the sum over all j of these weights, so over "Zuko" and "made" and "his" and "uncle" and "tea", times the value vector for that word j: I'm looking from i to j as much as alpha_ij. [00:17:39] Question: what is w_i? You can either think of it as a symbol in the vocabulary V, or, as we're doing here, as a one-hot vector of dimensionality size-of-vocab. In the matrix E you see that it's R^(d x |V|), where |V| is the size of the vocabulary, so when I do E multiplied by w_i, that's taking E, which is d by |V|, multiplying it by w_i, which is length |V|, and returning a vector of dimensionality d.
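The equations just described can be put together in a short NumPy sketch of single-head key-query-value self-attention: q_i = Q x_i, k_j = K x_j, v_j = V x_j, scores e_ij = q_i . k_j, weights alpha_ij by row-wise softmax, and outputs o_i = sum_j alpha_ij v_j. The random matrices below stand in for learned parameters, and there's no scaling or masking, since those haven't been introduced yet.

```python
import numpy as np

def self_attention(X, Q, K, V):
    """Single-head self-attention over non-contextual embeddings X (n x d)."""
    q = X @ Q.T                          # queries q_i = Q x_i
    k = X @ K.T                          # keys    k_i = K x_i
    v = X @ V.T                          # values  v_i = V x_i
    e = q @ k.T                          # all-pairs scores e_ij = q_i . k_j
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)    # row i holds softmax weights alpha_ij
    return a @ v                         # o_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
n, d = 5, 8                              # e.g. the 5 words of "Zuko made his uncle tea"
X = rng.normal(size=(n, d))              # stand-in for x_i = E w_i
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Q, K, V)
print(out.shape)  # (5, 8): one contextual vector per word
```

Every word plays all three roles here, which is exactly the "every word is doing each of these roles" point above.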
um it has like maybe like a column for every word in that that [00:18:23] column for every word in that that sentence in each column is a length V [00:18:25] sentence in each column is a length V yeah usually I guess we think of it as [00:18:27] yeah usually I guess we think of it as having a I mean if I'm putting the the [00:18:29] having a I mean if I'm putting the the sequence length index first you might [00:18:32] sequence length index first you might think of having a row for each word but [00:18:33] think of having a row for each word but but similarly yeah it's it's n which is [00:18:36] but similarly yeah it's it's n which is the sequence length and then the second [00:18:37] the sequence length and then the second dimension would be V which is the [00:18:39] dimension would be V which is the vocabulary size and then that gets [00:18:41] vocabulary size and then that gets mapped to this thing which is sequence [00:18:43] mapped to this thing which is sequence length by D [00:18:46] um why do we learn two different [00:18:47] um why do we learn two different matrices q and K when like Q transpose [00:18:51] matrices q and K when like Q transpose Qi transpose KJ is really just one [00:18:54] Qi transpose KJ is really just one Matrix in the middle between that's a [00:18:56] Matrix in the middle between that's a great question it ends up being because [00:18:58] great question it ends up being because this will end up being a low rank [00:19:00] this will end up being a low rank approximation to that Matrix so it is [00:19:02] approximation to that Matrix so it is for computational efficiency reasons [00:19:05] for computational efficiency reasons although it also I think feels kind of [00:19:07] although it also I think feels kind of nice and uh in the presentation but yeah [00:19:10] nice and uh in the presentation but yeah what we'll end up doing is having a very [00:19:12] what we'll end up doing is having a very low rank approximation to qk transpose 
[00:19:14] and so you actually do do it like this. It's a good question. [00:19:26] Another question, let me remember to repeat it: e_ii, the query of a word dotted with its own key, does that look like anything in particular, like the identity? Okay, so it's actually unclear, this question of whether you should look at yourself when representing yourself. It's going to be encoded by the matrices Q and K. If I didn't have Q and K in there, if those were the identity matrices, then this would be the dot product of a vector with itself, which is going to be high on average, since you're pointing in the same direction as yourself. But Qx_i and Kx_i might be arbitrarily different from each other: Q could be the identity, and K could map you to the negative of yourself, for example, so that you don't look at yourself. So this is all learned in practice, and the model can decide, by learning, whether you should be looking at yourself or not; that's some of the flexibility that parameterizing it as Q and K gives you that wouldn't be there if I just used the x_i's everywhere in this equation. [00:20:49] I'm going to try to move on, I'm afraid, because there's a lot to get through, but we'll keep talking about self-attention, so as more questions come up I can also potentially return to this. [00:21:01] Okay, so this is our basic building block, but there are a bunch of barriers to using it as a replacement for our LSTMs, and so what we're going to do for
this portion of the lecture is talk about the minimal components that we need in order to use self-attention as this very fundamental building block. We can't use it as it stands, as I've presented it, because there are a couple of things that we need to solve or fix. [00:21:28] One of them is that there's no notion of sequence order in self-attention. What does this mean? If I have a sentence, I'm going to move over here to the whiteboard briefly, and hopefully I'll write quite large: "Zuko made his uncle", and, let's say, "his uncle made Zuko". If I were to embed each of these words using the embedding matrix, the embedding matrix isn't dependent on the index of the word, the word index 1, 2, 3, 4, versus now "his" is over here, and "uncle". So when I compute the self-attention (and there's a lot more in the lecture notes that goes through a full example), the actual self-attention operation will give you exactly the same representations for this sequence, "Zuko made his uncle", as for this sequence, "his uncle made Zuko", and that's bad, because they're sentences that mean different things. [00:22:45] So it's this idea that self-attention is an operation on sets: you have a set of vectors that you're going to perform self-attention on, and nowhere does the exact position of the words come into play directly. [00:22:59] So we're going to encode the position of words through the keys, queries, and values that we have. Consider now representing each sequence index, from 1 to n, as a vector.
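The order-invariance point can be checked numerically: permuting the input vectors just permutes the output vectors, so the per-word representations are identical regardless of word order, which is why "Zuko made his uncle" and "his uncle made Zuko" come out the same. A small check with random data standing in for real embeddings:

```python
import numpy as np

def attn(X, Q, K, V):
    """Same key-query-value self-attention computation as before,
    repeated here only to check its sensitivity to word order."""
    e = (X @ Q.T) @ (X @ K.T).T
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ (X @ V.T)

rng = np.random.default_rng(1)
n, d = 4, 6
X = rng.normal(size=(n, d))              # stand-in word embeddings
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 3, 1])            # reorder the "words"
out, out_perm = attn(X, Q, K, V), attn(X[perm], Q, K, V)
# permuting the inputs just permutes the outputs: word order is invisible
print(np.allclose(out[perm], out_perm))  # True
```

Position vectors, introduced next, are what break this symmetry.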
Don't worry so far about how it's being made, but you can imagine representing position 1, position 2, position 3 as a vector of dimensionality d, just like we're representing our keys, queries, and values. So these are position vectors p_i. [00:23:33] If you want to incorporate the information represented by these positions into our self-attention, you can just add these p_i vectors to the inputs: if I have this embedding x_i of a word, which is the word at position i, but really just represents "the word Zuko is here", now I can say that it's the word Zuko and it's at position 5, because this vector represents position 5. [00:24:08] Okay, so how do we do this? We might only have to do it once: we can do it once, at the very input to the network.
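Adding position vectors to the input embeddings might look like the following sketch. The sinusoidal construction used here to build the p_i is one common choice; the exact constant 10000 follows the original Transformer paper rather than anything stated in the lecture.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Position vectors p_i in R^d: even dimensions use sine, odd use
    cosine, with periods that grow with the dimension index."""
    pos = np.arange(n)[:, None]                 # positions i = 0..n-1
    dim = np.arange(0, d, 2)[None, :]           # one entry per sin/cos pair
    angles = pos / (10000 ** (dim / d))         # varying periods per dimension
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

n, d = 8, 16
X = np.random.default_rng(0).normal(size=(n, d))  # stand-in embeddings x_i
P = sinusoidal_positions(n, d)
X_pos = X + P     # done once, at the input: x_i becomes x_i + p_i
```

Because every p_i is distinct, the same word at two different positions now gets two different input vectors, which is exactly what the unordered self-attention operation was missing.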
and then that's sufficient; we don't have to do it at every layer, because the network knows it from the input. [00:24:23] So one way in which people have done this is sinusoidal position representations. It looks a little bit like this: you have this vector p_i, which is in dimensionality d, and in each one of the dimensions you take the value i, modify it by some constant, and pass it to the sine or cosine function, and you get values that vary with differing periods depending on the dimension. So I've got a representation of a matrix here where d is the vertical dimension and n is the horizontal, and you can see that as I walk along, the period of the sine function goes up and down, and
each of the dimensions d has a different period, and so together you can represent a bunch of different position indices. [00:25:17] It gives this intuition that maybe the absolute position of a word isn't as important; you've got the periodicity of the sines and cosines, and maybe that allows you to extrapolate to longer sequences, but in practice that doesn't work. Still, this is an early notion that's sometimes used even now for representing position in Transformers and self-attention networks in general. [00:25:42] So that's one idea. You might think it's a little bit complicated, a little bit unintuitive. Here's something that feels a little bit more deep learning: we just say, I've got a maximum sequence length of n, and I'm going to learn a matrix of dimensionality d by n.
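The sinusoidal construction described a moment ago can be written out concretely. This is a minimal NumPy sketch, not code from the lecture; the function name is invented here, and the 10000 base and sine/cosine interleaving follow the standard Transformer convention.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Return an (n, d) matrix whose row i is the position vector p_i.

    Even dimensions hold sin(i / 10000^(k/d)) and odd dimensions the
    matching cosine, so each pair of dimensions oscillates with a
    different period, as in the matrix picture on the slide.
    """
    i = np.arange(n)[:, None]              # positions 0..n-1, shape (n, 1)
    k = np.arange(0, d, 2)[None, :]        # even dimension indices, (1, d/2)
    angles = i / (10000.0 ** (k / d))      # (n, d/2) table of angles
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# Incorporating position is then just addition to the word embeddings:
# X = embeddings + sinusoidal_positions(X.shape[0], X.shape[1])
```

Because the values come from fixed sines and cosines, nothing here is learned; the hope mentioned in the lecture is that the periodic structure, rather than absolute indices, might transfer to longer sequences.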
That matrix is going to represent my positions, and I'm going to learn it as a parameter, just like I learn every other parameter. And what do the entries mean? I have no idea, but it represents position. [00:26:14] So you just add this matrix to the x_i's, your input embeddings, and it learns to fit the data; whatever index-based linear representation of position you want, you can learn it. The con is that you now definitely can't represent anything longer than n words: no sequence longer than n can be handled, because you only learned a matrix with that many positions. So in practice you'll get a model error if you pass a self-attention model something longer than length n; it will just crash and say, "I can't do this."
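The learned alternative is even shorter to sketch. The class below is a hypothetical illustration (the names are mine, and the matrix is just randomly initialized since the training loop is omitted): the position matrix is an ordinary parameter of shape (n_max, d) added to the input embeddings, and, as noted in the lecture, anything longer than n_max simply cannot be represented.

```python
import numpy as np

class LearnedPositions:
    """Learned absolute position embeddings: an (n_max, d) matrix
    trained like any other parameter."""

    def __init__(self, n_max, d, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(scale=0.02, size=(n_max, d))  # a parameter

    def add_to(self, X):
        """Add position vectors to an (n, d) stack of word embeddings."""
        n = X.shape[0]
        if n > self.P.shape[0]:
            # the crash described in the lecture: there is simply no
            # learned vector for positions past n_max
            raise ValueError(f"sequence length {n} > maximum {self.P.shape[0]}")
        return X + self.P[:n]
```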
And so this is what most systems nowadays use. There are also more flexible representations of position, including a couple in the lecture notes you might want to look at, such as encoding the relative linear position of words (whether words come before or after each other, but not their absolute position). There are also representations that hearken back to our dependency syntax, the idea being that maybe words that are close in the dependency parse tree should be the things that are close in the self-attention operation. [00:27:24] Okay, questions? [00:27:28] [Student] In practice, do we typically just make n large enough that we don't run into the issue of having an input longer than it? [00:27:38] So the question is: in practice, do we just make n long enough that we never have to look at a text longer than n? No, in practice it's
actually quite a problem, even today, even in the biggest language models. "Can I fit this prompt into ChatGPT?" or whatever is the kind of thing you might see on Twitter; these continue to be issues, and part of it is because the self-attention operation (we'll get into this later in the lecture) has quadratic complexity in the sequence length, so you're going to spend an n-squared memory budget in order to make sequence lengths longer. So in practice, on a large model, n might be, say, 4,000 or so. You can fit four thousand words, which feels like a lot, but it's not going to fit a novel, and it's not going to fit a Wikipedia page. There are models that do longer sequences, for sure, and again we'll talk a bit about that, but no, this actually is an issue.
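The n-squared cost in that answer is easy to put numbers on. This is my own back-of-envelope arithmetic, assuming one float32 score per pair of positions and ignoring everything else the model stores:

```python
# Each pair of positions gets one attention score, so the score matrix
# alone is n x n. At the n = 4000 mentioned in the lecture, in float32:
n = 4000
bytes_per_score = 4                     # float32
mib = n * n * bytes_per_score / 2**20   # mebibytes
# roughly 61 MiB for a single attention matrix, per head, per layer;
# doubling n to 8000 quadruples this to roughly 244 MiB
```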
[00:28:48] [Student] How do you know that this P matrix you've learned is representing position, as opposed to anything else? [00:28:56] The reason is that the only thing that correlates with it is position. So I'm adding this P matrix to my X matrix, the word embeddings; I'm adding them together. The words that show up at each index will vary depending on what word actually showed up there in each example, but the P matrix never differs: it's always exactly the same at every index, and so position is the only thing in the data that it correlates with. You're learning it implicitly: this vector at index one is always at index one, for every example, for every gradient update, and nothing else co-occurs like that. [00:29:31] What you end up learning, I don't know; it's unclear. But it definitely allows you to know that this word is at
this index, yeah. [00:29:41] [Student question, mostly inaudible] [00:29:56] Okay, so the question is: when this is quadratic in the sequence, is that a sequence of words? Yeah, think of it as a sequence of words. Sometimes there will be pieces that are smaller than words, which we'll go into in the next lecture, but yeah, think of this as a sequence of words, and not necessarily just a sentence; maybe an entire paragraph, or an entire document, or something like that. And yeah, the attention is words to words. [00:30:23] Okay, cool, I'm going to move on. [00:30:26] So we have another problem: based on the presentation of self-attention that we've done, there are really no non-linearities for the deep learning magic; we're just computing weighted averages of stuff. So if I apply self-attention, and then apply self-attention
again, and then again and again and again (you should look at the next lecture notes if you're interested in this; it's actually quite cool), what you end up doing is just re-averaging value vectors together. You're computing averages of value vectors, and it ends up looking like one big self-attention. [00:31:04] But there's an easy fix for this if you want the traditional deep learning magic: you can just add a feed-forward network to post-process each output vector. So I've got a word here that's the output of self-attention, and I'm going to pass it through what in this case I'm calling a multi-layer perceptron, an MLP. Its output is a vector in R^d, and it takes as input a vector in R^d, and you do the usual multi-layer perceptron thing, where you take
the output, multiply it by a matrix, pass it through a non-linearity, and multiply it by another matrix. [00:31:36] Okay, so what this looks like in self-attention: I've got this sentence, "the chef who the food", and I've got my embeddings for it. I pass it through this whole big self-attention block, which looks at the whole sequence and incorporates context and all that, and then I pass each one individually through a feed-forward layer. So this embedding that's the output of the self-attention for the word "the" is passed independently through a multi-layer perceptron here, and you can think of that as combining together, or processing, the result of attention. [00:32:11] There are a number of reasons why we do this. One of them is that you can actually stack a ton of computation into these feed-forward networks very efficiently.
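That per-word MLP can be sketched directly. This is an illustrative NumPy version (the names are mine, biases are included, and ReLU is chosen as the non-linearity); the key property is that each row, meaning each position, is processed independently.

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    """Post-process self-attention outputs H (shape (n, d)) with the
    same two-layer MLP at every position:

        output_i = relu(h_i W1 + b1) W2 + b2

    Written as matrix products over rows, so no information moves
    between positions in this step."""
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2
```

Because the same (W1, b1, W2, b2) apply at every position, stacking lots of computation here stays fully parallel across the sequence, which is part of what makes it GPU-friendly.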
They're very parallelizable, very good for GPUs. And this is what's done in practice: you do self-attention, and then you pass it through this position-wise feed-forward layer, where every word is processed independently by the feed-forward network. [00:32:39] Okay, so that adds our classical deep learning non-linearities to self-attention, and that's an easy fix for the no-non-linearities problem. And then we have one last issue before we have our final minimal self-attention building block with which we can replace RNNs, which is this: in all of these examples of self-attention that I've been writing out, you can look at the entire sequence. But in practice, for some tasks such as machine translation or language modeling, whenever you want to define
a probability distribution over a sequence, you can't cheat and look at the future. [00:33:20] So at every time step I could define the set of keys and queries and values to include only past words, but this is inefficient (bear with me): it's inefficient because you can't parallelize it so well. So instead we compute the entire n-by-n matrix, just like I showed in the slide discussing self-attention, and then mask out words in the future. So this score e_ij (and I computed e_ij for all n-by-n pairs of words) is equal to whatever it was before if the word you're looking at, at index j, is at an index less than or equal to where you are, index i; and it's equal to negative infinity, roughly, otherwise, if it's in the future. And when you softmax the e_ij, negative infinity gets mapped to zero, so now my attention is weighted zero:
my weighted average puts zero weight on the future, so I can't look at it. [00:34:18] What does this look like? In order to encode these words, "the chef who", and maybe the start symbol there, I could look at all pairs of words, and then I just gray out (negative-infinity out) the words I can't look at. So encoding the start symbol, I can just look at the start symbol. When encoding "the", I can look at the start symbol and "the". Encoding "chef", I can look at start, "the", "chef", but I can't look at "who". And so with this representation of "chef", which is only looking at start, "the", "chef", I can define a probability distribution using this vector that allows me to predict "who", without having cheated by already looking ahead and seeing that "who" is the next word. [00:35:09] Questions?
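The masking recipe just described (compute all n-by-n scores, set future scores to negative infinity, then softmax) can be sketched as follows; this is a minimal single-head version with unscaled dot-product scores, assuming Q, K, V have already been computed.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked self-attention: e_ij = q_i . k_j when j <= i, and -inf
    when j > i, so the softmax puts exactly zero weight on the future.

    Returns the (n, d) outputs and the (n, n) attention weights."""
    n = Q.shape[0]
    E = Q @ K.T                                         # all n*n scores e_ij
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # entries with j > i
    E = np.where(future, -np.inf, E)                    # -inf on the future
    E -= E.max(axis=1, keepdims=True)                   # numerically stable softmax
    A = np.exp(E)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V, A
```

The diagonal is kept (j <= i includes j = i), so each word can always attend to itself, matching the start-symbol example above.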
[00:35:11] [Student] You say this is for using it in decoders; do we do this for both the encoding layer and the decoding layer, or for the encoding layer are we allowing ourselves to look forward? [00:35:23] The question is: it says here that we're using this in a decoder; do we also use it in the encoder? This is the distinction between something like a bidirectional LSTM and a unidirectional LSTM. Wherever you don't need this constraint, you probably don't use it. So if you're running an encoder on the source sentence of your machine translation problem, you probably don't do this masking, because it's probably good to let everything look at everything. And whenever you do need it, because you have this autoregressive factorization (probability of word one, probability of word two given word one, word three given words two and one), then you would use it. So traditionally, yes: in decoders you will use it; in encoders you will not. [00:36:02] Yes? [00:36:04] [Student] My question is somewhat
philosophical: don't humans actually generate sentences by having some notion of the probability of future words before they choose the words that they're currently speaking or writing? [00:36:27] Good question. So the question is: isn't looking ahead a little bit, predicting or getting an idea of the words you might say in the future, sort of how humans generate language, instead of this strict constraint of not seeing into the future? Okay, so: trying to plan ahead to see what I should do is definitely an interesting idea. But when I am training the network, if I'm teaching it to predict the next word and I give it the answer, it's not going to learn anything useful. So in practice, when I'm generating
text, maybe it would be a good idea to make some guesses far into the future, or to have a high-level plan or something. But in training the network, I can't encode that intuition about how humans generate sequences of language by just giving it the answer about the future directly, because then it's just too easy; there's nothing to learn. There might be interesting ideas about giving the network a hint as to what kind of thing could come next, for example, but that's out of scope for this. Yeah. [00:37:31] [Student] Question up here: I understand why we want to mask the future for things like language models, but how does it apply to machine translation? Why would we use it there? [00:37:40] Yeah. So in machine translation, I'm going to come over to this board and hopefully get a better marker. Nice. In machine translation, I
have a sentence like "I like pizza", and I want to be able to translate it: "j'aime la pizza". Nice. [00:38:09] So when I'm looking at "I like pizza", I get this as the input, and I want self-attention without masking, because I want "I" to look at "like", and "I" to look at "pizza", and "like" to look at "pizza"; I want it all. And then when I'm generating this side, if my tokens are, say, "j'aime", "la", "pizza", then in encoding this word I want to be able to look only at myself (we'll talk about encoder-decoder architectures later in the lecture), at none of the future, and at all of the source. And so what I'm talking about right now in this masking case is masking out, with negative infinity, all of these future words, so that the attention score
from this word to everything else in the future is set to negative infinity. Does that answer your question? Great. [00:39:09] Okay, let's move ahead. So that was our last big building-block issue with self-attention. This is what I would call (and this is my personal opinion) a minimal self-attention building block: you have self-attention, the basis of the method, which is here in red; you have the inputs to the sequence here, and you embed them with that embedding matrix E, and then you add position embeddings. These three arrows represent using the key, the value, and the query; that's stylized there, and this is often how you see these diagrams. And so you pass it to self-attention, with the position representation, so that specifies
the sequence order, because otherwise you'd have no idea what order the words showed up in. You have the non-linearities in the teal feed-forward network there, to provide that squashing and deep-learning expressivity. And then you have masking, in order to have parallelizable operations that don't look at the future.

[00:40:15] So this is our minimal architecture, and then up at the top, maybe you repeat this self-attention and feed-forward pair many times: self-attention, feed-forward, self-attention, feed-forward. That's what I'm calling a block. And then maybe at the end of it you predict something; we haven't really talked about that, but you have these representations and then you predict the next word, or you predict the sentiment, or you
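The minimal block just described (embed the tokens, add position embeddings, run masked self-attention, then a feed-forward network) can be sketched end to end in numpy. This is a toy illustration, not the course's code: the dimensions, the random weights, and the choice of ReLU as the non-linearity are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, vocab = 5, 8, 50             # toy sequence length, model dim, vocab size

E = rng.normal(size=(vocab, d))    # embedding matrix
pos = rng.normal(size=(n, d))      # position embeddings, one per position
Q = rng.normal(size=(d, d))        # query matrix
K = rng.normal(size=(d, d))        # key matrix
V = rng.normal(size=(d, d))        # value matrix
W1 = rng.normal(size=(d, 4 * d))   # feed-forward weights
W2 = rng.normal(size=(4 * d, d))

ids = np.array([3, 14, 15, 9, 2])  # a toy token-id sequence
X = E[ids] + pos                   # embed the sequence, then add positions

# masked self-attention: scores to future positions are set to -inf,
# so after the softmax they get exactly zero weight
scores = (X @ Q) @ (X @ K).T                  # n x n attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf
A = softmax(scores, axis=-1)                  # attention weights, rows sum to 1
attn_out = A @ (X @ V)                        # weighted average of value vectors

# position-wise feed-forward network (ReLU chosen for illustration)
H = np.maximum(attn_out @ W1, 0.0) @ W2       # n x d block output
```

In a real model you would repeat the attention plus feed-forward pair several times and put a prediction layer (next word, sentiment, and so on) on top of H.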
predict whatever. So this is a self-attention architecture. We're going to move on to the Transformer next, so if there are any questions: yeah?

[00:40:52] The other way around: we use masking for decoders, where I want to decode out a sequence and I have an informational constraint, namely that to represent this word properly I cannot have information from the future.

[00:41:15] Great, so now let's talk about the Transformer. What I've pitched to you is what I call a minimal self-attention architecture, and I quite like pitching it that way, but really no one uses the architecture that was just up on the previous slide. It doesn't work quite as well as it could, and there are a bunch of important details that go into the Transformer that we'll talk about now. But what I would hope, though, to
have you take away is that the Transformer architecture, as I'll present it now, is not necessarily the end point of our search for better and better ways of representing language, even though it's now ubiquitous and has been for a couple of years. So think about these problems of using self-attention, and maybe ways of fixing some of the issues with Transformers.

[00:42:08] Okay, so a Transformer decoder is how we'll build systems like language models. It's like our self-attention-only minimal decoder architecture, but it's got a couple of extra components, some of which I've grayed out here, that we'll go over one by one. The first that's actually different is that we'll replace our self-attention-with-masking with masked multi-head self-attention. This ends up being
crucial; it's probably the most important distinction between the Transformer and the minimal architecture I've presented.

[00:42:43] So let's come back to our toy example of attention, where we've been trying to represent the word "learned" in the context of the sequence "I went to Stanford CS224N and learned". I was giving these teal bars to say that maybe, intuitively, you look at various things to build up your representation of "learned". But really there are varying ways in which I want to look back at the sequence, to see varying aspects of the information I want to incorporate into my representation. So maybe I want to look at "Stanford CS224N" because those are entities: you learn different stuff at Stanford CS224N than you do at other courses or other universities or
whatever, and so maybe I want to look there for that reason. And in another sense I actually want to look at the word "learned" itself, and at "I went and learned": maybe syntactically relevant words. There are very different reasons for which I might want to look at different things in the sequence, and trying to average it all out with a single operation of self-attention ends up being somewhat too difficult, in a way that we'll make precise in assignment five, where we'll do a little bit more math. Okay, any questions about this intuition?

[00:44:13] Yeah, so each head should be an application of attention just as I've presented it: independently define the keys, define the queries, define the values. I'll define it more precisely here, but think of
it as: I do attention once, and then I do it again with different parameters, able to look at different things, and so on.

[00:44:38] So the question is: if we have two separate sets of weights trying to learn, say, to do this and to do that, how do we ensure that they learn different things? We do not ensure it; we hope that they learn different things, and in practice they do, although not perfectly. It ends up being the case that you have some redundancy, and you can cut some of the heads out, but that's out of scope here. Just as we hope that different dimensions in our feed-forward layers will learn different things because of the lack of symmetry, we hope that the heads will start to specialize, and once they start, they'll specialize even more.

[00:45:16] All right, so in order to discuss
multi-head self-attention, we really need to talk about the matrices, and how we're going to implement this efficiently on GPUs, so let's look at the sequence-stacked form of attention. We've been talking about each word individually as a vector of dimensionality d, but really we're going to work on these as big stacked matrices. I take all of my word embeddings x_1 to x_n and stack them together, and now I have a big matrix X in R^{n×d}.

[00:45:49] Now, with my matrices K, Q, and V, I can just multiply them on that side of X. X is in R^{n×d} and K is in R^{d×d}, so (n×d) times (d×d) gives you n×d again. So I can compute one big matrix multiply on my whole sequence, multiplying each of the words by my key, query, and value matrices very
efficiently. This is the vectorization idea: I don't want to for-loop over the sequence; I represent the sequence as a big matrix and do one big matrix multiply.

[00:46:27] Then the output is defined by this somewhat inscrutable bit of math, which I'm going to go over visually. First we take the key-query dot products in one matrix: we've got XQ, which is in R^{n×d}, and (XK)^T, which is in R^{d×n}, so (n×d) times (d×n). This computes all of the e_ij scores for self-attention: all pairs of attention scores, computed in one big matrix multiply. Next I apply the softmax over the second n dimension to get my normalized scores, and then I multiply by XV. That is an n×n matrix multiplied by
an n×d matrix, and what do I get? This is just doing the weighted average: one big weighted-average computation on the whole matrix, giving my whole self-attention output in R^{n×d}. So I've restated the self-attention operations identically, but computed in terms of matrices, so you can do this efficiently on a GPU.

[00:47:50] Okay, multi-head attention. It's going to be important to compute this in terms of matrices too, as we'll see, and it's going to give us the ability to look in multiple places at once, for different reasons. Self-attention looks where the dot product is high: x_i times the query matrix against the key matrix. But maybe we want to look in different places for different reasons, so we actually define multiple query, key, and value matrices. I'm going to have a bunch of heads; I'm
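Written out as code, the whole single-head computation, output = softmax(XQ (XK)^T) XV, is three matrix multiplies plus a softmax over the second n dimension. A minimal sketch with toy sizes (no masking or scaling shown, and all weights random):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d = 5, 8                        # toy sequence length and model dimensionality
X = rng.normal(size=(n, d))        # stacked word vectors, R^{n x d}
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))

E_scores = (X @ Q) @ (X @ K).T     # all pairwise e_ij scores, n x n
A = softmax(E_scores, axis=-1)     # normalize over the second n dimension
out = A @ (X @ V)                  # weighted average of value vectors, n x d
```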
going to have h self-attention heads, and for each head I'm going to define an independent query, key, and value matrix, whose shape maps from the model dimensionality d down to d/h. Each of these is a projection down to a lower-dimensional space; this is for computational efficiency. Then I apply self-attention independently for each head, so this equation is identical to the one we saw for single-head self-attention, except we've got the head index l everywhere. I've got this lower-dimensional mapping, and my lower-dimensional value vector, so each head's output is in R^{d/h}. But really you're doing exactly the same kind of operation, just h different times, and then you
combine the outputs. So I've looked in different places with the different key, query, and value matrices, I get each head's output, and then I concatenate them together. Each one has dimensionality d/h, so I concatenate them and then mix them together with a final linear transformation. Each head gets to look at different things and construct its value vectors differently, and then I combine the results all together at once.

[00:49:49] Let's go through this visually, because it's at least helpful for me. It's actually not more costly to do this than to compute a single head of self-attention, and we'll see that through the pictures. In single-head self-attention we computed XQ, and in multi-head self-attention we'll also compute XQ the same way. XQ is in R^{n×d},
reshape it into our n [00:50:23] and then we can reshape it into our n that's sequence length times the number [00:50:25] that's sequence length times the number of heads times the model dimensionality [00:50:29] of heads times the model dimensionality over the number of heads so I've just [00:50:30] over the number of heads so I've just reshaped it to say now I've got you know [00:50:33] reshaped it to say now I've got you know a big three axis tensor the first axis [00:50:36] a big three axis tensor the first axis is the sequence length the second one is [00:50:38] is the sequence length the second one is the number of heads the third is this [00:50:40] the number of heads the third is this reduced model dimensionality [00:50:41] reduced model dimensionality and that costs nothing right and do the [00:50:44] and that costs nothing right and do the same thing for x and V and then I [00:50:46] same thing for x and V and then I transpose so that I've got the head axis [00:50:49] transpose so that I've got the head axis as the first axis and now I can compute [00:50:52] as the first axis and now I can compute all my other operations with the head [00:50:54] all my other operations with the head axis kind of like a batch [00:50:57] axis kind of like a batch so what does this look like in uh in [00:51:00] so what does this look like in uh in practice like instead of having one big [00:51:03] practice like instead of having one big xq Matrix that's Model dimensionality D [00:51:06] xq Matrix that's Model dimensionality D I've got like in this case three x Cube [00:51:10] I've got like in this case three x Cube matrices of Model dimensionality D by 3 [00:51:12] matrices of Model dimensionality D by 3 D by three D by three same thing with [00:51:14] D by three D by three same thing with the key Matrix here [00:51:16] the key Matrix here so everything looks almost identical [00:51:18] so everything looks almost identical it's just a reshaping of the tensors and [00:51:21] 
it's just a reshaping of the tensors, and now, right at the output of this, I've got three sets of attention scores, just by doing this reshape. The cost is that each of my attention heads has only a d/h-dimensional vector to work with instead of a d-dimensional one. So I get these three sets of pairwise scores, I compute the softmax independently for each of the three, I have three value matrices as well, each of them lower-dimensional, and finally I get my three different output vectors and a final linear transformation to mix them together into one output. In summary, this allows you to do exactly what I gave in the toy example: I can have each of these heads look at different parts of a sequence for different reasons.

[00:52:17] So this is at a given block, right; all of these attention heads are
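The reshape-and-transpose trick can be sketched directly; here `split_heads` and the output projection `Wo` are illustrative names for this sketch, not notation from the slides, and the weights are random toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d, h = 5, 12, 3                  # sequence length, model dim, number of heads
X = rng.normal(size=(n, d))
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
Wo = rng.normal(size=(d, d))        # final linear transformation mixing the heads

def split_heads(M):
    # (n, d) -> (n, h, d/h) -> (h, n, d/h): the head axis acts like a batch
    return M.reshape(n, h, d // h).transpose(1, 0, 2)

xq, xk, xv = split_heads(X @ Q), split_heads(X @ K), split_heads(X @ V)

scores = xq @ xk.transpose(0, 2, 1)  # h separate (n x n) score matrices at once
A = softmax(scores, axis=-1)         # softmax independently per head
heads = A @ xv                       # (h, n, d/h) head outputs

# concatenate the heads back to (n, d), then mix with the output projection
out = heads.transpose(1, 0, 2).reshape(n, d) @ Wo
```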
for a given Transformer block, and the next block could also have three attention heads. The question is: are all of these for a given block? We'll talk about blocks again, but a block is this pair of self-attention and feed-forward network: you do self-attention then feed-forward, and that's one block; another block is another self-attention and another feed-forward. And are the parameters shared between the blocks or not? Generally they are not shared; you'll have independent parameters at every block, although there are some exceptions.

[00:52:53] Is it typically the case that you have the same number of heads at each block, or do you vary the number of heads across blocks? So the question is: do you have different numbers of heads across the different blocks, or do
you have the same number of heads across all blocks? The simplest thing is to just have it be the same everywhere, which is what people have done; I haven't yet found a good reason to vary it, but it could be interesting. It's definitely the case that after training these networks you can totally zero out, that is, remove, some of the attention heads, and I'd be curious to know whether you could remove more or fewer depending on the layer index, which might then say we should just have fewer. But again, it's not actually more expensive to have a bunch, so people tend instead to set the number of heads so that you have a reasonable number of dimensions per head, given the total model dimensionality d that you want. For example, I might want at least 64 dimensions per head, which, if d is 128, that
tells me how many heads I'm going to have, roughly. So people tend to scale the number of heads up with the model dimensionality.

[00:54:13] By slicing it into different columns, you're reducing the rank of the final matrix, right? Does that not have any effect on the results? So the question is: with these reduced XQ and XK matrices, this little sliver and this little sliver defining this whole big matrix, it's a very low-rank approximation; is that not bad? In practice, no. Again, this is the reason we limit the number of heads depending on the model dimensionality: intuitively, you want each head to have at least some number of dimensions, so 64 is sometimes done, or 128, something like that. But if you're not giving each head too much to do, and it's got a simple job, and you've got a lot of heads, it ends
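That rule of thumb is simple arithmetic; the `num_heads` helper and the 64-dimension default below are hypothetical names used only for illustration:

```python
# hedged sketch: pick the number of heads from a per-head dimension budget,
# so heads scale up with the model dimensionality
def num_heads(d_model, min_dims_per_head=64):
    # keep at least min_dims_per_head dimensions in each head
    return max(1, d_model // min_dims_per_head)

assert num_heads(128) == 2    # d = 128, 64 dims per head -> 2 heads
assert num_heads(768) == 12   # a 768-dim model gives 12 heads of 64 dims
```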
up being okay. At the very least, all we really know is that empirically it's way better to have more heads than just one.

[00:55:14] I'm wondering, have there been studies to see whether the information that one of the attention heads learns is consistent, and related to the others? So the question is: have there been studies of whether there's consistent information encoded by the attention heads? Yes, actually, there's been quite a lot of study in interpretability and analysis of these models, trying to figure out what mechanistic roles each of these heads takes on, and there are quite a few exciting results there, around some attention heads learning to pick out, say, syntactic dependencies, or maybe doing a sort of global averaging
of [00:56:02] like a sort of a global averaging of context [00:56:03] context um the question is quite nuanced though [00:56:05] um the question is quite nuanced though because in a deep Network it's unclear [00:56:07] because in a deep Network it's unclear and we should talk about this more [00:56:09] and we should talk about this more offline but it's unclear if you look at [00:56:10] offline but it's unclear if you look at a word 10 layers deep in a network what [00:56:13] a word 10 layers deep in a network what you're really looking at because it's [00:56:15] you're really looking at because it's already Incorporated context from [00:56:17] already Incorporated context from everyone else and it's a little bit [00:56:19] everyone else and it's a little bit unclear active area of research but I [00:56:21] unclear active area of research but I think I should move on uh now to uh keep [00:56:25] think I should move on uh now to uh keep discussing Transformers but yeah if you [00:56:27] discussing Transformers but yeah if you want to talk more about it I'm happy to [00:56:30] um okay so so uh another sort of uh hack [00:56:33] um okay so so uh another sort of uh hack that I'm going to toss in here I mean [00:56:35] that I'm going to toss in here I mean maybe they wouldn't call it hack but you [00:56:36] maybe they wouldn't call it hack but you know it's a nice little method to [00:56:38] know it's a nice little method to improve things it's called scaled dot [00:56:40] improve things it's called scaled dot product attention so one of the issues [00:56:43] product attention so one of the issues with this sort of key query value [00:56:45] with this sort of key query value self-attention is that when the model [00:56:46] self-attention is that when the model dimensionality becomes large the dot [00:56:49] dimensionality becomes large the dot products between vectors even random [00:56:51] products between vectors even random vectors tend to become uh large [00:56:54] 
vectors tend to become uh large and when that happens the inputs to the [00:56:57] and when that happens the inputs to the softmax function can be very large [00:56:59] softmax function can be very large making the gradients small so [00:57:01] making the gradients small so intuitively if you have two random [00:57:03] intuitively if you have two random vectors in Model dimensionality D and [00:57:05] vectors in Model dimensionality D and you just dot product them together as D [00:57:07] you just dot product them together as D grows their dot product grows an [00:57:09] grows their dot product grows an expectation to be very large and so you [00:57:12] expectation to be very large and so you know you sort of want to start out with [00:57:14] know you sort of want to start out with everyone's attention being very uniform [00:57:16] everyone's attention being very uniform very flat sort of look everywhere but if [00:57:19] very flat sort of look everywhere but if some dot products are very large then [00:57:21] some dot products are very large then you know learning will be inhibited and [00:57:23] you know learning will be inhibited and so what you end up doing is you just [00:57:25] so what you end up doing is you just sort of for each of your heads uh you [00:57:28] sort of for each of your heads uh you know you just sort of divide all the [00:57:29] know you just sort of divide all the scores by this constant that's [00:57:30] scores by this constant that's determined by the model dimensionality [00:57:32] determined by the model dimensionality so as the vectors grow very large their [00:57:36] so as the vectors grow very large their dot products don't at least at an [00:57:39] dot products don't at least at an initialization time so this is sort of [00:57:41] initialization time so this is sort of like a nice little [00:57:42] like a nice little um [00:57:43] um you know important but but maybe not uh [00:57:48] you know important but but maybe not uh like yeah 
it's it's important to know [00:57:51] like yeah it's it's important to know um and uh so that's called scale dot [00:57:54] um and uh so that's called scale dot product attention from here on out we'll [00:57:57] product attention from here on out we'll just assume that we do this you know [00:57:58] just assume that we do this you know it's quite easy to implement you just do [00:58:00] it's quite easy to implement you just do a little division in all of your uh [00:58:02] a little division in all of your uh computations [00:58:04] okay so so now in the Transformer [00:58:07] okay so so now in the Transformer decoder we've got a couple of other [00:58:08] decoder we've got a couple of other things that I have un uh faded out here [00:58:12] things that I have un uh faded out here um we have two big optimization tricks [00:58:14] um we have two big optimization tricks or optimization methods I should say [00:58:16] or optimization methods I should say really because these are quite important [00:58:17] really because these are quite important that end up being very important we've [00:58:20] that end up being very important we've got residual connections and layer [00:58:21] got residual connections and layer normalization and in Transformer [00:58:24] normalization and in Transformer diagrams that you see sort of around the [00:58:26] diagrams that you see sort of around the web they're often uh written together as [00:58:29] web they're often uh written together as this ad and Norm box and in practice in [00:58:33] this ad and Norm box and in practice in the Transformer decoder I'm going to you [00:58:35] the Transformer decoder I'm going to you know apply mask multi-head attention and [00:58:38] know apply mask multi-head attention and then do this sort of optimization add a [00:58:40] then do this sort of optimization add a norm then I'll do a feed forward [00:58:42] norm then I'll do a feed forward application and then add a norm so you [00:58:45] application 
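As a side note, the scaled dot-product attention just described can be sketched in a few lines of NumPy for a single head. This is my own illustration, not code from the lecture; variable names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (n, d_k); V: array of shape (n, d_v)."""
    d_k = Q.shape[-1]
    # Divide the scores by sqrt(d_k) so they don't grow with the
    # model dimensionality at initialization.
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 4, 64
Q, K, V = rng.normal(size=(3, n, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64)
```

The only change from plain dot-product attention is that one division, which keeps the initial softmax inputs small and the gradients healthy.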
[00:58:48] You know, this is quite important, so let's go over these two individual components. The first is residual connections. I think we've talked about residual connections before, right? Well, it's worth doing it again, because it's really a good trick to help models train better. [00:59:04] So just to recap: you have a layer, layer i minus one, and you pass it through a thing, maybe it's self-attention, maybe it's a feed-forward network, and now you've got layer i. I'm going to add the result of layer i to its input, so now I'm just going to compute the layer and add in the input to the layer, so that I only have to learn the residual from the previous layer. So I've got this sort of connection here; it's often written as this, sort of like, oh, this connection.
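A tiny sketch of that recap, where the sublayer is a stand-in linear-plus-ReLU with small random weights (none of this is the lecture's actual code):

```python
import numpy as np

def sublayer(x, W, b):
    # Stand-in for self-attention or a feed-forward network.
    return np.maximum(0, x @ W + b)

def residual_block(x, W, b):
    # layer_i = x + sublayer(x): the block only has to learn the
    # *residual*, and gradients flow through the identity path untouched.
    return x + sublayer(x, W, b)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W = rng.normal(size=(d, d)) * 0.01   # small weights at initialization
b = np.zeros(d)
y = residual_block(x, W, b)
# With small weights the whole block stays close to the identity,
# so this prints a small number:
print(np.abs(y - x).max())
```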
[00:59:37] Okay, right, it goes around, and you should think that the gradient is just really great through the residual connection. If I've got vanishing or exploding gradients through this layer, well, I can at least learn everything behind it, because I've got this residual connection where the gradient is one, because it's the identity. [01:00:00] This is really nice, and it also means that, at least at initialization, everything looks a little bit like the identity function, right? Because if the contribution of the layer is somewhat small, because all of your weights are small, and I have the addition from the input, maybe the whole thing looks a little bit like the identity, which might be a good place to start.

[01:00:20] And there are really nice visualizations; I just love this visualization. So this is your loss landscape, right? You're doing gradient descent, and you're trying to traverse the mountains of the loss landscape; this is the parameter space, and down is better in your loss function. And it's really hard, so you get stuck in some local optima and you can't find your way out. And then, with residual connections, I mean, come on, you just sort of walk down. I mean, it's not actually, I guess, really how it works all the time, but I really love this, it's great.

[01:00:58] Okay, so we've seen residual connections; we should move on to layer normalization. Layer norm is another thing to help your model train faster, and, you know, the intuitions around layer normalization and the empiricism of it working very well maybe aren't perfectly, let's say, connected. But you
should imagine, I suppose, that we want to, say... you know, this variation within each layer: things can get very big, things can get very small, and that's not actually informative, because of variations between maybe the gradients, or, you know, I've got sort of weird things going on in my layers that I can't totally control. I haven't been able to make everything behave nicely, where everything stays roughly the same norm; maybe some things explode, maybe some things shrink, and I want to cut down on uninformative variation between layers. [01:01:59] So I'm going to let x ∈ ℝ^d be an individual word vector in the model, so this is at a single index, one vector, and what I'm going to try to do is just normalize it, in the sense that it's got a bunch of variation, and I'm going to cut that out: I'm going to normalize it to zero mean and unit standard deviation. [01:02:20] So I'm going to estimate the mean across all of the dimensions in the vector: for j equals one to the model dimensionality d, I've got this one big word vector and I sum up all the values, with the division by d here, right, that's the mean µ. And I'm going to have my estimate of the standard deviation σ; again, these should say estimates, this is my simple estimate of the standard deviation of the values within this one vector. [01:02:53] And then, possibly, I can have learned parameters to try to scale back out, multiplicatively and additively; that's optional. We're going to compute this standardization: take my vector x, subtract out the mean, and divide by the standard deviation plus this epsilon constant; [01:03:15] if there's not a lot of variation, I don't want things to explode, so I'm going to have this epsilon there that's close to zero. So this part here, (x − µ)/(σ + ε), is saying: take all the variation and normalize it to zero mean and unit standard deviation. [01:03:32] And then maybe I want to scale it, stretch it back out, and then maybe add an offset β that I've learned, although in practice, and I discuss this in the lecture notes, this part maybe isn't actually that important. [01:03:47] So, layer normalization: you can think of this as, when I get the output of layer normalization, it's going to look nice, look similar, to the next layer, independent of what's gone on before, because it's going to be zero mean and unit standard deviation, so maybe that makes for a better thing to learn off of for the next layer.
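Here's a small sketch of exactly that computation, applied to one word vector at a time (my own illustrative code; `gamma` and `beta` stand in for the optional learned scale and offset):

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one word vector x (shape (d,)) to zero mean and unit
    standard deviation across its d dimensions, then optionally scale
    by gamma and shift by beta (learned in practice)."""
    mu = x.mean()                    # mean over the d dimensions
    sigma = x.std()                  # std over the d dimensions
    out = (x - mu) / (sigma + eps)   # eps keeps us from dividing by ~0
    if gamma is not None:
        out = gamma * out
    if beta is not None:
        out = out + beta
    return out

x = np.array([2.0, 4.0, 6.0, 8.0])
y = layer_norm(x)
print(y.mean().round(6), y.std().round(4))  # 0.0 1.0
```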
[01:04:09] Okay, any questions for residual connections or layer norm? Yes?

[01:04:16] Yeah, it's a good question: when I subtract the scalar µ from the vector x, I broadcast µ to dimensionality d and remove µ from all d dimensions. Yeah, good point, thank you, that was unclear.

[01:04:34] Sure. [Student] Should it be divided by d, or... Sorry, can you repeat that? [Student] In the fourth bullet point, when you're calculating the mean, is it divided by d? I think it is divided by d, yeah. And this is the average deviation from the mean of all of the values, yeah.

[01:05:04] [Student question, partially inaudible, about which statistics are used for normalization.]

[01:05:11] So the question is: if I have five words in the sequence, do I normalize by aggregating the statistics, estimating µ and σ across all five words so that they share their statistics, or do it independently for each word? This is a great question, which I think is under-specified in all the papers that discuss Transformers. You do not share across the five words, which is somewhat confusing to me; each of the five words is done completely independently. You could have shared across the five words and said that your estimate of the statistics is based on all five, but you do not. [01:05:49] I can't pretend I totally understand why.

[01:05:54] [Student question about sharing statistics across the batch, for the same position.]

[01:05:58] So a similar question: if you have a batch of sequences, like in batch-based training, then for a single word, we don't share the statistics across the sequence index, but do we share across the batch? And the answer is no, you also do not share across the batch. In fact, layer normalization was invented as a replacement for batch normalization, which did just that, and the issue with batch normalization is that your forward pass then depends, in a way that you don't like, on examples that should be unrelated to your example. So yeah, you don't share statistics across the batch.

[01:06:40] Okay, cool. So now we have our full Transformer decoder, and we have our blocks. In this slightly grayed-out thing here that says repeat for a number of decoder blocks, each block consists of: I pass the input through self-attention, and then my Add & Norm, right, so I've got this residual connection here that goes around, and I've got the layer normalization there, and then a feed-forward layer, and then another Add & Norm. That set of four operations is a single block, and I apply it some number of times, the number of blocks. And that's it, that's the Transformer decoder as it is. [01:07:31] Cool, so that's a whole architecture right there.
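The four operations of one decoder block can be wired up as follows; the sublayers here are stand-ins just to show the Add & Norm composition described above, not real attention and not the lecture's code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-vector normalization, applied independently to each row (word).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_block(x, self_attention, feed_forward):
    # One block: (masked) self-attention, Add & Norm,
    # then feed-forward, then Add & Norm again.
    x = layer_norm(x + self_attention(x))   # Add & Norm
    x = layer_norm(x + feed_forward(x))     # Add & Norm
    return x

rng = np.random.default_rng(0)
n, d = 5, 16
Wf = rng.normal(size=(d, d)) * 0.1
x = rng.normal(size=(n, d))
# Dummy sublayers: a zero "attention" and a linear+ReLU "feed-forward".
out = decoder_block(x, lambda h: h * 0.0, lambda h: np.maximum(0, h @ Wf))
print(out.shape)  # (5, 16)
```

In a real decoder this block is repeated for the chosen number of layers, each with its own parameters.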
We've solved things like needing to represent position, we've solved things like not being able to look into the future, and we've solved a lot of different optimization problems. You've got a question? Yes?

[01:07:49] [Student] Is the mask applied to the multi-head attention? Yeah. [Student] With the dot-product scaling, the square root of d over h, as well? Yeah.

[01:08:03] So the question is, how do these models handle variable-length inputs? The input to the GPU forward pass is going to be a constant length, so you're going to maybe pad to a constant length, and in order to not look at the padding, you can mask out the pad tokens, just like the masking that we showed for not looking at the future: in general, you can just set all of the attention weights to zero, or the scores to negative infinity, for all of the pad tokens. [01:08:47] Yeah, exactly.
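A sketch of the pad-token masking just mentioned: set the scores toward pad positions to negative infinity, so the softmax assigns them exactly zero weight (names and shapes are my own illustration):

```python
import numpy as np

def masked_scores(scores, is_pad):
    """Set attention scores toward pad tokens to -inf, so that after
    the softmax their attention weights are exactly zero.

    scores: (n, n) raw dot-product scores; is_pad: (n,) boolean."""
    masked = scores.copy()
    masked[:, is_pad] = -np.inf
    return masked

scores = np.zeros((4, 4))
is_pad = np.array([False, False, True, True])  # last two positions are padding
weights = np.exp(masked_scores(scores, is_pad))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # [0.5 0.5 0.  0. ]
```

The causal (no-looking-at-the-future) mask works the same way, just with a triangular pattern of minus-infinities instead of whole columns.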
yeah exactly so you can you can uh set everything to this maximum length now in [01:08:52] everything to this maximum length now in practice so the question was do you set [01:08:54] practice so the question was do you set this length that you have everything be [01:08:55] this length that you have everything be be that maximum length I mean you know [01:08:58] be that maximum length I mean you know yes often although you can save [01:08:59] yes often although you can save computation by setting it to something [01:09:02] computation by setting it to something smaller and uh everything the math all [01:09:05] smaller and uh everything the math all still works out you just have to code it [01:09:07] still works out you just have to code it properly so it can handle so you set [01:09:09] properly so it can handle so you set everything instead of the N you set it [01:09:10] everything instead of the N you set it all to five if everything is shorter [01:09:12] all to five if everything is shorter than like five and you save a lot of [01:09:14] than like five and you save a lot of computation all of the self-attention [01:09:16] computation all of the self-attention operations just work [01:09:18] operations just work so yeah [01:09:22] so yeah um [01:09:25] uh there's one hidden layer in the feed [01:09:26] uh there's one hidden layer in the feed forward yeah [01:09:28] forward yeah okay I should move on got a couple more [01:09:30] okay I should move on got a couple more things and not very much time okay [01:09:33] things and not very much time okay um but I'll be here after the class as [01:09:35] um but I'll be here after the class as well so in the encoder so the [01:09:37] well so in the encoder so the Transformer encoder is almost identical [01:09:39] Transformer encoder is almost identical but again we want bi-directional context [01:09:41] but again we want bi-directional context and so we just don't do the masking [01:09:44] and so we just don't do the 
masking right so I've got in my multi-head [01:09:45] right so I've got in my multi-head attention here I've got no masking and [01:09:48] attention here I've got no masking and so it's that easy to make the model [01:09:50] so it's that easy to make the model bi-directional okay [01:09:53] bi-directional okay um so that's easy so that's called the [01:09:54] um so that's easy so that's called the Transformer encoder it's almost [01:09:56] Transformer encoder it's almost identical but no masking and then [01:09:58] identical but no masking and then finally we've got the Transformer [01:09:59] finally we've got the Transformer encoder decoder which is actually how [01:10:02] encoder decoder which is actually how the Transformer was originally presented [01:10:04] the Transformer was originally presented in this paper attention is all you need [01:10:07] in this paper attention is all you need um and this is when we want to have sort [01:10:09] um and this is when we want to have sort of a bi-directional network here's the [01:10:11] of a bi-directional network here's the encoder it takes in say my source [01:10:12] encoder it takes in say my source sentence for machine translation it's [01:10:15] sentence for machine translation it's multi-headed attention is not masked and [01:10:18] multi-headed attention is not masked and I have a decoder to decode out my [01:10:21] I have a decoder to decode out my sentence now but you'll see that this is [01:10:23] sentence now but you'll see that this is slightly more complicated I have my [01:10:25] slightly more complicated I have my masked multi-head self-attention uh just [01:10:27] masked multi-head self-attention uh just like I had before in my decoder but now [01:10:30] like I had before in my decoder but now I have an extra operation which is [01:10:33] I have an extra operation which is called cross attention where I'm going [01:10:35] called cross attention where I'm going to use my decoder vectors as my queries 
[01:10:41] to use my decoder vectors as my queries then I'll take the output of the encoder [01:10:43] then I'll take the output of the encoder as my keys and values so now for every [01:10:46] as my keys and values so now for every word in the decoder I'm looking at all [01:10:49] word in the decoder I'm looking at all the possible words in the output of all [01:10:52] the possible words in the output of all of the blocks of the encoder yes [01:10:54] of the blocks of the encoder yes yeah [01:10:57] yeah longer because I know initially it was [01:10:59] longer because I know initially it was like the keys and the values how do we [01:11:01] like the keys and the values how do we get like a key in value separated from [01:11:03] get like a key in value separated from the output because then we collapse [01:11:05] the output because then we collapse those into the single output uh so we [01:11:09] those into the single output uh so we well how sorry how will we get the keys [01:11:11] well how sorry how will we get the keys and values out like how do we because [01:11:13] and values out like how do we because when we have the output didn't we [01:11:15] when we have the output didn't we collapse like the keys and values into [01:11:17] collapse like the keys and values into like a single output so the output we [01:11:20] like a single output so the output we capture those yeah the question is how [01:11:22] capture those yeah the question is how do you get the keys and values and [01:11:23] do you get the keys and values and queries out of this sort of single [01:11:25] queries out of this sort of single collapsed output now remember the output [01:11:26] collapsed output now remember the output for each word is just this weighted [01:11:28] for each word is just this weighted average of the value vectors for the for [01:11:31] average of the value vectors for the for the previous words right and then from [01:11:33] the previous words right and then from that 
output for the next layer we apply [01:11:36] that output for the next layer we apply a new key query and value transformation [01:11:38] a new key query and value transformation to each of them for the next layer of [01:11:40] to each of them for the next layer of self-attention [01:11:42] self-attention so it's not actually that you're [01:11:45] so it's not actually that you're here [01:11:47] here yeah you apply the key Matrix the query [01:11:50] yeah you apply the key Matrix the query Matrix to the output of whatever came [01:11:52] Matrix to the output of whatever came before it yeah [01:11:53] before it yeah um and so just in a little bit of math [01:11:55] um and so just in a little bit of math right we have [01:11:57] right we have um these vectors H1 through each n I'm [01:12:00] um these vectors H1 through each n I'm going to call them that are the output [01:12:01] going to call them that are the output of the encoder right and then I've got [01:12:04] of the encoder right and then I've got vectors that are the output of the [01:12:05] vectors that are the output of the decoder [01:12:07] decoder uh so I've got these Z's I'm calling the [01:12:09] uh so I've got these Z's I'm calling the output of the decoder and then I simply [01:12:11] output of the decoder and then I simply Define my keys and my values from the [01:12:16] Define my keys and my values from the encoder vectors these H's [01:12:19] encoder vectors these H's right so I take the H's I apply a key [01:12:20] right so I take the H's I apply a key Matrix and a value Matrix and then I [01:12:24] Matrix and a value Matrix and then I Define the queries from my decoder so my [01:12:26] Define the queries from my decoder so my queries here so this is why two of the [01:12:28] queries here so this is why two of the arrows come from the encoder and one of [01:12:30] arrows come from the encoder and one of the arrows comes from the decoder I've [01:12:32] the arrows comes from the decoder I've got 
[01:12:41] Okay, so that is it. I've got a couple of minutes; I want to discuss some of the results of Transformers, and I'm happy to answer more questions about Transformers after class. [01:12:53] So, you know, really the original results of Transformers: they had this big pitch, like, oh look, you can do way more computation because of parallelization, and they got great results in machine translation. [01:13:10] You had Transformers doing quite well, although not, like, astoundingly better than existing machine translation systems, but they were significantly more efficient to train, right? Because you don't have this parallelization problem, you could compute on much more data much faster, and you could make use of faster GPUs much more. [01:13:31] Um, you know, after that there were things like document generation,
where you had the old standard of sequence-to-sequence models, the LSTMs, and eventually everything became sort of Transformers all the way down. [01:13:44] Um, Transformers also enabled this revolution in pre-training, which we'll go over next class. [01:13:51] And sort of the efficiency, the parallelizability, allows you to compute on tons and tons of data, and so after a certain point, on standard large benchmarks, everything became Transformer-based. This ability to make use of lots and lots of data, lots and lots of compute, just put Transformers head and shoulders above LSTMs in, let's say, almost every modern advancement in natural language processing. [01:14:19] Um, there are many drawbacks and variants to Transformers. You know, the clearest one that people have tried to work on quite a bit is this quadratic compute problem. So this all pairs of
interactions, right, means that our total computation for each block grows quadratically with the sequence length. [01:14:36] And in a student's question we heard that, well, as the sequence length becomes long, if I want to process, you know, a whole Wikipedia article, a whole novel, that becomes quite infeasible. And actually, you know, that's a step backwards in some sense, because for recurrent neural networks it only grew linearly with the sequence length. [01:14:55] Um, other things people have tried to work on are better position representations, because the absolute index of a word is not really, you know, the best way maybe to represent its position in a sequence. [01:15:08] Um, and just to give you an intuition of quadratic sequence length: remember that we had this big matrix multiply here that resulted in this matrix of n by n, and
computing this is, you know, a big cost; it costs a lot of memory. [01:15:24] Um, and so there's been work... oh yeah, and so, you know, if you think of the model dimensionality as, like, a thousand, although today it gets much larger, then for a short sequence of n roughly 30, if you're computing n squared times d, 30 isn't so bad. But if you had something like 50,000, then n squared becomes huge and sort of totally infeasible. So people have tried to map things down to a lower-dimensional space to get rid of the quadratic computation. [01:15:53] But in practice, I mean, as people have gone to things like GPT-3 and ChatGPT, most of the computation doesn't show up in the self-attention, so people are wondering, is it even necessary to get rid of the attention operation's quadratic constraint? It's an open area of research whether this is really necessary.
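A quick back-of-the-envelope check of those numbers: the cost of forming the n-by-n score matrix is about n² times d multiply-adds (constants, the softmax, and the value product are ignored here):

```python
# Rough multiply-add count for forming the n x n attention score matrix
# QK^T: about n^2 * d, using the lecture's example numbers.
d = 1000                      # model dimensionality, as in the lecture
for n in (30, 50_000):        # a short sequence vs. a very long one
    print(f"n = {n:>6}: n^2 * d = {n * n * d:.1e}")
# n =     30: n^2 * d = 9.0e+05
# n =  50000: n^2 * d = 2.5e+12
```

So going from a 30-token sentence to a 50,000-token document multiplies the attention cost by nearly three million, which is the intuition behind the "totally infeasible" comment above.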
And then finally, there have been a ton of modifications to the Transformer over the last, you know, four or five-ish years, and it turns out that the original Transformer, plus maybe a couple of modifications, is pretty much the best thing there is, still. [01:16:31] Um, there have been a couple of things that end up being important: changing out the non-linearities and the feed-forward network ends up being important. But it's had lasting power so far, and so I think it's right for people to come through and think about how to improve it in various ways. [01:16:48] So, um, pre-training is on Tuesday. Good luck on assignment four, and then, yeah, we'll have the project proposal documents out tonight for you to talk about.

================================================================================ LECTURE 009 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9
- Pretraining
Source: https://www.youtube.com/watch?v=DGfCRXuNA2w
---
Transcript

[00:00:05] Hello, welcome to CS224N. Today we'll be talking about pre-training, which is another exciting topic on the road to modern natural language processing. [00:00:21] Um, okay, how is everyone doing? Thumbs up, thumbs sideways, thumbs down? Wow, no response bias there, all thumbs up. Oh, sorry, nice, I like that honesty, that's good. [00:00:34] Well, um, okay, so we're now, what is this, week five? Yes, it's week five, and we have a couple... so this lecture, the Transformers lecture, and then to a lesser extent Thursday's lecture on natural language generation, will be sort of the sum of lectures for the assignments you have to do. So assignment five is coming out on Thursday, and the topics covered in this lecture, the, you know, self-attention and Transformers, and again a little bit of natural language generation,
will be tested in assignment five, and then the rest of the course will go through some really fascinating topics in sort of modern natural language processing that should be useful for your final projects and future jobs and interviews and intellectual curiosity. [00:01:25] Um, but, you know, I think that today's lecture is significantly less technical in detail than last Thursday's on self-attention and Transformers, but it should give you an idea of this sort of world of pre-training and how it helps define natural language processing today. [00:01:46] Um, so a reminder about assignment five: your project proposals also are due next Tuesday. Please do get those in, try to get them in on time, so that we can give you prompt feedback about your project proposals. [00:02:01] Um, and yeah, so let's jump into it. [00:02:06] Okay, so what we're going to start with today is, um, a bit of a
technical detail on word structure, and sort of how we model the input sequence of words that we get. [00:02:19] So, um, when we were teaching word2vec, and sort of all the methods that we've talked about so far, we assumed a finite vocabulary, right? So we had a vocabulary V that you define via... whatever, you've looked at some data, you've decided what the words are in that data. And so, you know, you have some words, like "hat" and "learn", and you have this embedding; it's in red because you've learned it properly. Actually, let's replace "hat" and "learn" with "pizza" and "tasty", those are better. [00:02:50] Um, and so that's all well and good: you see these words in your model, and you have an embedding that's been learned on your data, to sort of know what to do when you see those words. But when you see some sort of variations, maybe
you see, like, "taaaaasty", and maybe a typo, like "laern", um, or maybe novel items, where it's a word that you as a human can understand as sort of a combination; this is called derivational morphology. Like this word "Transformerify", which means, you know, take this noun and give me back a verb that means to make more like that noun: to "Transformerify" NLP might mean to, you know, make NLP more like using Transformers, and such. [00:03:38] Um, and for each of these, right, this maybe didn't show up in your training corpus, and language is always doing this, right? People are always coming up with new words, and there's new domains, and, you know, young people are always making new words, it's great. And so it's a problem for your model, though, right? Because you've defined this finite vocabulary, and there's sort of no mapping
in that vocabulary for each of these things, even though their meanings should be relatively well defined based on the data you've seen so far; it's just that the string of characters that defines them isn't quite what you've seen. [00:04:13] And so what do you do? Well, maybe you map them to this sort of universal unknown token; this is "UNK", right? So it's like, oh, I see something, I don't know what it is, I've never seen it before, I'm going to say it's always represented by the same token, UNK. [00:04:26] Um, and so that's been done in the past, and that's sort of bad, right? Because it's totally, like, losing tons of information, um, but you know, you need to map it to something. [00:04:38] And so this is like a clear problem. I mean, it's a problem in English; in many of the world's languages it's a substantially larger problem, right? So, um, you know, English has relatively simple word structure:
there's a couple of conjugations for each verb, like, you know, eat, eats, eaten, ate. [00:05:00] Um, but in a language with much more complex morphology, or word structure, you'll have a considerably more complex set of things that you could see in the world. So here is a conjugation table for a Swahili verb, and it has over 300 conjugations. [00:05:20] And if I define the vocabulary so that every unique string of characters maps to its own word, then every one of the 300 conjugations would get an independent vector under my model, which makes no sense, because the 300 conjugations obviously have a lot in common, and differ by sort of meaningful extents. So you don't want to do this: I'd have to have a huge vocabulary if I wanted all conjugations to show up, and that's a mistake for efficiency reasons and for learning reasons. [00:05:51] Any questions so far? Cool.
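To make the failure mode concrete, here is a tiny sketch of a finite-vocabulary lookup with a UNK fallback (the vocabulary and words here are made up for illustration): every unseen form, whether a rare conjugation or a typo, collapses to the same index.

```python
# Toy finite-vocabulary lookup: anything not seen in training collapses
# to a single UNK index, losing all information about the word.
vocab = {"<unk>": 0, "pizza": 1, "tasty": 2, "eat": 3, "eats": 4}

def to_ids(tokens):
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(to_ids(["pizza", "eats"]))        # [1, 4]
print(to_ids(["taaaaasty", "laern"]))   # [0, 0] -- both unseen words look identical
```

With whole-word vocabularies, each of the 300 Swahili conjugations would either need its own entry in `vocab` or fall through to index 0.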
Okay. [00:05:56] Um, and so what we end up doing is we'll look at subword structure, subword modeling. So what we're going to do is we're going to say: I'm not going to even try to define what the set of all words is; I'm going to define my vocabulary to include parts of words. [00:06:18] Where am I... oh, right. So, um, I'm going to split words into sequences of known subwords, and there's a simple sort of algorithm for this where you start with all characters, right? So if I only had a vocabulary of all characters, and maybe, like, an end-of-word symbol, then for a finite data set, no matter what word I saw in the future, as long as I had seen all possible characters, I could take the word and say: I don't know what this word is, I'm going to split it into, like, all of its individual characters. So you won't have this UNK problem; you
can sort of represent any word. [00:06:58] And then you're going to find common adjacent characters and say, okay, "a" and "b" co-occur next to each other quite a bit, so I'm going to add a new word to my vocabulary: now it's all characters plus this new word "ab", which is a subword. [00:07:13] And likewise, so now I'm going to replace the character pair with the new subword, and repeat, until you add a lot, a lot, a lot of vocabulary items through this process of what things tend to co-occur next to each other. And so what you'll end up with is a vocabulary of very commonly co-occurring substrings, by which you can build up words. [00:07:33] And this was originally developed for machine translation, but then has been used considerably in pretty much all modern language models. So now we have "hat" and "learn": in our subword vocabulary, "hat" and "learn" showed up enough that they're their own individual words.
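The merge procedure just described is essentially byte-pair encoding, and it can be sketched compactly. This is a toy version (the corpus and the "</w>" end-of-word symbol are my conventions for illustration), not the exact algorithm of any particular tokenizer:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Tiny byte-pair-encoding sketch: repeatedly merge the most
    frequent adjacent symbol pair into a new vocabulary item."""
    # Each word starts as a list of single characters plus an end-of-word mark.
    corpus = [list(w) + ["</w>"] for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged symbol.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_bpe(["hat", "hat", "hats", "learn", "learn"], num_merges=4)
print(merges)
```

At tokenization time you then apply the learned merges (roughly, greedily matching the longest known subword), which is how a frequent word ends up as one token while a rare variant gets split into pieces.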
That's sort of good, right? Simple common words show up as a word in your vocabulary, just like you'd like them to. [00:07:57] But now "taaaaasty" maybe gets split into "taa", and then, you know, in some cases this hash-hash means, like, don't add a space next, right? So "taa##", and then "aaa##", and then "sty". So I've actually taken one sort of thing that seems like a word, and in my vocabulary it's now split into three subword tokens. [00:08:19] So when I pass this to my Transformer, or to my recurrent neural network, right, the recurrent neural network would take "taa" as just a single element, do the RNN update, and then take "aaa", do the RNN update, and then "sty". So it could learn to process constructions like this, and maybe I can even add more "aaa"s in the middle, right, and have it do something similar, instead of just seeing the entire word "taaaaasty" and not knowing
what it means. [00:08:51] Is that feedback? Yeah. How loud is that feedback? We good? Okay, I think we're fixed, great. [00:09:05] Um, and so, same with "Transformerify": maybe "Transformer" is its own word, and then "ify". And so you can see that you have sort of three learned embeddings instead of one sort of useless UNK embedding. This is just wildly useful, and variants of this algorithm are used pretty much everywhere in, like, modern NLP. [00:09:26] Questions? Yes: if we have three embeddings for "taaaaasty", do we just add them together? [00:09:33] So, the question is: if we have three embeddings for "taaaaasty", do we just add them together? So when we're actually processing the sequence, I'd see something like "I learned about the taa## aaa## sty", so they'd actually be totally separate tokens. But if I wanted to then say, what's my representation of this thing,
uh, it depends on what you want to do. Sometimes you average the contextual representations of the three, or look at the last one maybe; at that point it's unclear what to do, but everything sort of works okay. [00:10:11] Next question: how do you know where to split? Yeah, so, um, you know where to split based on the algorithm that I specified earlier for learning the vocabulary. So you've learned this vocabulary by just combining commonly co-occurring adjacent strings of letters, right? So, like, "a" and "b" co-occurred a lot, so now I've got a new word that's "ab". [00:10:33] Um, and then when I'm actually walking through and tokenizing, I try to split as little as possible, so I split words into the maximal subword, the one that takes up the most characters; there are algorithms for this. Yeah, so, like, I'm like, okay, if I want to split this up, you know, there's many ways I
could split it up, and you try to find some approximation of, like, what the best way to split it into the fewest words is. Yeah. [00:11:00] The question is: do people make use of punctuation in the character set? How do people do it? Yes, absolutely. So, you know, sort of from this point on, just assume that what text is given to these models is as unprocessed as possible. You take it, you try to make it sort of clean-looking text, where you've removed, you know, HTML tags maybe, if it's from the internet, or whatever. Um, but then beyond that, you process it as little as possible, so that it reflects as well as possible what people might actually be using this for. [00:11:35] Um, so maybe earlier in the course, when we were looking at word2vec, we might have thought about, oh, we don't want word vectors of punctuation or something like that. Um, now everything is just as close as possible to what the text you'd get with
So yes — in practice punctuation is in there, and "..." might be its own word, and maybe a sequence of hyphens too, because people make big bars across tables.

[00:12:11] Does the system treat words that are really themselves a whole word any differently from words that are just pieces of a word? No — the system has no idea. They're all just indices into your embedding vocabulary matrix, so they're all treated equally.

[00:12:41] What about really long words that are relatively common — if you're building up from single characters all the way, what happens then? Yeah, the question is what happens to very long words if you're building up character pairs and larger chunks. In practice the statistics speak really well for themselves: if a long word is very common, it will end up in the vocabulary, and if it's not very common, it won't. There are other algorithms that do slightly better in various ways, but the intuition — that you figure out what the common co-occurring substrings are, almost independent of length — is the right one to have. You can actually look at the learned vocabularies of a lot of these models, and you do see some long words, just because they showed up a lot.

[00:13:36] I'm curious how it weighs frequency: say there's "if" and "ify" — or "goodbye" on your next slide. "if" could be really common, so how does it weigh the frequency of a subword against its length? It tries to split into the smallest number of pieces, but what if splitting into three pieces meant one of them was super common? Yeah — so the question is, if "Transformer" is a subword in my vocabulary, and "ify" as a three-letter tuple is also a subword, how does it choose between taking the long, maybe less common one and splitting into more subwords? It's just a choice: we take the smallest number of subwords, because sequence length tends to be more of a bottleneck than having a bunch of very common, very short subwords — sequence length is a big problem in Transformers — and this seems to be what works. That said, splitting things into multiple candidate segmentations and running the Transformer on all of them to see which one works better is something people have done.
But yeah, having fewer, bigger subwords tends to be the best idea. I'm going to start moving on, though — feel free to ask me more questions about this afterward.

[00:14:56] Okay, so let's talk about pre-training in the context of the course so far. At the very beginning of the course we gave you the quote "you shall know a word by the company it keeps." That was the thesis of the distributional hypothesis — the meaning of a word is defined by, or at least reflected by, the words it tends to co-occur with — and we implemented it via word2vec. The same person who made that quote, J. R. Firth, had a separate, earlier quote that continues this notion of meaning as defined by context, which has something
along the lines of this: since the word shows up in context when we actually use it — when we speak to each other — the meaning of the word should be defined by the contexts it actually shows up in. "The complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." So the big difference here: at word2vec training time, if I have the word "record" — r-e-c-o-r-d — I get one vector (or two, but effectively one vector for the string "record"), and it has to learn, from the contexts it shows up in, that sometimes it means "record" the verb and sometimes "record" the noun. But I only have one vector to represent it, so when I use the word embedding of "record," it has this mixture meaning of both of its senses — it doesn't get to specialize and say, oh, this part means the verb and this part means the noun.
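To make the "record" point concrete, here is a toy illustration — made-up two-dimensional vectors, and a crude neighbor-averaging function standing in for a real contextual encoder like a Transformer — of why a single static vector can't separate the two senses but even a simple context-dependent encoding can:

```python
# Toy static embedding table: one vector per word type,
# shared by all occurrences (word2vec-style).
static_emb = {
    "i":      [0.1, 0.0],
    "record": [0.5, 0.5],   # one vector for both the verb and noun senses
    "the":    [0.0, 0.1],
    "play":   [0.9, 0.2],
}

def contextual(tokens):
    """Crude 'contextual' encoding: average each word's vector with its neighbors'."""
    out = []
    for i, _ in enumerate(tokens):
        window = tokens[max(0, i - 1): i + 2]           # the word plus its neighbors
        vecs = [static_emb[w] for w in window]
        out.append([sum(c) / len(vecs) for c in zip(*vecs)])
    return out

verb_ctx = contextual(["i", "record", "the", "play"])[1]   # "record" as a verb
noun_ctx = contextual(["play", "the", "record"])[2]        # "record" as a noun

# Static lookup is identical in both sentences; the contextual vectors differ.
print(static_emb["record"], verb_ctx, noun_ctx)
```

The averaging here is only a stand-in: the actual point is that any encoder whose output depends on the surrounding words can give the two occurrences of "record" different representations, while the static table cannot.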
And so word2vec is just going to fail at that. I can build better representations of language through contextual representations, using things like the recurrent neural networks or Transformers we used before to build up contextual meaning.

[00:17:02] So what we had before were pre-trained word embeddings, with a big box on top — a Transformer or an LSTM — that was not pre-trained. You learn your word embeddings via context; then you have a task like sentiment analysis, machine translation, or parsing; you initialize all the parameters of that box randomly; and you train it to predict your label. The big difference in today's work is that we're going to try to pre-train all the parameters.
So I have my big Transformer, and instead of just pre-training my word embeddings with word2vec, I'm going to train all of the parameters of the network, trying to teach it much more about language that I can use in my downstream tasks. Now the labeled data I have for, say, machine translation might be able to be smaller — I might not need as much of it — because I've already trained much more of the network than I would have if I'd only gotten word2vec embeddings.

[00:18:15] Okay, so here I've pre-trained this entire structure — the word embeddings and the Transformer on top — everything trained via methods we'll talk about today. What does this give you? First, very strong representations of language: the meaning of "record" the verb and "record" the noun will be different in these contextual representations, which know where in the sequence the word is and what words co-occur with it in the specific input, unlike word2vec, which has one representation for "record" independent of where it shows up. Second, strong parameter initializations for NLP models. In all of your homeworks so far you've built a natural language processing system more or less from scratch — how do I initialize this weight matrix? — and we always said: small, normally distributed noise, little values close to zero. Here we're going to say: just as we used the word2vec embeddings because they encoded structure, I'm going to start, say, my machine translation system from a parameter initialization that's given to me via pre-training. Third, it's going to give us probability distributions over language that we can use to generate text and otherwise — we'll talk about this.

[00:19:35] Okay, so whole models are going to be pre-trained. All of pre-training is effectively centered on this idea of reconstructing the input. You have an input — a sequence of text that some human generated — and the hypothesis is that by masking out part of it and tasking a neural network with reconstructing the original input, the network has to learn a lot about language, and about the world, in order to do a good job of the reconstruction. This is now a supervised learning problem, just like machine translation. I've taken this sentence that just existed — "Stanford University is located in, say, Palo Alto, California"
— or Stanford, California, I guess. By removing part of the sentence, I've made a label for myself: the input is the broken, masked sentence, and the label is "Palo Alto" (or "Stanford").

[00:20:39] If I give this example to a network and ask it to predict the masked word, then as it takes its gradient step on this input it's going to encode information about the co-occurrence between the context — "Stanford University is located in" — and "Palo Alto." So by tasking it with this, it might learn, say, where Stanford is. What else might it learn? Things about syntax: "I put ____ fork down on the table." Only a certain set of words can go there — "I put the fork down on the table," "I put a fork down on the table" — these are syntactic constraints. So the context shows me what kinds of words can appear in what kinds of contexts.
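The label-making step described above — raw text becomes a supervised (input, label) pair by removing one token — can be sketched in a few lines. This is my own minimal sketch; the `[MASK]` token name is an assumption borrowed from BERT-style models, not something specified in the lecture:

```python
# Turn a raw sentence into a self-supervised (input, label) pair
# by masking out the word at one position.
def make_masked_example(sentence, position, mask_token="[MASK]"):
    tokens = sentence.split()
    label = tokens[position]                      # the removed word is the label
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return " ".join(masked), label

inp, label = make_masked_example(
    "Stanford University is located in Stanford California", 5)
print(inp)    # → Stanford University is located in [MASK] California
print(label)  # → Stanford
```

No human annotation was needed: the pair exists because the sentence exists, which is what makes this recipe scale to enormous corpora.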
"The woman walked across the street, checking for traffic over ____ shoulder." Any ideas on what could go here? Right — "her." This is coreference between an entity being discussed in the world — this woman — and her shoulder. In linguistic terms, the word "her" here is a coreferent of "woman": it refers to the same entity in the discourse. So the network might be able to learn things about which entities are doing what, and where.

[00:21:56] It can learn things about semantics. If I have "I went to the ocean to see the fish, turtles, seals, and ____," then the word in the blank should be a member of the class that I, the person writing this sentence, am thinking of — the stuff I see when I go to the ocean and see these other things as well.
So in order to do this prediction task, maybe I learn about the semantics of aquatic creatures.

[00:22:22] Okay, what else could I learn? "Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____." What kind of task could I be learning from this prediction problem? Sentiment — exactly. This is just naturalistic text that someone wrote, but by saying "the movie was bad," I'm learning about the latent sentiment of the person who wrote it — what they were feeling about the movie at the time. So maybe if I see a new review later on, I can just paste in the review, say "the movie was ____," and if the model generates "bad" or "good," that could be implicitly solving the task of sentiment analysis.

[00:23:13] Here's another one: "Iroh went to the kitchen to make some tea."
make some tea standing next [00:23:17] kitchen to make some tea standing next to Ira Zuko pondered his Destiny Zuko [00:23:20] to Ira Zuko pondered his Destiny Zuko left the blank [00:23:22] left the blank okay so in this scenario we've got a [00:23:25] okay so in this scenario we've got a world implicitly that's been designed by [00:23:27] world implicitly that's been designed by the person who is creating this text [00:23:30] the person who is creating this text right I've got physical locations in the [00:23:32] right I've got physical locations in the discourse like the kitchen uh and I've [00:23:35] discourse like the kitchen uh and I've got Zuko uh we've got iros in the [00:23:38] got Zuko uh we've got iros in the kitchen Zuko's next to iro [00:23:40] kitchen Zuko's next to iro so Zuko must be in the kitchen [00:23:43] so Zuko must be in the kitchen so what could Zuko leave but the kitchen [00:23:46] so what could Zuko leave but the kitchen right and so in terms of you know latent [00:23:49] right and so in terms of you know latent Notions of embodiment and physical [00:23:50] Notions of embodiment and physical location the way that people talk about [00:23:53] location the way that people talk about people you know being next to something [00:23:54] people you know being next to something and then leaving something could tell [00:23:56] and then leaving something could tell you uh stuff about sort of yeah a little [00:24:00] you uh stuff about sort of yeah a little bit about how the world works even [00:24:04] so here's the secret sequence I was [00:24:06] so here's the secret sequence I was thinking about the sequence that goes [00:24:07] thinking about the sequence that goes one one two three five eight thirteen [00:24:09] one one two three five eight thirteen twenty one uh blank [00:24:12] twenty one uh blank and [00:24:13] and um you know this is a pretty tough one [00:24:16] um you know this is a pretty tough one right [00:24:17] right this is the 
Could a model, by looking at a bunch of numbers from the Fibonacci sequence, learn in general to predict the next one? That's a question you should be thinking about throughout the lecture.

[00:24:31] Okay — any questions on these examples of what you might learn from predicting the context? Okay, cool.

[00:24:45] So, a very simple way to think about pre-training: pre-training is language modeling. We saw language modeling earlier in the course, and now, instead of using my language model just to provide probabilities over the next word, I'm going to train it on that task. I'm going to actually model the distribution p_θ(w_t | w_1, ..., w_(t-1)) — the probability of word t given all the previous words. And there's a ton of data for this — an amazing amount in a lot of languages, especially English. (There's very little data for this in most of the world's languages, actually, which is a separate problem.)
But you can pre-train just through language modeling. I'm going to do the teacher-forcing thing: I have "Iroh," I predict "goes"; I have "Iroh goes," I predict "to." I train my LSTM or my Transformer to do this task, and then I just keep all the weights — I save all the network parameters.

[00:25:44] Then, once I have these parameters, instead of generating from my language model, I use them as an initialization for my parameters. So I have this pre-training / fine-tuning paradigm — two steps. Most of you — well, maybe not this year; let's say a large portion of you this year — will, in your final projects, be doing the pre-training / fine-tuning paradigm, where someone has done the pre-training for you.
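The teacher-forcing setup above ("Iroh" → "goes", "Iroh goes" → "to") amounts to turning one raw sentence into a batch of (prefix, next-word) training pairs. A minimal sketch (my own helper, not lecture code):

```python
# Build (prefix, next-word) training pairs for language-model pretraining:
# every prefix of the sentence is an input, the following word its label.
def next_token_pairs(sentence):
    tokens = sentence.split()
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for prefix, target in next_token_pairs("Iroh goes to make tea"):
    print(" ".join(prefix), "->", target)
# → Iroh -> goes
#   Iroh goes -> to
#   Iroh goes to -> make
#   Iroh goes to make -> tea
```

This is why no annotation is needed: a sentence of n words yields n-1 supervised examples just by existing.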
So in step one you have a ton of text, and you learn very general things about the distribution of words, and the latent things that distribution tells you about the world and about language. Then in step two you've got some task — maybe sentiment analysis — and maybe not very many labels, just a little bit of labeled data. You adapt the pre-trained model to the task you care about by taking further gradient steps on that task: you give it "the movie was ____," you predict "happy" or "sad," and you continue updating the parameters, starting from the initialization from pre-training.

[00:26:46] And this just works exceptionally well — unbelievably well — compared to training from scratch. Intuitively, that's because you've taken a lot of the burden of learning about language and about the world off of the data you've labeled for sentiment analysis, and handed that very general learning problem to the much more general task of language modeling.
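The "continue gradient steps from the pre-trained parameters" idea can be shown on a deliberately tiny numeric toy — not an NLP model, and the particular numbers (`w_pretrained = 1.8`, the three data points) are made up purely for illustration:

```python
# Toy picture of pretrain-then-finetune: fine-tuning is ordinary gradient
# descent, just started from pretrained parameters instead of from scratch.
def mse_loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def finetune(w_init, data, lr=0.1, steps=20):
    """Gradient descent on the one-parameter model y ≈ w * x, from w_init."""
    w = w_init
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

task_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # small labeled dataset, y ≈ 2x

w_pretrained = 1.8   # assumption: pretraining already left us near a good solution
w_scratch = 0.0      # random-ish initialization

# After a single fine-tuning step, the pretrained start already fits better.
assert mse_loss(finetune(w_pretrained, task_data, steps=1), task_data) < \
       mse_loss(finetune(w_scratch, task_data, steps=1), task_data)
```

Both starts eventually converge here; the point of the sketch is only that a good initialization means the small labeled dataset has much less work to do, which is the intuition given above.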
[00:27:08] You said we didn't have much data in other languages — what do you mean by that? Is it just text in that language, labeled in some way? So the question is: we have a lot of data in English but not in other languages — what is the data we don't have a lot of? It's literally just text, no annotations, because you don't need annotations to do language-model pre-training. The existence of a sequence of words that someone has written provides you with all these input–output pairs: input "Iroh," output "goes"; input "Iroh goes," output "to." Those are all labels, in a sense, that you've constructed from the input just existing. But
in most languages, even on the entire internet — I mean, there are about 7,000-ish languages on Earth — most of them don't have the billions of words that you might want to train these systems on. [00:28:07] [Student] If you pre-train the entire thing, do you still only have one vector representation per word? [00:28:11] The question is: if you're pre-training the entire thing, do you still learn one vector representation per word? You learn one vector representation that is the non-contextual input vector. So you have your vocabulary — you've got your embedding matrix, which is vocabulary size by model dimensionality — and so, yeah, "Iroh" has one vector, "goes" has one vector. But then the Transformer that you're learning on top of it takes in the sequence so far and gives a vector to each of them that's dependent on the context. In that case, though, at the input you only have one embedding per word.
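As an aside on the embedding matrix just described — vocabulary size by model dimensionality, one non-contextual vector per word type — here is a minimal sketch. The vocabulary, dimensionality, and values are made up purely for illustration:

```python
# A (vocab_size x d) embedding matrix: one non-contextual vector per
# word type, regardless of context. Toy numbers for illustration.
vocab = {"iroh": 0, "goes": 1, "to": 2, "the": 3, "kitchen": 4}
d = 4  # model dimensionality
embedding_matrix = [[0.01 * (i + 1) * (j + 1) for j in range(d)]
                    for i in range(len(vocab))]

def embed(word):
    """Look up the single input vector for a word type."""
    return embedding_matrix[vocab[word]]

# "iroh" gets the same input vector wherever it appears; only the
# Transformer layers on top produce context-dependent vectors.
print(len(embed("goes")))  # 4
```

The Transformer layers on top of this lookup are what turn these fixed per-type vectors into context-dependent ones.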
[00:28:48] [Student] Yeah — so what sort of metric would you use to evaluate it? It's supposed to be general, right? But there are application-specific metrics — which one do you use? [00:28:57] Yeah, so the question is: what metric do you use to evaluate pre-trained models, since they're supposed to be so general, while there are lots of very specific evaluations you could use? We'll get into a lot of that in the rest of the lecture. While you're training, you can use simple metrics that correlate with what you want but aren't actually what you want — just the probability quality, right? So you can evaluate the perplexity of your language model, just as you would have when you cared about language modeling, and it turns out to be the case that better perplexity correlates with all the stuff that's much harder to evaluate — lots and lots of different
tasks. But also, the natural language processing community has built very large benchmark suites of varying tasks to try to get at some notion of generality — although that's very, very difficult, even ill-defined — and so when you develop new pre-training methods, what you often do is pick a whole bunch of evaluations and show that you do better on all of them, and that's your argument for generality. [00:29:56] Okay. So why should this pre-training/fine-tuning two-part paradigm help? This is still an open area of research, but the intuitions are all you're going to take from this course. So, right: pre-training provides some starting parameters θ̂ — this is all the parameters in your network — from trying to take this minimum over all possible settings of your parameters of the pre-training loss, θ̂ = argmin_θ L_pretrain(θ).
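To make the perplexity metric mentioned a moment ago concrete: it is the exponentiated average negative log-probability the model assigns to the held-out tokens. A minimal sketch, with made-up token probabilities rather than a real model's outputs:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities a hypothetical language model assigned to each
# token of a held-out sentence (illustrative values).
probs = [0.2, 0.5, 0.1, 0.25]
print(round(perplexity(probs), 3))  # 4.472
```

Lower perplexity means the model found the held-out text less surprising; a uniform model over a vocabulary of size V has perplexity exactly V.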
[00:30:26] And then the fine-tuning process takes your data for fine-tuning — you've got some labels — and tries to approximate the minimum, through gradient descent, of the loss of the fine-tuning task, L_finetune(θ), but you start at θ̂: you start gradient descent at the θ̂ that your pre-training process gave you. And then, you know, if you could actually solve this min and wanted to, it sort of feels like the starting point shouldn't matter — but it really, really, really does. It really does. We'll talk a bit more about this later, but the process of gradient descent maybe sticks relatively close to θ̂ during fine-tuning: you start at θ̂ and then walk downhill with gradient descent until you hit a valley, and that valley ends up being really good.
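The two-stage recipe just described — get θ̂ by approximately minimizing a pre-training loss, then run gradient descent on the fine-tuning loss starting from θ̂ rather than from scratch — can be sketched on a toy problem. The 1-D quadratic losses here are purely illustrative stand-ins for real network losses:

```python
def gradient_descent(grad, theta, lr=0.1, steps=50):
    """Plain gradient descent on a 1-D loss."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Stand-in losses: pre-training "wants" theta near 3.0,
# the fine-tuning task "wants" theta near 3.5.
pretrain_grad = lambda t: 2 * (t - 3.0)   # d/dt of (t - 3.0)^2
finetune_grad = lambda t: 2 * (t - 3.5)   # d/dt of (t - 3.5)^2
finetune_loss = lambda t: (t - 3.5) ** 2

# Stage 1: pre-train to get theta_hat.
theta_hat = gradient_descent(pretrain_grad, theta=0.0)

# Stage 2: fine-tune for only a few steps, starting from theta_hat
# versus starting from a far-away initialization.
from_pretrained = gradient_descent(finetune_grad, theta_hat, steps=5)
from_scratch = gradient_descent(finetune_grad, 0.0, steps=5)

# Starting near theta_hat reaches a much lower fine-tuning loss
# in the same small number of steps.
print(finetune_loss(from_pretrained) < finetune_loss(from_scratch))  # True
```

With convex toy losses both runs would eventually converge; the point of the sketch is only that, for a limited budget of steps, the pre-trained starting point matters a lot.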
It's close to the pre-training parameters, which were really good for a lot of things. This is a cool place where practice and theory are sort of meeting: optimization people want to understand why this is so useful; NLP people just want to build better systems. So, yeah — maybe the stuff around θ̂ tends to generalize well. If you want to work on this kind of thing, you should talk about it. Yeah? [00:31:48] [Student] The classic result is that gradient descent sticks relatively close — but what if we were to use a different optimizer? How would that change the results? [00:31:54] The question is: if stochastic gradient descent sticks relatively close, what if we use a different optimizer? I mean, if we use any common variant of gradient descent — any first-order method like Adam, which we use in this course, or Adagrad —
they all have very, very similar properties. Other types of optimization we just tend not to use, so who knows. Ah — yeah? [00:32:22] [Student] Why does fine-tuning after pre-training work better than just fine-tuning but making the model bigger — adding more layers, more data? [00:32:30] Yeah, the question is: why does the pre-trained/fine-tuned paradigm work better than just making the model more powerful — adding more layers, adding more data — for just the fine-tuning? The simple answer is that you have orders of magnitude more data that's unlabeled — that's just text that you found — than you do carefully labeled data for the tasks that you care about, right? Because that's expensive to get: it has to be examples of your movie reviews, or whatever, that you've had someone label carefully. So on the internet you have something like at least five trillion, maybe ten trillion
words of this, and you have maybe a million words of your labeled data or whatever over here, so the scale is just way off. But there's also an intuition that learning to do a very, very simple thing like sentiment analysis is not going to get you a very generally able agent across a wide range of settings, compared to language modeling. So — it's hard to know how to put it — even if you have a lot of labeled data of movie reviews of the kind that people are writing today, maybe tomorrow they start writing slightly different kinds of movie reviews and your system doesn't perform as well. Whereas if you pre-trained on a really diverse set of text from a wide range of sources and people, it might be more adaptable to seeing stuff that doesn't quite look like the training data you showed it, even if you showed it a ton of training data. So one
of the big takeaways of pre-training is that you get this huge variety of text on the internet. You have to be very careful — I mean, yeah, you should be very careful — about what kind of text you're showing it and what kind you're not, because the internet is full of, you know, awful text as well. But some of that generality just comes from how hard this problem is and how much data you can show it. [00:34:36] [Student] With so much data, how do you then train it so that it considers the stuff that you're fine-tuning it with as more important — more salient — rather than just one in a billion articles? [00:34:50] Yeah, it's a good question. The question is: given that the amount of data on the pre-training side is orders of magnitude more than the amount on the fine-tuning side, how do you get across to the model that, okay, actually the fine-tuning
task is what I care about — like, focus on that? It's about the fact that I did the pre-training first and then I do the fine-tuning second, right? So I've gotten my parameter initialization from the pre-training, I've set it somewhere, and then I fine-tune: I move to where the parameters are doing well for this task afterward. And so, well, it might just forget a lot about how to do the pre-training task, because now I'm just asking it to do the fine-tuning task at this point. [00:35:30] Uh, I should move on, I think, but we're going to keep talking about this in much more detail, with more concrete elements. [00:35:41] Okay, so let's talk about model pre-training — oh wait, that did not advance the slides. [00:35:53] Nice. Okay, let's talk about model pre-training three ways. In our Transformers lecture Tuesday we talked about encoders, encoder-decoders, and decoders, and we'll do decoders
last, because many of the largest models being used today are all decoders, and so we'll have a bit more to say about them. [00:36:16] Right, so let's recall these three. Encoders get bidirectional context: you have a single sequence and you're able to see the whole thing, kind of like an encoder in machine translation. Encoder-decoders have one portion of the network that gets bidirectional context — that's like the source sentence of my machine translation system — and it's paired with a decoder that gets unidirectional context, so that I have this informational masking where I can't see the future, so that I can do things like language modeling: I can generate the next token of my translation, whatever. So you could think of it as: I've got my source sentence here and my partial translation here, and I'm sort of
decoding out the translation. And then decoder-only models are things like language models — we've seen a lot of this so far. There's pre-training for all three of these large classes of models, and how you pre-train them, and then how you use them, depends on the properties and proclivities of the specific architecture. So let's look at encoders first. [00:37:18] We've looked at language modeling quite a bit, but we can't do language modeling with an encoder, because encoders get bidirectional context, right? So if I'm down here at "I" and I want to predict the next word, it's a trivial task at this level here, because in the middle I was able to look at the next word — so I should just know it. There's nothing hard about learning to predict the next word here, because I could just look at it, see what it is, and then
copy it over. So when I'm training an encoder for pre-training, I have to be a little bit more clever. [00:37:57] In practice, what I do is something like this: I take the input and I modify it somewhat — I mask out words, sort of like I did in the examples I gave at the beginning of class. So "I ___ to the ___", right? And then I have the network predict. With this, I've built contextual representations, so now the vector representation of the blank sees the entire context around it here, and then I predict the word "went", and then here the word "store". [00:38:29] Any questions? [00:38:34] Okay, and you can see how this is doing something quite a bit like language modeling, but with bidirectional context: I've removed the network's information about the words that go in the blanks, and I'm training it to reconstruct that. So I only have loss terms, right —
I only ask it to actually do the prediction, compute the loss, and backpropagate the gradients for the words that I've masked out. And you can think of this as: instead of learning the probability of x, where x is a sentence or a document, this is learning the probability of x — the real document — given x̃, which is the corrupted document with some of the information missing. [00:39:14] Okay, and so maybe we get the sequence of vectors here, one per word, which is the output of my encoder, in blue. And then I'd say that for the words I want to predict, y_i, I draw them — the ∼ means the probability is proportional to a linear transformation, A h_i + b, of my representation, the last thing here. So this A h_i + b is the red portion here; then I do the prediction, and I train the entire network to do this.
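A minimal sketch of the prediction step just described — probabilities proportional to a linear transformation A h_i + b of the encoder output, with loss terms only at the masked positions. All matrices, encoder outputs, and sizes here are made-up toy values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(H, targets, masked_positions, A, b):
    """Cross-entropy summed only over masked positions:
    p(y_i) = softmax(A @ h_i + b)[y_i]."""
    loss = 0.0
    for i in masked_positions:
        h = H[i]
        logits = [sum(a * x for a, x in zip(row, h)) + b_v
                  for row, b_v in zip(A, b)]
        probs = softmax(logits)
        loss += -math.log(probs[targets[i]])
    return loss

# Toy setup: 3 positions, hidden size 2, vocab size 3.
H = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.4]]   # encoder outputs
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # vocab x hidden
b = [0.0, 0.0, 0.0]
targets = [0, 2, 1]   # true token ids at each position
masked = [1]          # loss only where we masked
print(masked_lm_loss(H, targets, masked, A, b) > 0)  # True
```

Positions outside `masked` contribute nothing to the loss, matching the point that gradients are only backpropagated for the masked-out words.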
Yes? [00:39:46] [Student] Do we just choose the words randomly, or is there something smarter you can do? [00:39:53] The question is: do we just choose words randomly to mask out, or is there a scheme? Mostly randomly — we'll talk about a slightly smarter scheme in a couple of slides, but yeah, just mostly randomly. [00:40:05] Yeah? [00:40:06] [Student] What was that last part on the bottom — the x̃, the masked version? [00:40:13] Yeah, so I'm saying that I'm defining x̃ to be the input part where I've got the masked version of the sentence, with these words missing, and then I'm defining a probability distribution: the probability of a sequence conditioned on the input being the corrupted, masked sequence. [00:40:39] Okay. [00:40:41] Um, so this brings us to a very, very popular NLP model that you need to know about. It's called BERT,
and it was the first one to popularize this masked language modeling objective. They released the weights of this pre-trained Transformer, which they pre-trained via something that looks a lot like masked language modeling, and you can download them and use them via code released by the company Hugging Face, which we have continued to bring up. Many of you will use a model like BERT in your final project, because it's such a useful builder of representations of language in context. So let's talk a little bit about the details of masked language modeling in BERT. [00:41:22] First, we take 15% of the subword tokens — remember, all of our inputs now are subword tokens; I've made them all look like words, but just like we saw at the very beginning of class, each of these tokens could be some portion, some subword — and I'm going to do a couple of
things with them. Sometimes I'm going to just mask out the word and then predict the true word. Sometimes I'm going to replace the word with a random sample of another word from my vocabulary and predict the real word that was supposed to go there. And sometimes I'm going to not change the word at all and still predict it. [00:42:05] The intuition is the following: if I just had to build good representations, in the middle of this network, for words that are masked out, then when I actually use the model at test time on some real review, to do sentiment analysis, there are never going to be any mask tokens — so maybe the model won't do a very good job, because it's like, oh, I have no job to do here, I only need to deal with the mask tokens.
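The three cases just listed can be sketched as follows. In the BERT paper, 15% of tokens are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged; the sentence and vocabulary below are toy stand-ins:

```python
import random

def corrupt_for_mlm(tokens, vocab, select_rate=0.15, rng=None):
    """Return (corrupted_tokens, prediction_positions).
    Of the selected positions: 80% -> [MASK], 10% -> a random
    token, 10% -> unchanged. The model must predict the true
    token at every selected position."""
    rng = rng or random.Random(0)
    corrupted, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < select_rate:
            positions.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged but still predict it
    return corrupted, positions

vocab = ["i", "went", "to", "the", "store", "kitchen"]
sentence = ["i", "went", "to", "the", "store"] * 20
corrupted, positions = corrupt_for_mlm(sentence, vocab)
print(len(positions) / len(sentence))  # roughly 0.15
```

Note that the loss is computed at all selected positions, including the 10% left unchanged — which is exactly what forces the model to build good representations of every word, not just the masked ones.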
By giving it sequences of words where sometimes the real word needs to be predicted and sometimes it has to detect whether the word is wrong, the idea is that now, when I give it a sentence that doesn't have any masks, it actually does a good job of representing all the words in context, because it could be asked to predict anything at any time. [00:42:58] Okay. So the folks at Google who defined this had a separate, additional task that is sort of interesting to think about. This was the BERT model from their paper: they had position embeddings, just like we saw in our Transformers lecture, and token embeddings, just like in the Transformers lecture, but then they also had this thing called a segment embedding, with two possible segments, segment A and segment B. And they had this additional task where they would get a big chunk of text for
segment A and a big chunk of text for segment B, and then they would ask the model: is segment B a real continuation of segment A — the text that actually came next — or did I just pick this big segment randomly from somewhere else? The idea is that this should teach the network some notion of long-distance coherence: the connection between a bunch of text over here and a bunch of text over there. [00:44:00] It turns out it's not really necessary, but it's an interesting idea, and similar things have continued to have some influence since then. Again, though, you should take away the intuition that we're trying to come up with hard problems for the network to solve, such that by solving them it has to learn a lot about language — and we're defining those problems by making simple transformations to, or removing information from, text that just happens to occur.
[00:44:29] Questions? [00:44:32] [Student:] For the plus signs, do we concatenate the vectors, or do we do element-wise addition? — We do element-wise addition. You could have concatenated them; however, one of the big conventions of all these networks is that you always have exactly the same number of dimensions everywhere, at every layer of the network. It just makes everything very simple, so saying everything is the same dimension and doing addition ends up being simpler. [00:45:07] [Student:] Why was the next-sentence prediction not necessary? — Well, one thing it does that's a negative is that the effective context length for a lot of your examples is halved.
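The element-wise addition of token, position, and segment embeddings discussed above can be sketched like this (the dimensions, vocabulary size, and random initialization are illustrative, not BERT's real values):

```python
import numpy as np

def bert_style_input(token_ids, segment_ids, d=8, vocab_size=100, max_len=32, seed=0):
    """Input-layer sketch: token, position, and segment embeddings all
    share the same dimension d and are summed element-wise (not
    concatenated), so every layer of the network sees d-dim vectors."""
    rng = np.random.default_rng(seed)
    tok_emb = rng.normal(size=(vocab_size, d))  # one row per vocabulary word
    pos_emb = rng.normal(size=(max_len, d))     # one row per position
    seg_emb = rng.normal(size=(2, d))           # segment A vs. segment B
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
```

Because everything shares dimension d, the sum is well defined and the network never has to handle mixed widths.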
[00:45:28] So one of the things that's useful about pre-training, seemingly, is that you get to build representations of very long sequences of text. This example is very short, but in practice segment A was going to be something like 250 words and segment B another 250, and in the paper that let us know this wasn't necessary, they always had a single long segment of 500 words. It seemed to be useful to always have this very long context, because longer contexts give you more information about the role each word is playing in that specific context. If I see one word on its own — say, "record" — it's hard to know what it's supposed to mean; but if I see a thousand words around it, it's much clearer what its role in that context is. So yes, it cuts the effective context size — that's one answer.
[00:46:19] Another thing is that this is actually a much more difficult task. There's a much more recent paper — I don't have it in the slides, but I'll give the link later — showing that these models are really, really bad at the next-sentence prediction task. So it could be that it was just too hard at the time, and it wasn't useful because the model was failing to do it at all. [00:46:44] [Student:] Why do we need next-sentence prediction at all? What about just masking and predicting? — Right, so the question is: why not just do the masking we saw before? That is indeed the thing you seem not to need. But as a matter of the history of the research, it was thought that this was useful, and the idea is that it required you to develop this pairwise notion: do these two segments of text interact, how do they interact, are they related?
It's a longer-distance notion, and many NLP tasks are defined on pairs of things, so they thought it might be useful. They published it with this, and then someone else came along, published a new model that didn't do it, and that model did better. So there are intuitions as to why it could work; it just didn't. [00:47:39] [Student:] Was it doing both? — Yes, BERT was doing both: this next-sentence prediction training as well as the masking training, all at the same time. And so you had to have a separate predictor head on top of BERT — a separate classification piece. One detail there is that there's a special word, CLS, at the beginning of every sequence in BERT, and you can define a predictor on top of that sort of fake word's embedding that says whether the next sentence is real or not.
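A sketch of that CLS-based predictor head (the shapes are illustrative, and the real BERT head also passes the [CLS] vector through a small pooling layer first):

```python
import numpy as np

def nsp_logits(hidden_states, W, b):
    """Next-sentence-prediction head sketch: take the contextual vector
    of the special [CLS] token (position 0 of every sequence) and map
    it with a linear layer to two logits: real continuation vs. random."""
    h_cls = hidden_states[0]        # (d,) vector for the [CLS] token
    return W @ h_cls + b            # (2,) logits

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(5, d))    # encoder output for a 5-token sequence
logits = nsp_logits(hidden, rng.normal(size=(2, d)), np.zeros(2))
```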
[00:48:17] Yeah — okay, I'm going to move on. [00:48:20] So this gets at the question we had earlier about how you evaluate these things. There are a lot of different NLP tasks out there, and when people were writing these papers, they would look at a ton of different evaluations that had been compiled as a set of things that were still hard for the systems of the day. Are you detecting paraphrases between questions — are two Quora questions actually the same question? That turns out to be hard. Can you do sentiment analysis on a hard dataset? Can you tell whether sentences are linguistically acceptable — grammatical or not? Are two sequences semantically similar — do they mean vaguely the same thing? And we'll talk a bit about natural language inference
later, but that's the task of deciding entailment: if I say "I saw the dog," that does not necessarily mean "I saw the little dog"; but saying "I saw the little dog" does mean "I saw the dog." That's the natural language inference task. [00:49:21] And the difference between the pre-pre-training days — this row here, before you had substantial amounts of pre-training — and BERT was striking; the field was taken aback in a way that's hard to describe. You had very carefully crafted architectures for each individual task, where everyone was designing their own neural network, doing things they thought were clever in how they defined all the connections and the weights for their task, so everyone was doing a different thing for each one of these tasks, roughly. All of that
was blown out of the water by: just build a big Transformer, teach it to predict the missing words a whole bunch, and then fine-tune it on each of these tasks. [00:50:06] This was just a sea change in the field — people were, I mean, amazed. It's a little less flashy than ChatGPT, I'll admit, but it's really part of the story that gets us there. [00:50:20] Okay — questions? [00:50:24] [Student:] During the pre-training stage, the encoder outputs some sort of hidden values; how do we connect those to the words we're trying to predict? — So the question is: the encoder output is a bunch of hidden values; how do we actually connect those values to the things we want to predict? I'm going to go on to the next slide to bring up this example. The encoder gives us, for each
[00:50:58] right so the encoder gives us for each input word token a vector of that token [00:51:02] input word token a vector of that token that represents the token in context and [00:51:04] that represents the token in context and the question is you know how do we get [00:51:06] the question is you know how do we get these representations and and turn them [00:51:08] these representations and and turn them into uh sort of answers for the tasks [00:51:11] into uh sort of answers for the tasks that we care about and [00:51:14] that we care about and um [00:51:14] um the answer comes back to [00:51:18] the answer comes back to do [00:51:18] do [Music] [00:51:21] [Music] something like this uh [00:51:30] something like this [00:51:31] something like this Maybe [00:51:37] wow sure [00:51:39] wow sure um so when we were doing a pre-training [00:51:40] um so when we were doing a pre-training right we had the Transformer that was [00:51:42] right we had the Transformer that was giving us our representations and we had [00:51:44] giving us our representations and we had this little last layer here this little [00:51:47] this little last layer here this little um sort of affine uh transformation that [00:51:50] um sort of affine uh transformation that moved us from the encoder's hidden State [00:51:51] moved us from the encoder's hidden State size to the vocabulary to do our [00:51:53] size to the vocabulary to do our prediction and we just removed this last [00:51:56] prediction and we just removed this last prediction layer here and let's say we [00:51:59] prediction layer here and let's say we want to do something that is uh [00:52:02] want to do something that is uh classifying the sentiment of the [00:52:04] classifying the sentiment of the sentence we just pick arbitrarily maybe [00:52:06] sentence we just pick arbitrarily maybe the last word in the sentence and we [00:52:08] the last word in the sentence and we stick a linear classifier on top and map [00:52:11] 
it to positive or negative, and then fine-tune the whole thing. [00:52:15] Okay. So the BERT model came in two sizes: one was 110 million parameters, one was 340 million — keep that in the back of your head, percolating, as we talk later about models with many, many more parameters. It was trained on 800 million words plus — that figure may be off, maybe it's 2.5 billion — but on the order of a billion words of text, give or take; quite a bit, still. And it was trained on what was considered at the time to be a whole lot of compute — it was Google doing this, and when they released it we thought, who has that kind of compute but Google? — although nowadays it's not considered to be very much. But fine-tuning is practical and common on a single GPU.
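The fine-tuning setup just described — drop the vocabulary-prediction layer and put a fresh linear classifier on one token's contextual vector — can be sketched as follows (the choice of the last token and all shapes are illustrative):

```python
import numpy as np

def sentiment_logits(hidden_states, W_cls, b_cls):
    """Fine-tuning head sketch: the pre-trained vocabulary-prediction
    layer is discarded, and a new linear classifier maps the contextual
    vector of one chosen token (here, the last) to two classes. During
    full fine-tuning, gradients update W_cls, b_cls AND the encoder."""
    h = hidden_states[-1]            # (d,) vector for the chosen token
    return W_cls @ h + b_cls         # (2,) logits: negative / positive

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(6, d))     # encoder output for a 6-token review
pred = int(np.argmax(sentiment_logits(hidden, rng.normal(size=(2, d)), np.zeros(2))))
```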
So you could take the BERT model that they'd spent a lot of time training and fine-tune it yourself, on your own task, on even a very small GPU. [00:53:20] So one question is: this seems really great — why don't we just use it for everything? And the answer is: well, what is the pre-training objective, and what is the structure of the pre-trained model good for? BERT is really good at filling in the blanks, but it's much less naturally used for actually generating text. I wouldn't want to use it to generate a summary of something, because it's not built for that: it has no natural notion of predicting the next word given all the words that came before it. So maybe I want BERT when I want a good representation of, say, a document — to classify it, give it one of a set of topic labels, or say whether it's toxic or non-toxic, or
whatever — but I wouldn't want to use it to generate a whole sequence. [00:54:12] Okay — some extensions of BERT. We had a question earlier about whether you just mask things out randomly. One thing that seems to work better is to mask out whole contiguous spans. If you mask just one subword — this one here is part of "irresistibly" — the problem is much easier than it would otherwise be, because you can tell very easily, from the subwords that came before it, what it should be; whereas if I mask a much longer sequence, it's a trade-off, but it makes for a harder problem. It ends up being better to do this span-based masking than random masking, and that might be because subwords make for very simple prediction problems when you mask out just one subword of a word rather than all of its subwords. [00:55:01] Okay, so
this ends up doing much better. [00:55:06] There's also a paper, the RoBERTa paper, which showed that the next-sentence prediction wasn't necessary. They also showed that they really should have trained BERT on a lot more text. So RoBERTa is a drop-in replacement for BERT: if you're thinking of using BERT, just use RoBERTa — it's better. And it gave us the intuition that we really don't know a whole lot about the best practices for training these things: you train for as long as you're willing to, and things do good stuff, and so on. It's very difficult to iterate on these models, because they're big and expensive to train. [00:55:41] Another thing you should know, for your final projects and the world ahead, is the notion of fine-tuning all the parameters of the network versus just a few of them. What we've talked
about so far is: you pre-train all the parameters, and then you fine-tune all of them as well, so all the parameter values change. An alternative, called parameter-efficient or lightweight fine-tuning, is to choose little bits of the parameters — some smart way of keeping most of the parameters fixed and fine-tuning only the others. The intuition is that the pre-trained parameters were really good, and you want to make the minimal change from the pre-trained model to the model that does what you want, so that you keep some of the generality, some of the goodness, of the pre-training. [00:56:26] One way this is done is called prefix tuning — prompt tuning is very similar — where you actually freeze all the parameters of the network. So I've pre-trained my network here, and I never change any of the parameter values; instead I make a
bunch of fake, pseudo-word vectors that I prepend to the very beginning of the sequence, and I train just them. It's sort of unintuitive: these would have been inputs to the network, but I'm specifying them as parameters, and I'm getting the whole thing to do my sentiment-analysis task just by changing the values of these fake words. This is nice because I keep all the good pre-trained parameters and just specify this diff, which ends up generalizing better — this is a very open field of research. It's also cheaper, because I don't have to compute or store the gradients and all the optimizer state with respect to the frozen parameters; I'm only training a very small number of parameters.
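A minimal sketch of prefix tuning, with a single frozen layer standing in for the whole pre-trained network (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prefix, seq_len = 16, 4, 10

# Pre-trained weights: frozen, never updated during fine-tuning.
W_frozen = rng.normal(size=(d, d))

# The ONLY trainable parameters: a few pseudo-word vectors prepended
# to every input sequence.
prefix = rng.normal(size=(n_prefix, d))

def encode(input_embs):
    """Prepend the trainable prefix, then run the frozen network."""
    x = np.concatenate([prefix, input_embs], axis=0)
    return np.tanh(x @ W_frozen)     # stand-in for the frozen encoder

out = encode(rng.normal(size=(seq_len, d)))
n_trainable, n_frozen = prefix.size, W_frozen.size   # 64 vs. 256 here
```

Only `prefix` would receive gradients and optimizer state, which is why this is so much cheaper than full fine-tuning.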
[00:57:33] [Student:] Those are parameters acting as if they were inputs — does it matter whether they go at the beginning or the end? — In a decoder you have to put them at the beginning, because otherwise you don't see them before you've processed the whole sequence. [00:57:50] [Student:] Could we attach a few new layers on top and train only those? — Absolutely; that works a bit better too. Another thing that works well — sorry, we're running out of time — is to take each weight matrix in my Transformer, freeze it, and learn a very low-rank little diff: I set the weight matrix's value to be the original value plus my very low-rank diff from the original one. This ends up being a similarly useful technique, and the overall idea, again, is that I'm learning far fewer parameters than I did via pre-training, and freezing most of the pre-trained parameters.
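A sketch of that low-rank diff (this is the idea now usually called LoRA; the rank and shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                          # full dimension, low rank (r << d)

W_frozen = rng.normal(size=(d, d))    # pre-trained weight matrix: frozen

# Trainable low-rank factors. B starts at zero so the initial diff is
# zero and the model begins exactly at its pre-trained behavior.
A = rng.normal(size=(r, d))
B = np.zeros((d, r))

def adapted_forward(x):
    """Effective weight = original frozen value plus the low-rank diff."""
    return x @ (W_frozen + B @ A).T

x = rng.normal(size=(3, d))
y = adapted_forward(x)                # trainable params: 2*r*d=64 vs. d*d=256
```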
pre-training and freezing most of the pre-training parameters [00:58:38] pre-training parameters okay encoder decoders so um for encoder [00:58:42] okay encoder decoders so um for encoder decoders we could do something like [00:58:44] decoders we could do something like language modeling right I've got my [00:58:45] language modeling right I've got my input sequence here encoder output [00:58:48] input sequence here encoder output sequence here and I could say this part [00:58:51] sequence here and I could say this part is my prefix for sort of having [00:58:53] is my prefix for sort of having bi-directional context and I could then [00:58:55] bi-directional context and I could then predict all the words that are sort of [00:58:58] predict all the words that are sort of in the latter half of the sequence just [00:59:01] in the latter half of the sequence just like a language model and that would [00:59:02] like a language model and that would work fine [00:59:04] work fine um and so this this is something that [00:59:06] um and so this this is something that you could do right you sort of take it [00:59:07] you could do right you sort of take it along text split it into two give half [00:59:10] along text split it into two give half of it to the encoder and then generate [00:59:12] of it to the encoder and then generate the second half with the decoder [00:59:15] uh but in practice what works much [00:59:18] uh but in practice what works much better is this notion of span corruption [00:59:20] better is this notion of span corruption span corruption is going to show up in [00:59:21] span corruption is going to show up in your assignment five and the idea here [00:59:24] your assignment five and the idea here is a lot like Bert but uh in a sort of [00:59:28] is a lot like Bert but uh in a sort of generative sense where I'm going to mask [00:59:31] generative sense where I'm going to mask out a bunch of words in the input thank [00:59:34] out a bunch of words in the 
input: "Thank you <mask token 1> me to your party <mask token 2> week." [00:59:40] And then at the output I generate the mask token and then what was supposed to be there, but the mask token replaced it, right? So from "Thank you" I predict "for inviting", and for the blank in "me to your party ___ week" I predict "last". And what this does is that it allows you to have bidirectional context, right? I get to see the whole sequence, except I can generate the parts that were missing. [01:00:07] So this feels a little bit like you mask out parts of the input, but you actually generate the output as a sequence, like you would in language modeling. So this might be good for something like machine translation, where I have an input that I want bidirectional context on, but then I want to generate an output, and I want to pre-train the whole thing. So this was shown to work better than language modeling at the scales that these, uh, folks at Google
were able to test, back in 2018. This is still quite popular. [01:00:35] Um, yeah, there are a lot of numbers; it works better than the other stuff; I'm not going to worry about it. [01:00:42] You know, there's a fascinating property of these models also. So T5 was the model that was originally introduced with salient span masking, and you can think of it as: at pre-training time you saw a bunch of things like "Franklin D. Roosevelt was born in [blank]" and you generated out the blank. And there's this task called open-domain question answering, which has a bunch of trivia questions, like "when was Franklin D. Roosevelt born?", and then you're supposed to generate out the answer as a string, just from your parameters, right? So you did a bunch of pre-training, you saw a bunch of text, and then you're supposed to generate these answers. And
what's fascinating is that this salient span masking method allowed you to pre-train, then fine-tune on some examples of trivia questions, and then when you tested on new trivia questions, the model would sort of implicitly extract from its pre-training data, somehow, the answer to that new question, which it never saw explicitly at fine-tuning time. So it learned this sort of implicit retrieval. Sometimes, you know, less than 50% of the time or whatever, but much more than random chance. [01:01:59] And that's just sort of fascinating, right? You've learned to access this latent knowledge that you stored up by pre-training. So, yeah, you just pass it the text "when was Roosevelt born?" and it would pass out an answer. And one thing to know is that the answers always look very fluent, they always look very reasonable, but they're
frequently wrong, and that's still true of things like ChatGPT. [01:02:25] Okay, so that's encoder-decoder models. Next up we've got decoders, and we'll spend a long time on decoders. So this is just our normal language model: I get a sequence of hidden states for my decoder, the words can only look at themselves, not the future, and then I predict, you know, the next word in the sentence. And then here again, to do sentiment analysis, I can maybe take the state for the last word and then predict happy or sad based on that last embedding, back-propagate the gradients through the whole network and train the whole thing, or do some kind of lightweight or parameter-efficient fine-tuning like we mentioned earlier. So this is pre-training a decoder, and I can just pre-train it on language modeling. [01:03:12]
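As a toy sketch of the sentiment setup just described: take the decoder's hidden state at the last position and score happy vs. sad with a linear head. All shapes and names here are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a decoder's output: hidden states for a 5-token input.
seq_len, hidden_dim, num_classes = 5, 8, 2  # classes: 0 = sad, 1 = happy
hidden_states = rng.standard_normal((seq_len, hidden_dim))

# Classification head: a single linear layer applied to the LAST position,
# since in a decoder only that state has seen the entire input.
W = rng.standard_normal((hidden_dim, num_classes)) * 0.1
b = np.zeros(num_classes)

last_state = hidden_states[-1]      # (hidden_dim,)
logits = last_state @ W + b         # (num_classes,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax over {sad, happy}

print(probs)
```

In practice you would back-propagate a cross-entropy loss from these logits either through the whole network (full fine-tuning) or only into the head plus a few extra parameters (the lightweight, parameter-efficient option mentioned above).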
Um, so again, you might want to do this if you are wanting to generate text, generate things. You sort of can use this like you use an encoder-decoder, but in practice, as we'll see, a lot of the biggest, most powerful pre-trained models tend to be decoder-only. It's not really clear exactly why, except they seem a little bit simpler than encoder-decoders, and you get to share all the parameters in one big network for the decoder, whereas in an encoder-decoder you have to split them, some into the encoder, some into the decoder. So for the rest of this lecture we'll talk only about decoders; even in modern things, the biggest networks do tend to be decoders. [01:04:00] So we're coming all the way back again to 2018, and the GPT model from OpenAI was a big success. It had 117 million parameters,
uh, it had, you know, 768-dimensional hidden states, and it had this vocabulary of 40,000-ish words that was defined via a method like what we showed at the beginning of class, trained on BooksCorpus. And, um, you know, the name "GPT" never actually showed up in the original paper; it's unclear what exactly it's supposed to refer to. [01:04:39] But this model was a precursor to all the things that you're hearing about nowadays. [01:04:55] So if we wanted to do something like natural language inference, which says, you know, take these pairs of sentences, "the man is in the doorway", "the person is near the door", and say that one entails the other, that the premise entails the hypothesis, that I can believe the hypothesis if I believe the premise, I just sort of concatenate them together, right? So give it maybe a start
token, pass in one sentence, pass in some delimiter token, pass in the other, and then predict, sort of, yes/no: entailment or not entailment. Fine-tuning GPT on this worked really well. [01:05:33] And then, you know, BERT came after GPT. BERT did a bit better; it had bidirectional context. But, you know, GPT did a sort of excellent job. And then came GPT-2, where they focused more on the generative abilities of the network. So, right, we now looked at a much larger network; we've gone from 117 million to 1.5 billion parameters, and given some sort of prompt, it could generate, at the time, a quite surprisingly coherent continuation to the prompt. So it's telling this sort of story about scientists and unicorns here. [01:06:11] And this size of model is still small enough that you can use it on a small GPU and fine-tune it and whatever, and its capability of generating
long, coherent texts was just exceptional at the time. It was also trained on more data, something like 9 billion words of text. [01:06:35] And then after GPT-2 we come to GPT-3, sort of walking through these models, and we come to a different way of interacting with the models. We've interacted with pre-trained models in two ways so far: we've sampled from the distribution that they define (we've generated text, via, like, a machine translation system or whatever), or we've fine-tuned them on a task that we care about and then taken their predictions. [01:07:03] But GPT-3 seems to have an interesting new ability: it's much larger, and it can do some tasks without any sort of fine-tuning whatsoever. GPT-3 is much larger than GPT-2, right? So we went from GPT at 100-ish million parameters, to GPT-2 at 1.5 billion, to GPT-3 at 175 billion
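For a feel of these scale jumps, here is the rough cost heuristic the lecture brings up shortly (training cost scales roughly with parameters × training tokens), applied to these sizes. The token counts are the lecture's approximate figures, and this is only an order-of-magnitude sketch in arbitrary units.

```python
# Rough heuristic: training cost ~ (parameters) x (training tokens).
# Token counts are the approximate figures mentioned in the lecture;
# only the ratios between the numbers mean anything.
models = {
    "GPT-2": (1.5e9, 9e9),    # ~1.5B params, ~9B words
    "GPT-3": (175e9, 300e9),  # ~175B params, ~300B words
}

costs = {name: params * tokens for name, (params, tokens) in models.items()}
for name, cost in costs.items():
    print(f"{name}: cost ~ {cost:.3g} param-tokens")

# The Chinchilla-style observation: at a FIXED compute budget, halving the
# parameter count lets you afford twice as many training tokens.
budget = costs["GPT-3"]
tokens_at_half_size = budget / (175e9 / 2)
print(f"tokens affordable at half of GPT-3's size: {tokens_at_half_size:.3g}")
```

By this crude measure GPT-3 cost thousands of times more than GPT-2 to train, which is why the parameters-versus-data allocation question matters so much.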
much larger, trained on 300 billion words of text. And this notion that it could figure out patterns in the example that it's currently seeing, and continue the pattern, is called in-context learning. [01:07:42] So you've got, you know, the word "thanks", and I pass in this little arrow and say, okay, "thanks" goes to "merci", and then "hello" goes to "bonjour", and then, you know, they give it all of these examples and ask it what "otter" should go to, and it's learned to continue the pattern and say that this is the translation of "otter". So now remember, this is a single input that I've given to my model, and I haven't said "oh, do translation", or fine-tuned it on translation, or whatever. I've just passed in the input, given it some examples, and then it is able, to some extent, to do this seemingly complex task. [01:08:22]
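The translation example above is literally just one long input string. Here is a minimal sketch of building such a few-shot prompt; the arrow layout is made up for illustration (not the exact formatting from the GPT-3 paper), and the word pairs are the ones from the lecture.

```python
# Build a few-shot in-context learning prompt: demonstrations of an
# English -> French pattern, then a query for the model to continue.
demonstrations = [
    ("thanks", "merci"),
    ("hello", "bonjour"),
]

prompt = "".join(f"{en} -> {fr}\n" for en, fr in demonstrations)
prompt += "otter -> "  # the model is asked to continue the pattern

print(prompt)
```

No gradient update happens anywhere here: whatever "learning" occurs is entirely in how the model conditions on this single input when predicting the next tokens.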
That's in-context learning. [01:08:25] And here are more examples: maybe you give it examples of addition, and then it can do some simple addition afterward; or, in this case, rewriting typos, where it can figure out how to rewrite typos; or in-context learning for machine translation. And this was the start of this idea that there were these emergent properties that showed up in much larger models, and it wasn't clear when looking at the smaller models that you'd get this qualitatively new behavior out of them. [01:08:57] Right, it's not obvious from just the language modeling signal, right? GPT-3 is just trained on that decoder-only objective, just predict the next word, and yet, as a result of that training, it learns to perform seemingly quite complex things as a function of its context. [01:09:15] Um, yeah, okay, one or two questions about
that. [01:09:26] This should be quite surprising, I think, right? So far we've talked about good representations, contextual representations, meanings of words in context. This is some very, very high-level pattern matching, right? It's coming up with patterns in just the input data, that one sequence of text that you've passed it so far, and it's able to identify how to complete the pattern. And as you think about what kinds of things this can solve, what its capabilities are, what its limitations are, this ends up being an open area of research: sort of, what are the kinds of problems that it maybe saw in the training data? Like, maybe GPT-3 saw a ton of pairs of words, right? It saw a bunch of, you know, bilingual dictionaries in its training data, so it learned to do something like this. Or is it doing something much more general, where it's really learning the
task in context? You know, the actual story, we're not totally sure; something in the middle. It seems like it has to be tied to your training data in ways that we don't quite understand, but there's also a non-trivial ability to learn new, at least, types of patterns just from the context. So this is a very interesting thing to work on. [01:10:34] Now, we've talked a lot about the size of these models so far, and as models have gotten larger they've always gotten better, and we train them on more data. Right, so GPT-3 was trained on 300 billion words of text, and it was 175 billion parameters. And, you know, at that scale it costs a lot of money to build these things, and it's very unclear whether you're getting the best use out of your money. Like, is bigger really what you should have been doing, in terms of the number of parameters? Um, so, you know,
the cost of training one of these is roughly: you take the number of parameters and you multiply it by the number of tokens that you're going to train it on, the number of words. And some folks at DeepMind (I forget the citation on this) realized through some experimentation that actually GPT-3 was just comically oversized, right? So Chinchilla, the model they trained, is less than half the size and works better, but they just trained it on way more data. [01:11:34] And this is sort of an interesting trade-off about, you know, how do you best spend your compute? I mean, you can't do this more than a handful of times, even if you're, you know, Google. So, you know, open questions there as well. [01:11:47] Another way of interacting with these networks that has come out recently is called chain of thought. [01:11:56] So, the prefix, right? We saw in the in-context learning
slide that the prefix can help specify what task you're trying to solve right now, and it can do even more. So here's standard prompting: we have a prefix of examples of questions and answers, so you have a question and then an example answer; that's your prompt, that's specifying the task. And then you have a new question, and you're having the model generate an answer, and it generates it wrong. [01:12:22] And chain-of-thought prompting says, well, how about in the demonstration we give the question and then we give this sort of decomposition of steps towards how to get an answer, right? So I'm actually writing this out as part of the input; I'm giving annotations, as a human, to say, oh, you know, to solve this sort of word problem, here's how you could think it through, ish. And then I give it a new question, and the model says, oh, I know what I'm
supposed to do: I'm supposed to first generate a sequence of intermediate steps, and then next say "the answer is" and then say what the answer is. And it turns out, and this should again be very surprising, that the model tends to generate plausible sequences of steps, and then much more frequently generates the correct answer after doing so, relative to trying to generate the answer by itself. [01:13:17] So you can think of this as a scratchpad; you can think of this as increasing the amount of computation that you're putting into trying to solve the problem, sort of writing out your thoughts, right? As I generate each word of this continuation, I'm able to condition on all the past words so far, and so maybe it just allows the network to decompose the problem into smaller, simpler problems, each of which it is more able to solve. [01:13:47] No one's really
sure why this works exactly, either, at this point. With networks that are this large, the emergent properties are both very powerful and exceptionally hard to understand, and very hard, you should think, to trust, because it's unclear what their capabilities are and what their limitations are, where they will fail. [01:14:09] So what do we think pre-training is teaching? Gosh, a wide range of things, even beyond what I've written in this slide, which I mostly wrote two years ago. Right, so it can teach you trivia, and syntax, and coreference, and maybe some lexical semantics, and sentiment, and some reasoning, like way more reasoning than we would have thought even three years ago. And yet these models also learn and exacerbate racism and sexism, all manner of biases. There's more on this later, but the generality of this is really, I think, what's taken
many people aback. And so increasingly these objects are not just studied for the sake of using them, but studied for the sake of understanding anything about how they work and how they fail. [01:14:56] Yeah, any questions? [01:15:05] Has anyone tried, like, benchmarking GPT for programming tasks, like how accurate it is, etc.? Yeah, the question is, has anyone tried benchmarking GPT for programming tasks, has anyone seen how well it does? Um, yes, so there are definitely examples of people using GPT-3 and GPT-4 for simple programming things, and, you know, the modern state-of-the-art competitive programming bots are all based on ideas from language modeling, and I think they're all also based on pre-trained language models themselves. Like, if you just take all of these ideas and apply them to, like, GitHub, then you get some very interesting emergent behaviors relating to code, uh,
and so yeah, I think all of the best [01:15:53] systems use this more or less, so lots of [01:15:56] benchmarking there, for sure. [01:15:59] is this the basis for what, like, GitHub [01:16:01] Copilot is going to do? the question is, [01:16:03] is this, what we just [01:16:04] mentioned, the basis for the GitHub [01:16:06] Copilot system? yes, absolutely. [01:16:10] we don't know exactly what it is in [01:16:12] terms of details, but it's all these [01:16:14] ideas. [01:16:15] what if you have a situation where you [01:16:17] have, you know, still a large amount of [01:16:19] data for, you know, general data, and then [01:16:21] you have also a large amount of data for [01:16:23] your fine-tuning task? at what point is it [01:16:25] better to train a new model for that [01:16:28] fine-tuning versus, you know, get data from [01:16:30] both? yeah, the question is, what if you [01:16:32] have a large amount of data for [01:16:33] pre-training and a large amount of data [01:16:34] for fine-tuning, when is it better to do [01:16:37] sort of a separate training on just the [01:16:39] fine-tuning data? [01:16:41] um, almost never. if you [01:16:44] have a bunch of data for the task that [01:16:47] you care about, what's frequently done [01:16:49] instead is three-part training, where you [01:16:52] pre-train on a very broad corpus, then [01:16:55] you sort of continue to pre-train, using [01:16:57] something like language modeling, on an [01:16:59] unlabeled version [01:17:01] of the labeled data that you have; you just, [01:17:03] like, strip the labels off and just treat [01:17:04] it all as text and do language modeling [01:17:06] on that, adapt the parameters a little [01:17:08] bit, and then do the final stage of [01:17:11] fine-tuning with the labels that you [01:17:12] want, and that works even better; there's this [01:17:14] interesting paper called Don't Stop [01:17:16] Pre-training. [01:17:18] nice, uh, final question. [01:17:21] that's a lot of questions... anyone new? [01:17:24] someone new with a question? [01:17:30] yeah, um, I was wondering, do you know if [01:17:33] there's, like, a lot of instances where a [01:17:35] pre-trained model can do some tasks it's not [01:17:38] seen before, I don't know? [01:17:40] yeah, so are there any instances of where [01:17:42] a pre-trained model can do a task
that [01:17:44] it hasn't seen before, uh, you know, [01:17:46] without fine-tuning? the question is, what [01:17:47] does 'hasn't seen before' mean, right? like, [01:17:51] these models, especially GPT-3 and similar [01:17:53] very large models, you know, during [01:17:55] pre-training, did it ever see something [01:17:57] exactly like this sort of word-problem [01:18:00] arithmetic? maybe, maybe not, it's actually [01:18:03] sort of unclear. it's clearly able to [01:18:06] recombine sort of bits and pieces of [01:18:08] tasks that it saw implicitly during [01:18:10] pre-training. we saw the same thing with [01:18:12] trivia, right? like, language modeling [01:18:13] looks a lot like trivia sometimes, where [01:18:15] you just read the first paragraph of a [01:18:18] Wikipedia page and it's kind of like [01:18:20] answering a bunch of little trivia [01:18:21] questions about where someone was born [01:18:22] and when. [01:18:24] um, but, like, it's never seen something [01:18:25] quite like this, and it's actually still [01:18:27] kind of astounding how much it's able to [01:18:29] do things that don't seem like they [01:18:32] should have shown up all that directly [01:18:34] in the pre-training data; quantifying [01:18:37] that extent is an open research problem. okay, that's it, let's call it. ================================================================================ LECTURE 010 ================================================================================ Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation Source: https://www.youtube.com/watch?v=N9L32bFieEY --- Transcript [00:00:05] hello everyone, [00:00:07] um, my name is Lisa, I'm a third-year PhD [00:00:09] student in the NLP group, I'm advised by [00:00:11] Percy and Tatsu. today I will give a [00:00:14] lecture on natural language generation, [00:00:15] and this is also the research area that [00:00:18] I work on, so I'm super excited about it. [00:00:20] I'm happy to answer any questions, both [00:00:22] during the lecture and after class, about [00:00:24] natural language generation. so NLG is a [00:00:27] super exciting area and is also moving [00:00:30] really, really fast, so today we will [00:00:33] discuss all the excitement of NLG. [00:00:36] but before we get into the really [00:00:37] exciting part, I have to make some [00:00:39] announcements. so first, it is very, very
important for you to remember to sign up [00:00:44] for AWS by midnight today. so this [00:00:48] concerns, this is related to, your homework [00:00:50] 5, whether you have GPU access, and then [00:00:53] also related to our final project. so [00:00:55] please, please remember to sign up [00:00:57] for AWS by tonight. and second, the [00:01:01] project proposal is due on Tuesday, next [00:01:04] Tuesday, and I think assignment 4 should [00:01:07] be just about due; hopefully you had fun with the [00:01:10] machine translation and stuff. and also, [00:01:13] assignment 5 is out today, I think just [00:01:16] now, and it is due on Friday, uh, like [00:01:20] basically Friday midnight. and, uh, last, we [00:01:24] will hold a, I will hold a, [00:01:26] Hugging Face Transformers library [00:01:27] tutorial this Friday, so if your final [00:01:31] project is related to implementing [00:01:33] Transformers or playing with large [00:01:34] language models, you should definitely go [00:01:36] to this tutorial, because it's going to [00:01:37] be very, very helpful. [00:01:40] um, also, yeah, just one more time, please [00:01:42] remember to sign up for AWS, because this [00:01:44] is the final hard deadline. [00:01:47] okay, cool, now moving on to the main [00:01:50] topic for today, [00:01:51] um, the very exciting natural language [00:01:53] generation stuff. so today we will [00:01:55] discuss what NLG is, review some [00:01:57] models, discuss how to decode from [00:02:00] language models and how to train [00:02:01] language models, [00:02:03] um, and we will also talk about [00:02:05] evaluations, and finally we'll discuss [00:02:07] ethical and risk considerations with [00:02:09] current NLG systems. so these natural [00:02:12] language generation techniques are going [00:02:14] to be really exciting, because this is [00:02:16] kind of getting us closer to explaining the [00:02:18] magic of ChatGPT, which is a super [00:02:21] popular model recently, and, practically [00:02:23] speaking, they could also help you with [00:02:25] your final project if you decide to work [00:02:27] on something related to text generation. [00:02:29] so, um, let's get started. to begin with, [00:02:32] let's ask the question of what is [00:02:34] natural language generation?
so natural language generation is [00:02:38] actually a really broad category. people [00:02:40] have divided NLP into natural language [00:02:43] understanding and natural language [00:02:45] generation. so the understanding part [00:02:47] mostly means that the task input is in [00:02:50] natural language, such as semantic [00:02:52] parsing, natural language inference, and [00:02:55] so on, whereas natural language [00:02:57] generation means that the task output is [00:02:59] in natural language. so NLG focuses on [00:03:03] systems that produce fluent, coherent, and [00:03:06] useful language outputs for humans to use. [00:03:09] historically there are many NLG [00:03:12] systems that use rule-based approaches, such [00:03:15] as templates or infilling, but nowadays [00:03:18] deep learning is powering almost every [00:03:20] text generation system, so this lecture [00:03:23] today will be mostly focused on deep [00:03:25] learning approaches. [00:03:27] so, um, first, what are some examples of [00:03:30] natural language generation? it's [00:03:32] actually everywhere, including our [00:03:33] homework. machine translation is a form [00:03:36] of NLG, where the input is some text [00:03:38] in the source language and the output is [00:03:41] generated text in a target language. [00:03:44] digital assistants such as Siri or [00:03:46] Alexa are also NLG systems: they [00:03:50] take in dialogue history and generate [00:03:52] continuations of the conversation. [00:03:55] um, there are also summarization systems [00:03:57] that take in a long document, such as a [00:04:00] research article, and then the idea is [00:04:02] trying to summarize it into a few [00:04:04] sentences that are easy to read. [00:04:07] so beyond these classic tasks there are [00:04:09] some more interesting uses, like creative [00:04:12] story writing, where you can prompt a [00:04:14] language model with a story plot and [00:04:16] then it will give you some creative [00:04:18] stories that are aligned with the plot. [00:04:20] there is data-to-text, where you [00:04:22] give the language model some database or [00:04:24] some tables, and then the idea is that it [00:04:27] will output some textual description of [00:04:29] the table content.
[00:04:30] and finally there are also, like, visual [00:04:32] description based NLG systems, like image [00:04:35] captioning or, like, image-based [00:04:38] storytelling. [00:04:40] so a really cool example, [00:04:43] um, is the popular ChatGPT model. so [00:04:46] ChatGPT is also an NLG system; it is [00:04:49] very general purpose, so therefore you [00:04:51] can use it to do many, many different [00:04:54] tasks with different prompts. for example, [00:04:56] we can use ChatGPT to simulate a [00:04:59] chatbot; it can answer [00:05:01] questions about, like, creative gifts for [00:05:04] a 10-year-old. [00:05:05] it can be used to do poetry generation; [00:05:08] like, for example, we can ask it to [00:05:11] generate a poem about sorting algorithms, [00:05:12] and, well, I wouldn't say [00:05:15] it's very poetic, but at least it has the [00:05:17] same format as a poem, and the content is [00:05:19] actually correct. [00:05:22] so, um, ChatGPT can also be used [00:05:25] in some really useful settings, like [00:05:28] web search. so here Bing is augmented [00:05:31] with ChatGPT, [00:05:33] and there are some tweets saying that the magic [00:05:34] of ChatGPT is that it actually makes [00:05:36] people be happy to use Bing. [00:05:38] um, [00:05:42] so there are so many tasks that actually [00:05:44] belong to the NLG category, so how do we [00:05:47] categorize these tasks? one common way is [00:05:49] to think about the open-endedness of the [00:05:51] task. so here we draw a line for the [00:05:54] spectrum of open-endedness. on the one [00:05:57] end we have tasks like machine [00:05:58] translation and summarization, so we [00:06:01] consider them not very open-ended, [00:06:03] because for each source sentence the [00:06:06] output is almost determined by the input, [00:06:08] because basically we are trying to do [00:06:11] machine translation: the semantics should [00:06:13] be exactly similar to the input sentence. [00:06:15] so there are only a few ways that you [00:06:17] can rephrase the output: like, 'authorities [00:06:19] have announced that today is a national [00:06:21] holiday,' you can rephrase it a little bit [00:06:23] to say 'today is a national holiday, [00:06:25] announced by the authorities,' but the [00:06:27]
actual space is really small, because you [00:06:29] have to make sure the semantics doesn't [00:06:31] change. so we can say that the output [00:06:34] space here is not very diverse. [00:06:37] um, and moving to the middle of the [00:06:39] spectrum, there are dialogue tasks, such as [00:06:41] task-driven dialogue or chit-chat [00:06:43] dialogue. so we can see that for each [00:06:45] dialogue input there are multiple [00:06:47] responses, and the degree of freedom has [00:06:50] increased. here we can say, like, we can [00:06:52] respond by saying 'good, and you?' or we can [00:06:55] say, 'oh, thanks for asking, barely [00:06:57] surviving on my homeworks.' so here we are [00:07:00] observing that there are actually [00:07:01] multiple ways to continue this [00:07:03] conversation, and then this is where we [00:07:05] say the output space is getting more and [00:07:07] more diverse. [00:07:09] and on the other end of the spectrum [00:07:12] there are very open-ended generation [00:07:14] tasks, like story generation. so given an [00:07:17] input like 'write me a story about three [00:07:19] little pigs,' there [00:07:21] are so many ways to continue the prompt, right? we can write [00:07:22] about them going to school, building [00:07:24] houses like they always do. [00:07:26] um, so the valid output space here is extremely [00:07:29] large, and we call this open-ended [00:07:31] generation. [00:07:33] so it's hard to really draw a boundary [00:07:35] between open-ended and non-open-ended [00:07:38] tasks, but we still try to give a rough [00:07:40] categorization. so open-ended [00:07:42] generation refers to tasks whose output [00:07:44] distribution has a high degree of [00:07:46] freedom, whereas non-open-ended generation [00:07:49] refers to tasks where the input [00:07:52] will almost certainly determine the [00:07:55] output generation. examples of non-open-ended [00:07:58] generation are machine [00:07:59] translation, summarization, and examples [00:08:01] of open-ended generation are story [00:08:03] generation, chit-chat dialogue, task-oriented [00:08:05] dialogue, etc. [00:08:07] so how do we formalize this [00:08:10] categorization? one way of formalizing is [00:08:12] by computing the entropy of the NLG [00:08:15] system.
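The entropy idea here is easy to make concrete. Below is a minimal, self-contained illustration (not from the lecture: the 4-token vocabulary and both distributions are invented numbers): a peaked next-token distribution, as in machine translation, has low Shannon entropy, while a flat distribution over many equally valid continuations, as in story generation, has the maximum entropy.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
mt_dist = [0.90, 0.05, 0.03, 0.02]      # translation-like: output nearly determined
story_dist = [0.25, 0.25, 0.25, 0.25]   # story-like: many equally valid continuations

print(entropy(mt_dist))     # low entropy -> less open-ended
print(entropy(story_dist))  # 2.0 bits, the maximum for 4 outcomes -> more open-ended
```

In practice one would average such per-step entropies over many model predictions, rather than look at a single hand-picked distribution.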
so high entropy means that we [00:08:21] are to the right of the spectrum, so it is more open-ended, and low entropy means [00:08:23] that we are to the left of the spectrum [00:08:25] and less open-ended. [00:08:27] so these two classes of NLG tasks [00:08:30] actually require different decoding and [00:08:32] training approaches, as we'll talk about [00:08:34] later. [00:08:36] okay, cool, now let's recall some previous [00:08:39] lectures and review the NLG models and [00:08:41] trainings that we have studied before. [00:08:44] so I think we discussed the basics of [00:08:46] natural language generation. so here is [00:08:49] how an autoregressive language model [00:08:50] works: at each time step, our model would [00:08:53] take in a sequence of tokens as input, [00:08:56] and here it is y less than t, and the [00:08:59] output is basically the new token y t. so [00:09:03] to decide on y t, we first use the model [00:09:06] to assign a score for each token in the [00:09:08] vocabulary, denoted as S, and then we [00:09:11] apply softmax to get the next-token [00:09:13] distribution P, and we choose a token [00:09:16] according to this next-token distribution.
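The loop just described (score every vocabulary token, softmax into a next-token distribution P, then let a decoding algorithm g pick a token and feed it back in) can be sketched in a few lines. This is a toy stand-in, not the lecture's actual model: the five-token vocabulary and the bigram score table are invented for illustration, and g is taken here to be greedy argmax selection, one simple choice of decoding algorithm.

```python
import math

VOCAB = ["<s>", "I", "like", "NLP", "</s>"]

def score(prefix):
    """Stand-in for a trained LM: assign a score to every vocab token.
    A real model conditions on the whole prefix y_{<t}; this toy table
    only looks at the last token (all numbers are made up)."""
    bigram_scores = {
        "<s>":  [0.0, 3.0, 0.1, 0.1, 0.1],
        "I":    [0.0, 0.1, 3.0, 0.5, 0.1],
        "like": [0.0, 0.1, 0.1, 3.0, 0.5],
        "NLP":  [0.0, 0.1, 0.1, 0.1, 3.0],
        "</s>": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
    return bigram_scores[prefix[-1]]

def softmax(s):
    exp = [math.exp(x) for x in s]
    z = sum(exp)
    return [e / z for e in exp]

def generate(max_len=10):
    ys = ["<s>"]
    for _ in range(max_len):
        p = softmax(score(ys))        # next-token distribution P
        yt = VOCAB[p.index(max(p))]   # g = greedy: pick the argmax token
        ys.append(yt)                 # feed the prediction back in
        if yt == "</s>":              # stop at end of sequence
            break
    return ys

print(generate())  # ['<s>', 'I', 'like', 'NLP', '</s>']
```

Swapping g for sampling from P instead of taking the argmax changes the decoding algorithm without touching the model at all.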
[00:09:19] and in summary, once we have predicted y t [00:09:21] hat, we then pass it back into the [00:09:22] language model as the input, predict y [00:09:25] hat t plus 1, and then we do so [00:09:27] recursively until we reach the end of [00:09:30] the sequence. [00:09:31] so, any questions so far? [00:09:35] okay, good. [00:09:37] um, so for the two types of NLG tasks [00:09:40] that we talked about, like the open-ended and [00:09:42] non-open-ended tasks, they tend to prefer [00:09:44] different model architectures. so for [00:09:47] non-open-ended tasks like machine [00:09:48] translation, we typically use an encoder-decoder [00:09:51] system, where, like, the autoregressive [00:09:53] decoder that we just talked [00:09:55] about functions as the decoder, and then [00:09:57] we have another bidirectional encoder [00:09:59] for encoding the inputs. so this is kind [00:10:01] of what you implemented for assignment [00:10:03] four, because the encoder is, like, the [00:10:06] bidirectional LSTM and the decoder is [00:10:09] another LSTM that is autoregressive. [00:10:12] so for more open-ended tasks, typically [00:10:15] an autoregressive generation model is the [00:10:18] only component. [00:10:20] um, of course, like, these architectures are [00:10:22] not really hard constraints, because an [00:10:25] autoregressive decoder alone can also be [00:10:27] used to do machine translation, and an [00:10:29] encoder-decoder model can also be used [00:10:31] for story generation. [00:10:33] so this is kind of the convention for [00:10:35] now, but it's a reasonable convention, [00:10:37] because, like, using a decoder-only model [00:10:39] for MT tends to hurt performance [00:10:42] compared to an encoder-decoder model for [00:10:44] MT, and using an encoder-decoder model [00:10:46] for open-ended generation seems to, like, [00:10:49] achieve similar performance to a decoder-only [00:10:51] model, and therefore if you have the [00:10:54] compute budget to train an encoder-decoder [00:10:55] model, you might just be better [00:10:57] off by only training a larger decoder [00:10:59] model. so it's kind of more of an [00:11:01] allocation-of-resources problem than [00:11:03] whether these two architectures will [00:11:05] type-check with your task. [00:11:08] so, [00:11:09] um, okay, so how do we train such a [00:11:12] language model? in previous lectures we [00:11:14] talked about how language models [00:11:16] are trained by maximum likelihood: so [00:11:19] basically we are trying to maximize the [00:11:21] probability of the next token y t [00:11:24] given the preceding words, and this is [00:11:26] our optimization objective. so at each [00:11:30] time step, this can be regarded as a [00:11:32] classification task, because we are [00:11:34] trying to distinguish the actual word [00:11:36] y t star from all the remaining words in [00:11:39] the vocabulary. [00:11:40] and this is also called teacher forcing, [00:11:42] because at each time step we are [00:11:45] using the gold standard, uh, y star [00:11:48] less than t, as input to the model, [00:11:51] whereas presumably at generation time [00:11:54] you wouldn't have any access to y star, [00:11:56] so you would have to use the model's own [00:11:58] prediction to feed back into the model [00:12:00] to generate the next token, and that is [00:12:02] called student forcing, which we'll talk [00:12:04] about in detail later.
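The maximum-likelihood / teacher-forcing objective can be written down directly: at each step t, condition on the gold prefix y*_{<t} and pay the cross-entropy of the gold next token y*_t. Below is a minimal sketch with a made-up stand-in for the model's next-token distribution (a trained LM would of course compute a different distribution for every prefix).

```python
import math

def nll_teacher_forcing(next_token_dist, gold_tokens):
    """Teacher forcing: at every step t, condition on the GOLD prefix
    y*_{<t} and score the gold next token y*_t -- a per-step
    classification over the vocabulary. Returns the total negative
    log-likelihood, which training minimizes."""
    loss = 0.0
    for t in range(1, len(gold_tokens)):
        p = next_token_dist(gold_tokens[:t])   # distribution given the gold prefix
        loss += -math.log(p[gold_tokens[t]])   # cross-entropy of the gold token
    return loss

# Made-up stand-in for the model: a fixed next-token distribution.
def toy_dist(prefix):
    return {"I": 0.2, "like": 0.3, "NLP": 0.4, "</s>": 0.1}

gold = ["<s>", "I", "like", "NLP", "</s>"]
print(nll_teacher_forcing(toy_dist, gold))  # ~6.03 = -(ln 0.2 + ln 0.3 + ln 0.4 + ln 0.1)
```

Student forcing differs only in what is fed in: the model's own earlier predictions replace the gold prefix y*_{<t}.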
will talk in detail later [00:12:12] we never used that word before what does [00:12:15] we never used that word before what does it mean Ultra aggressive oh this means [00:12:17] it mean Ultra aggressive oh this means like uh so let's look at this animations [00:12:20] like uh so let's look at this animations again [00:12:22] again oops sorry oh it just looks like uh you [00:12:24] oops sorry oh it just looks like uh you are generating word from left to right [00:12:26] are generating word from left to right one by one so here suppose that you are [00:12:28] one by one so here suppose that you are given a y less than T and then other [00:12:31] given a y less than T and then other aggressive for your first general YT and [00:12:33] aggressive for your first general YT and then once you have YT you'll fit it back [00:12:34] then once you have YT you'll fit it back in general YT plus one and then feed it [00:12:38] in general YT plus one and then feed it back and generate another thing so this [00:12:39] back and generate another thing so this left to right nature because you are [00:12:41] left to right nature because you are using chain rule to like condition on [00:12:43] using chain rule to like condition on the the tokens that you just generated [00:12:45] the the tokens that you just generated this chain rule thing is called Auto [00:12:47] this chain rule thing is called Auto regressive and typically like I think [00:12:50] regressive and typically like I think conventionally we are doing left to [00:12:51] conventionally we are doing left to right other aggressive by generating [00:12:52] right other aggressive by generating from left to right but there are also [00:12:54] from left to right but there are also like other more interesting models that [00:12:56] like other more interesting models that can do backward or influence and other [00:12:58] can do backward or influence and other things this idea of generating one token [00:13:00] things this idea of 
[00:13:04] Cool — any other questions? Yep. So at inference time, our decoding algorithm will define a function to select a token from this distribution. We've discussed that we can use the language model to compute this P, the next-token distribution, and then g here, in our notation, is the decoding algorithm, which selects the token we are actually going to use as y_t. The obvious decoding algorithm is to greedily choose the highest-probability token as ŷ_t at each time step. Well, this basic algorithm sort of works — it worked for your homework — but to do better, there are two main avenues we can take: we can decide to improve decoding, and we can also decide to improve training. Of course there are other things we can do — we can improve the training data, and we can improve the model architecture — but for this lecture we will focus on decoding and training.
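The pipeline just described — compute scores, softmax them into a distribution P, then apply a decoding function g — can be sketched as follows, with greedy argmax used for g. The helper names are mine, not the course's code:

```python
import math

def softmax(scores):
    """Turn a score vector S into a probability distribution P."""
    m = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def g_greedy(p):
    """Decoding function g: select a token index from P.
    Greedy decoding simply takes the highest-probability token."""
    return max(range(len(p)), key=lambda i: p[i])
```

Every decoding algorithm in this lecture is a different choice of g over the same softmax-normalized distribution.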
[00:14:01] So now let's talk about how decoding algorithms work for natural language generation models. Before that, I'm happy to take any questions about the previous slides. [Student question about teacher forcing.] I think I'll go into this in detail later, but sure. Basically, for teacher forcing, the idea is that you train the language model where you already observe the gold text: you use the gold text up until timestep t, put it into the model, and the model tries to predict y_{t+1}. Whereas student forcing means you don't have access to this gold reference data, but you are still trying to generate a sequence, so you have to use the text that you generated yourself with the model and feed it back into the model as input to predict step t+1. That's the primary difference.
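The teacher-forcing vs. student-forcing distinction can be sketched like this; `model_step` is a hypothetical stand-in for a real language-model call:

```python
def teacher_forcing_contexts(gold):
    """Training: at each step t the model conditions on the GOLD prefix
    y_1..y_t and is asked to predict y_{t+1}."""
    return [gold[:t] for t in range(1, len(gold))]

def student_forcing_generate(model_step, start, n_steps):
    """Generation: at each step the model conditions on its OWN previous
    outputs, since no gold reference is available."""
    seq = [start]
    for _ in range(n_steps):
        seq.append(model_step(seq))   # the model's own token is fed back in
    return seq
```
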
[00:15:01] Cool. So what is decoding all about? At each time step, our model computes a vector of scores for each token: it takes in the preceding context y<t and produces a score S. Then we compute the probability distribution P over these scores by applying softmax to normalize them. And our decoding algorithm is defined as this function g, which takes in the probability distribution and tries to map it to some word — basically, it selects a token from this probability distribution. In the machine translation lecture we talked about greedy decoding, which selects the highest-probability token from this distribution P. And we also talked about beam search, which has the same objective as greedy decoding: both are trying to find the most likely string as defined by the model.
[00:15:57] But instead of doing so greedily, with beam search we actually explore a wider range of candidates, by always keeping k candidates in the beam. So overall, this maximum-probability decoding is good for low-entropy tasks like machine translation and summarization, but it encounters more problems for open-ended generation: the most likely string is actually very repetitive when we try to do open-ended text generation. As we can see in this example, the context is perfectly normal — it's about a unicorn trying to speak English — and the first part of the continuation looks great: it's valid English, it talks about science. But suddenly it starts to repeat, and it repeats, I think, an institution's name. So why does this happen?
[00:16:56] If we look at, for example, this plot, which shows the probability the language model assigns to the sequence "I don't know," we can see the pattern. It has a regular probability at first, but if we keep repeating this phrase — "I don't know, I don't know, I don't know," ten times — then we can see a decreasing trend in the negative log-likelihood. The y-axis is the negative log probability, and this decreasing trend means that the model actually assigns higher and higher probability as the repetition goes on. This is quite strange, because it suggests there is a self-amplification effect: the more repeats we have, the more confident the model becomes about the repeat. And this keeps going — for "I am tired" repeated 100 times, there is a continuously decreasing trend until the model is almost 100% sure that it's going to keep repeating the same thing.
[00:17:48] And sadly, this problem is not really solved by architecture. Here the red curve is an LSTM model and the blue curve is a Transformer model, and we can see that both models suffer from the same problem. Scale also doesn't solve it — we tend to believe that scale is the magical thing in NLP, but even models with 175 billion parameters will still suffer from repetition if we try to find the most likely string. So how do we reduce repetition? One canonical approach is n-gram blocking. The principle is very simple: you just don't want to see the same n-gram twice. If we set n to be 3, then for any text that contains the phrase "I am happy," the next time you see the prefix "I am," n-gram blocking will automatically set the probability of "happy" to zero, so that you will never see this trigram again.
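A minimal sketch of the trigram-blocking rule just described (n = 3); the function name and the renormalization step are my own choices, not a specific library's API:

```python
def block_ngrams(probs, vocab, generated, n=3):
    """Zero out any token that would complete an already-seen n-gram.
    With n=3: whenever the last n-1 generated tokens have occurred before,
    ban every token that previously followed that (n-1)-gram."""
    if len(generated) < n - 1:
        return probs
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - (n - 1)):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])       # token that followed before
    out = [0.0 if tok in banned else p for p, tok in zip(probs, vocab)]
    z = sum(out)
    return [p / z for p in out] if z > 0 else out  # renormalize what remains
```

For example, after generating "I am happy . I am", the token "happy" gets probability zero the next time the prefix "I am" appears.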
[00:18:46] But clearly this n-gram blocking heuristic has some problems, because sometimes it is quite common to want a person's name to appear twice or three times or even more in the text, and n-gram blocking eliminates that possibility. So what are better — possibly more complicated — options? For example, we can use a different training objective: instead of training by MLE, we can train with an unlikelihood objective. In this approach, the model is penalized for generating already-seen tokens, so it's kind of like moving the n-gram blocking idea into training time: rather than enforcing the constraint at decoding time, we decrease the probability of repetition during training. Another training objective is coverage loss, which uses the attention mechanism to prevent repetition.
[00:19:36] Basically, if you regularize your attention so that it is always attending to different words for each token, then it is highly likely that you are not going to repeat, because repetition tends to happen when you have similar attention patterns. A different angle: instead of searching for the most likely string, we can use a different decoding objective. Maybe we can search for strings that maximize the difference between the log probabilities of two models — say, maximize log p of a large model minus log p of a small model. Because both models are repetitive — they would both assign high probability to repetition — the repetition cancels out, and after applying this new objective the repetitive continuations are actually penalized. So here comes the broader question.
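Both ideas can be sketched in a few lines: a single-step unlikelihood term that adds a -log(1 − p) penalty for already-generated tokens, and a contrastive-style score that subtracts a small model's log-probabilities from a large model's. These are simplified one-step illustrations, not the full training or decoding procedures:

```python
import math

def unlikelihood_loss(p_next, gold_idx, seen_indices):
    """Usual MLE term for the gold token, plus -log(1 - p) penalties that
    push down the probability of previously generated (repeated) tokens."""
    mle = -math.log(p_next[gold_idx])
    penalty = -sum(math.log(1.0 - p_next[i]) for i in seen_indices)
    return mle + penalty

def contrastive_score(p_large, p_small):
    """Score tokens by log p_large - log p_small; behavior shared by both
    models (such as repetition) cancels out, so it stops being favored."""
    return [math.log(pl) - math.log(ps) for pl, ps in zip(p_large, p_small)]
```
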
[00:20:27] Is finding the most likely string even a reasonable thing to do for open-ended text generation? The answer is probably no, because it doesn't really match the human pattern. We can see in this plot that the orange curve is the human pattern and the blue curve is machine-generated text using beam search. With human text there is actually a lot of uncertainty, as we can see from the fluctuation of the probabilities: for some words we are very certain, for some words we are a little bit unsure. Whereas the model distribution is always very sure — it is always assigning probability one to the sequence. So since there is an obvious mismatch between the two distributions, it suggests that maybe searching for the most likely string is not the right decoding objective at all. Any questions so far before we move on? Yeah?
[00:21:19] [Student asks whether this could serve as a detector of machine-generated text.] Not really, because this can only detect the really simple things that humans are also able to detect, like repetition. To avoid the problems that we've talked about, I'll talk about some other decoding families that generate more robust text — text whose probability distribution actually looks like the orange curve — so I wouldn't say this is the go-to answer for watermarking or detection. Oh yeah — okay, cool, so she asked whether this mechanism of plotting the probabilities of human text and machine-generated text is one way of detecting whether some text was generated by a model or a human. My answer is I don't think so, but it could be an interesting research direction.
[00:22:16] Because I feel like there are more robust decoding approaches that generate text that actually fluctuates a lot. So yeah, let's talk about decoding algorithms that are able to generate text that fluctuates. Given that searching for the most likely string is a bad idea, what else should we do, and how do we simulate that human pattern? The answer is to introduce randomness and stochasticity into decoding. So suppose that we are sampling a token from this distribution P — we are trying to sample ŷ_t from the distribution. It is random, so you can essentially sample any token: previously you were restricted to selecting the single most likely token, but now you can select a lower-probability word like "bathroom" instead. However, sampling introduces a new set of problems: since we never zero out any token probabilities, vanilla sampling makes every token in the vocabulary a viable option.
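Sampling ŷ_t from P can be sketched as a plain inverse-CDF draw over the full distribution; the helper name is mine:

```python
import random

def sample_token(probs, rng=random):
    """Vanilla sampling: draw a token index from the full distribution P.
    Every token with nonzero probability is a possible outcome."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1   # guard against floating-point round-off
```
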
[00:23:16] And in some unlucky cases we might end up with a bad word. Even assuming that we already have a very well-trained model — even if most of the probability mass of the distribution is over a limited set of good options — the tail of the distribution will still be very long, because we have so many words in our vocabulary. Therefore, if we add up all those wrong tail tokens, in aggregate they still have considerable mass. Statistically speaking, this is called a heavy-tailed distribution, and language is exactly a heavy-tailed distribution. For example, many tokens are probably really wrong in a given context, and given that we have a good language model, we assign each of them very little probability. But this doesn't really solve the problem: there are so many of them that, aggregated as a group, they still have a high chance of being selected.
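The aggregate-tail point can be checked numerically with a made-up Zipf-like distribution; the vocabulary size and shape here are illustrative, not measurements of any real model:

```python
def tail_mass(probs, head_size):
    """Total probability carried by everything outside the top tokens."""
    return sum(sorted(probs, reverse=True)[head_size:])

# Zipf-like toy distribution over a 50,000-token vocabulary.
probs = [1.0 / (i + 1) for i in range(50_000)]
z = sum(probs)
probs = [p / z for p in probs]
# Every tail token is individually tiny (probability < 0.001), yet the tail
# past the top 100 tokens still carries over half of the total mass.
```
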
[00:24:07] And the solution we have for this long-tail problem is that we should just cut off the tail — we should just zero out the probabilities we don't want. One such idea is top-k sampling, where we only sample from the top k tokens in the probability distribution. Any questions for now? [Student] The model we were looking at a second ago had some really low-probability samples as well on the graph, right? How would this cope with that? [Instructor] You mean this one, or the orange-and-blue graph of human versus machine? Oh yeah — so top-k will make it impossible to generate the super-low-probability tokens. So technically it's not exactly simulating that pattern, because now the super-low-probability tokens are gone.
[00:25:09] Whereas a human can generate super-low-probability tokens in a fluent way — but yeah, that could be another hint people can use for detecting machine-generated text. Yeah? [Student asks whether it depends on the type of text you want to generate — for example poems, novels, or more creative writing — and how you would adjust the hyperparameter.] Yeah, for sure — k is a hyperparameter, and depending on the type of task you will choose k differently: maybe for closed-ended tasks k should be small, and for open-ended tasks k should be large. Yeah, question in the back. [Student] I guess intuitively this builds on one of the earlier questions: why don't we consider the case where we sample and weight the probability of each word by its score, rather than just looking at the top k — a weighted-sampling type of situation, so we still have a small but non-zero probability of selecting those words?
[00:26:00] [Instructor] I think top-k is also weighted like that: top-k just zeros out the tails of the distribution, but for the tokens it didn't zero out, it's not a uniform choice among the k — it is still choosing proportionally to the scores you computed. [Student] Is it just computational, then? Like, 17,000 words could be cut down to 10 or something. [Instructor] Yeah, sure, that could be one gain of top-k decoding — your softmax will take in fewer candidates — but it's not the main reason, and I'll keep talking about the main reasons. So we've discussed this part, and here is, formally, what is happening for top-k sampling: we are now only sampling from the top k tokens of the probability distribution.
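A sketch of that truncation step — keep the k highest-probability tokens, zero out the rest, and renormalize before sampling; the helper name is mine:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, zero the rest, and
    renormalize. Sampling then proceeds over the truncated distribution,
    still proportionally to the surviving probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    out = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(out)
    return [p / z for p in out]
```
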
[00:27:04] As we've said, k is a hyperparameter, so we can set k to be large or small. If we increase k, we are making our output more diverse, but at the risk of including some bad tokens; if we decrease k, we are making more conservative and safe choices, but possibly the generation will be quite generic and boring. So, is top-k decoding good enough? The answer is: not really, because we can still find some problems with it. For example, in the context "She said, 'I never ___'," there are many words that are still valid options — such as "want" or "ate" — but those words got zeroed out because they are not within the top k candidates; top-k can cut off too quickly, and this leads to bad recall for your generation system. Similarly, another failure of top-k is that it can also cut off too slowly. In this example, "code" is not really a valid answer according to common sense, because you probably don't want to eat a piece of code.
[00:28:01] But its probability remains non-zero, meaning that the model might still sample "code" as an output — despite the low probability, it might still happen — and this means bad precision for the generation model. So, given these problems with top-k decoding, how can we address them? How can we address the issue that there is no single k that fits all circumstances? This is basically because the probability distributions we sample from are dynamic. When the probability distribution is relatively flat, a small k removes many viable options, and we want k to be larger in that case. Similarly, when the distribution P is very peaked, a high k would allow too many options to be viable, and instead we might want a smaller k so that we are being safer.
that we are being safer.

[00:29:01] So the solution here is that maybe k is just a bad hyperparameter, and instead of a fixed k we should think in terms of probability: we should sample from the tokens in the top p percentile of the cumulative probability mass (of the CDF, for example).

[00:29:20] The advantage of top-p sampling, where we sample from the top p percentile of the cumulative probability mass, is that it is equivalent to having an adaptive k for each different distribution. Let me explain what I mean by an adaptive k. [00:29:39] The first distribution is a regular power law of language, which is fairly typical; doing top-k sampling means we select the top k, and doing top-p sampling means we are zooming into
something that's similar to top-k. But if I have a relatively flat distribution like the blue one, we can see that doing top-p means we include more candidates; and if we have a more skewed distribution like the green one, doing top-p means we actually include fewer candidates. So by selecting the top p percentile of the probability distribution, we effectively have a more flexible k, and therefore a better sense of what the good options are under the model. Any questions about top-p and top-k decoding?

[00:30:32] So everything's clear? Yeah, sounds good.

[00:30:36] So, to go back to that question: top-k is not necessarily saving compute; this whole idea is not really intended to save compute, because in the case of top-p, in order to select the top p percentile we still need to compute the softmax over the entire vocabulary set.
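The two truncation rules just described can be illustrated with a minimal NumPy sketch (my own, not code from the lecture; it assumes `probs` is already a full softmax distribution over the vocabulary):

```python
import numpy as np

def top_k_filter(probs, k):
    # keep only the k most probable tokens, zero out the rest, renormalize
    filtered = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]
    filtered[top] = probs[top]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # keep the smallest set of top tokens whose cumulative mass reaches p
    # (this is the "adaptive k"), zero out the tail, renormalize
    order = np.argsort(probs)[::-1]              # token indices, most probable first
    cum = np.cumsum(probs[order])
    # keep token i if the mass accumulated *before* it is still below p
    keep = np.concatenate(([True], cum[:-1] < p))
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()
```

On a flat distribution the `cum[:-1] < p` test keeps many tokens; on a peaked one it keeps few, which is exactly the adaptive-k behavior described above.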
To do top-p properly, we have to compute p properly, so it's not really saving compute, but it is improving performance.

[00:31:05] Moving on: there is much more to decoding algorithms besides the top-k and top-p we've discussed. There are some more recent approaches like typical sampling, where the idea is to rescale the scores based on the entropy of the distribution and try to generate text whose probability is close to the negative entropy of the data distribution. This roughly means that for a closed-ended (non-open-ended) task you want smaller entropy, so you want the negative log probability to be smaller, so you want probabilities to be larger; it checks out quite well. Additionally, there is also epsilon sampling, coming from
John. This is an idea where we set a threshold to lower-bound the probabilities: basically, if a word's probability is less than 0.03, for example, then that word will never appear in the output; it will never be part of your output because it still has very low probability.

[00:32:09] Yeah? [00:32:13] Oh cool, great question. The entropy of a distribution is defined as follows: suppose we have a discrete distribution; we enumerate over x and take the negative log probability, so H(p) = −Σ_x p(x) log p(x). Written from an expectation perspective, it's the expectation of the negative log probability, H(p) = E[−log p(x)]. Okay, I have to do a little bit here. So this is the entropy of a distribution. [00:32:44] Basically, if your distribution is very concentrated on a few words, then the entropy will be relatively small.
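A small sketch of the two ideas above, the epsilon threshold and the entropy of a discrete distribution (illustrative only; the 0.03 threshold is the example value from the lecture, and the back-off behavior is my assumption based on the Q&A that follows):

```python
import numpy as np

def entropy(probs):
    # H(p) = E[-log p(x)] = -sum_x p(x) * log p(x)
    probs = probs[probs > 0]                 # skip zero entries to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

def epsilon_filter(probs, eps=0.03):
    # drop every token whose probability is below the threshold eps, renormalize
    filtered = np.where(probs >= eps, probs, 0.0)
    if filtered.sum() == 0.0:
        # assumed back-off: if everything was dropped, keep the single best token
        filtered[np.argmax(probs)] = probs.max()
    return filtered / filtered.sum()
```

A peaked distribution gives a small entropy and a flat one a large entropy, matching the remark above.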
If your distribution is very flat, then your entropy will be very large.

[00:32:58] Yeah? [Student asks whether epsilon sampling can leave no valid options.] Oh yeah, I think there are back-off cases: in the case that there are no valid options, you probably still want to select one or two things, just as an edge case.

[00:33:20] Okay, cool. Moving on: another hyperparameter we can tune to affect decoding is the temperature. Recall that previously, at each time step, we asked the model to compute a score and then renormalized that score using softmax to get a probability distribution. One thing we can adjust here is to insert a temperature parameter τ to rescale the scores: we just divide all the scores by τ, and after dividing we apply softmax to get a new distribution. [00:33:55] And this
temperature adjustment is not really going to affect the monotonicity of the distribution: if word a had higher probability than word b before, then after the adjustment word a will still have higher probability than word b; but their relative difference will change.

[00:34:15] For example, if we raise the temperature τ above one, the distribution P_τ becomes more uniform, i.e., flatter, which implies more diverse output, because the distribution is more spread out across the different words in the vocabulary. On the other hand, if we lower the temperature τ below one, P_τ becomes very spiky, which means that sampling from P_τ gives less diverse output, because the probability is concentrated on only the top words. [00:34:51] In the very extreme case,
if we set τ very, very close to zero, the distribution essentially becomes a one-hot vector, with all the probability mass centered on one word; this reduces back to argmax sampling, i.e., greedy decoding.

[00:35:07] So temperature is a hyperparameter for decoding, just as k and p are for top-k and top-p; it can be tuned for both beam search and sampling algorithms, so it's orthogonal to the approaches we discussed before. Any questions so far? [00:35:28] Okay cool, temperature is so easy.

[00:35:33] Well, sampling still involves randomness: even though we try very hard at truncating the tail, sampling still has randomness. So what if we're just unlucky and decode a bad sequence from the model? One common solution is re-ranking: basically, we decode a bunch of sequences; for example, we can decode 10 candidates.
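The temperature rescaling described above can be sketched as follows (a minimal illustration, not the lecture's code):

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    # divide all the scores by tau, then apply softmax:
    #   tau > 1  -> flatter distribution, more diverse samples
    #   tau < 1  -> spikier distribution, less diverse samples
    #   tau -> 0 -> approaches a one-hot argmax (greedy decoding)
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Note that dividing by τ never reorders the tokens, so the monotonicity property mentioned above is preserved for any τ > 0.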
Whether it's 10 or 30 candidates is up to you; the choice is about balancing compute efficiency against performance. If you decode too many sequences, your performance will increase, but it's also very costly to generate that much for one example.

[00:36:13] Then, once you have a set of sampled sequences, we define a score that approximates the quality of a sequence and re-rank all the candidates by this score. The simple thing to do is to use perplexity as the scoring function, but we need to be careful: as we discussed for the extremes of perplexity, if we argmax the log probability, aiming for an extremely good perplexity, the texts actually become very repetitive. So we shouldn't
really aim for extremely low perplexity, and perplexity is, to some extent, not a perfect scoring function, because it's not robust to maximization.

[00:36:58] Alternatively, re-rankers can use a wide variety of other scoring functions: we can score text based on style, discourse coherence, entailment, factuality properties, consistency, and so on. Additionally, we can compose multiple re-rankers together. Yeah, questions?

[00:37:22] [Student]: With 10 candidates, or any number of candidates, what's the strategy usually used to generate the other candidates? Oh yeah, the idea is to sample from the model: each time you sample, you get a different output, and that's what I mean by different candidates. So if
you sample 10 times, you will very likely get 10 different outputs. Then, given these 10 different outputs from sampling, you just re-rank them and select the candidate that has the highest score.

[00:37:59] [Student asks why the outputs differ.] Oh, because we are sampling here. Yeah, for example, if you are doing top-3 sampling, then if A and B are equally probable, you might sample A or sample B with the same probability.

[00:38:14] Okay cool. Another cool thing we can do with re-ranking is composing multiple re-rankers: suppose you have a scoring function for style and a scoring function for factual consistency; you can just add those two scoring functions together to get a new scoring function, and then re-rank everything based on the new scoring
function, to get texts that are both good in style and good in factual consistency.

[00:38:42] [Student]: Do we just pick the decoding that has the highest score, or do we do some more sampling based on the score? The idea is that you just take the decoding with the highest score: you already have, say, 10 candidates, and out of those 10 you only need one, so you choose the one with the highest score.

[00:39:01] Cool, any other questions? [00:39:04] [Student]: Sorry, what is perplexity? Oh yeah, you can roughly regard perplexity as a function of log probability: it's proportional to e to the negative log probability. If a token has high perplexity, it has low probability, because you are more perplexed.
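A minimal sketch of the sample-and-rerank recipe discussed above, including composing multiple re-rankers by summing their scores (`sample_fn` and the scorers are hypothetical stand-ins for a real model and real scoring functions):

```python
def sample_and_rerank(sample_fn, scorers, n=10):
    """Decode n candidates by sampling, score each candidate with the sum of
    the scoring functions, and return the highest-scoring one.

    sample_fn() draws one sequence from the model; each scorer maps a
    sequence to a number (e.g. a style score, a factuality score)."""
    candidates = [sample_fn() for _ in range(n)]
    # composing re-rankers: the total score is just the sum of the scorers
    return max(candidates, key=lambda seq: sum(score(seq) for score in scorers))
```

The `n=10` default mirrors the "decode 10 candidates" example; in practice n trades compute cost against output quality, as noted above.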
[00:39:29] Okay, so taking a step back to summarize this decoding section: we have discussed many decoding approaches, from selecting the most probable string, to sampling, to the various truncation approaches that improve sampling, like top-p, top-k, epsilon, and typical decoding, and finally to re-ranking the results. Decoding is still a really essential problem in NLG, and there is still lots of work to be done here; especially now that ChatGPT is so powerful, we should all go study decoding. It would be interesting if you want to do final projects on this. Also, different decoding algorithms let us inject different inductive biases into the text we are trying to generate, and some of the most impactful advances in NLG in the last couple of years actually come from simple but effective decoding algorithms; for example, the nucleus sampling paper is very highly cited.

[00:40:31] So, moving on to talk about training NLG
models. [00:40:36] We have seen this example before in the decoding slides, and I'm showing it again because, even though we can mitigate this repetition problem by sampling instead of searching, it is still concerning from a language modeling perspective that your model would put so much probability on such repetitive and degenerate text. So we asked: is repetition due to how language models are trained?

[00:41:04] You have also seen this plot before, which shows the decaying pattern, or self-amplification effect. We can conclude from this observation that models trained via the MLE objective have a really bad mode of the distribution (by "mode of the distribution" I mean the argmax of the distribution): basically, they assign high probability to terrible strings, and this is definitely problematic
from a modeling perspective.

[00:41:30] So why is this the case? Shouldn't MLE be the gold standard in machine learning in general, not just machine translation? The answer here is: not really, especially for text, because MLE has a problem for sequential data, and we call this problem exposure bias.

[00:41:50] Training with teacher forcing leads to exposure bias at generation time: during training, our model's inputs are gold context tokens from real, human-generated text, denoted y*_{<t} here; but at generation time, the model's inputs become the previously decoded tokens from the model, ŷ_{<t}. Suppose our model makes minor errors: then ŷ_{<t} will be much worse in quality than y*_{<t}, and this
discrepancy is terrible, because it causes a mismatch between training and test time, which hurts model performance; we call this problem exposure bias.

[00:42:35] People have proposed many solutions to address this exposure bias problem. One is scheduled sampling: with probability p we decode a token and feed it back in as context to train the model, and with probability 1 − p we use the gold token as context. Throughout training we increase p, gradually warming it up to prepare the model for test-time generation. This leads to improvements in practice, because by using this probability p we are gradually narrowing the discrepancy between training and test time; but the objective is quite strange, and training
can be very unstable.

[00:43:23] Another idea is dataset aggregation; the method is called DAgger. Essentially, at various intervals during training, we generate sequences of text from the current model and add those sequences to the training data; we keep doing this training-data augmentation to make the training distribution and the generation distribution closer together.

[00:43:49] So both approaches, scheduled sampling and dataset aggregation, are ways to narrow the discrepancy between training and test. Yes, question? [00:44:00] [Student asks what "gold" means.] It just means human text: when training a language model you will see lots of corpora that are human-written; gold is just human. [00:44:13] Okay, cool.
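The scheduled sampling procedure described above can be sketched roughly as follows (illustrative only; `decode_fn` is a hypothetical stand-in for one decoding step of the model, and in real training the mixing happens inside the training loop while p is warmed up from 0 toward 1):

```python
import random

def scheduled_sampling_context(gold_tokens, decode_fn, p):
    """Build the training context token by token: with probability p feed in
    the model's own decoded token, with probability 1 - p feed the gold token."""
    context = []
    for gold in gold_tokens:
        if context and random.random() < p:
            context.append(decode_fn(context))   # model's own prediction
        else:
            context.append(gold)                 # teacher forcing (gold token)
    return context
```

With p = 0 this is pure teacher forcing; with p = 1 the model conditions almost entirely on its own predictions, as it would at test time.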
Another approach is retrieval-augmented generation: we first learn to retrieve a sequence from some existing corpus of prototypes, and then we train a model to edit the retrieved text by insertion, deletion, or swapping; we can add or remove tokens from the prototype and modify it into another sentence. This doesn't really suffer from exposure bias, because at training time we start from a high-quality prototype, and at test time you don't really have the discrepancy anymore, because you are not generating from left to right.

[00:44:53] Another approach is reinforcement learning, where the idea is to cast the generation problem as a Markov decision process. There is the state s, which is the model's representation of the preceding context; the action a, which is the next token we are trying to pick; the policy, which is the language model (also called the decoder); and the reward r, which is
And the idea here, well, we won't go into details about reinforcement learning and how it works, but we recommend the class CS 234.
[00:45:34] So, in the reinforcement learning context: because reinforcement learning involves a reward function, that's very important, so how do we do reward estimation for text generation? Well, a really natural idea is to just use the evaluation metrics. Because you are trying to do well in terms of evaluation, why not just optimize for the evaluation metrics directly at training time? For example, in the case of machine translation we can use BLEU score as the reward function; in the case of summarization we can use ROUGE score as the reward function.
[00:46:06] But we really need to be careful about optimizing for the task as opposed to gaming the reward, because evaluation metrics are merely proxies for the generation quality.
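As a toy illustration of this idea (entirely my own sketch, not code from the lecture: the five-word vocabulary, the `unigram_f1` reward standing in for BLEU/ROUGE, and the hand-rolled gradient step are all illustrative assumptions), here is a REINFORCE-style policy-gradient loop that pushes a tiny softmax "policy" toward sequences the metric rewards:

```python
# Toy REINFORCE sketch: the reward for a sampled sequence is an evaluation
# metric (here a crude unigram F1, a stand-in for BLEU/ROUGE).
import math
import random

random.seed(0)

VOCAB = ["yes", "no", "heck", "yep", "<eos>"]
# "Policy": one logit per vocabulary item (a stand-in for a decoder's output layer).
logits = {w: 0.0 for w in VOCAB}

def sample_token(logits):
    """Sample a token from the softmax distribution; return (token, log_prob)."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for w, v in logits.items():
        p = math.exp(v) / z
        acc += p
        if r <= acc:
            return w, math.log(p)
    return w, math.log(p)  # floating-point edge case: return the last token

def generate(logits, max_len=4):
    tokens, logps = [], []
    for _ in range(max_len):
        w, lp = sample_token(logits)
        if w == "<eos>":
            break
        tokens.append(w)
        logps.append(lp)
    return tokens, sum(logps)

def unigram_f1(hyp, ref):
    """Crude lexical-overlap reward (illustrative stand-in for BLEU/ROUGE)."""
    if not hyp:
        return 0.0
    overlap = len(set(hyp) & set(ref))
    p, r = overlap / len(set(hyp)), overlap / len(set(ref))
    return 2 * p * r / (p + r) if p + r else 0.0

# REINFORCE: move logits in the direction of grad log p(sequence) * reward.
# For a softmax policy, d log p(tokens) / d logit[w] = count(w) - len(tokens) * p(w).
reference = ["heck", "yes"]
lr = 0.5
for _ in range(200):
    hyp, _ = generate(logits)
    reward = unigram_f1(hyp, reference)
    z = sum(math.exp(v) for v in logits.values())
    probs = {w: math.exp(v) / z for w, v in logits.items()}  # snapshot before update
    for w in VOCAB:
        grad = hyp.count(w) - len(hyp) * probs[w]
        logits[w] += lr * reward * grad
```

After a few hundred sampled episodes, the tokens that earn reward ("yes", "heck") accumulate probability mass, which is exactly the gaming risk the lecture warns about: the policy optimizes the proxy metric, not quality.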
[00:46:15] So suppose that you run RL and improve the BLEU score by a lot; but when you run human evaluations, humans might still think that this generated text is no better than the previous one, or even worse, even though it gives you a much better BLEU score. So we want to be careful about this case, and not game the reward.
[00:46:37] So what behaviors can we tie to a reward function? This is about reward design and reward estimation. There are so many things that we can do. We can do cross-modality consistency for image captioning; we can do sentence similarity; sentence simplicity, to make sure that we are generating simple English that is understandable; we can do formality and politeness, to make sure that, I don't know, your chatbot doesn't suddenly yell at you. And the most important one, which is really popular recently, is human preference.
So we should just build a reward model that captures human preference, and this is actually the technique behind the ChatGPT model. So the idea here is that we would ask humans to rank a bunch of generated texts based on their preference, and then we will use this preference data to learn a reward function, which will basically assign a high score to something that humans might prefer and assign a low score to something that humans wouldn't prefer.
[00:47:37] Yeah, question? [Inaudible question about whether this is more expensive.]
[00:47:43] Oh yeah, sure, I mean it is going to be very expensive, but I feel like, compared to all the cost of training models, training like 170-billion-parameter models, I feel like OpenAI and Google, well, they can afford hiring lots of humans to do human annotations and ask their preference. Yeah.
[00:48:04] [Inaudible question about how much preference data is needed.] Yeah, this is a great question.
So I think it's kind of a mystery exactly how much data you need to achieve the level of performance of ChatGPT. But roughly speaking, I feel like, whenever you try to fine-tune a model on some downstream task, and similarly here you are trying to fine-tune your model on human preference, it does need quite a lot of data, maybe on a scale of 50K to 100K. Anthropic actually released some dataset about human preference, and that's roughly the scale that they released, I think, if I remember correctly.
[00:48:36] Yeah, question? [Student:] We talked earlier about how many of the state-of-the-art language models use Transformers as their architecture; how do you apply reinforcement learning to this model?
[00:48:50] Uh, to, what do you mean, to the Transformer model? Yeah, yeah. I feel like, um, reinforcement learning is kind of a modeling tool.
I mean, it's kind of an objective that you are trying to optimize: instead of an MLE objective, now you are optimizing for an RL objective. So it's kind of orthogonal to the architecture choice. The Transformer is an architecture; you just use the Transformer to give you the next-token probability distribution, or to try to estimate the probability of a sequence, and then once you have the probability of a sequence, you pass it into the RL objective that you have. And then, suppose that you are trying to do policy gradient or something, you need to estimate the probability of that sequence, and then you just need to be able to backprop through the Transformer, which is doable.
[00:49:40] Yeah, so I think the questions about architecture and objectives are orthogonal, so even if you have an LSTM you can do it.
If you have a Transformer, you can also do it. Yep.
[00:49:51] Cool, hope I answered that question. Yeah? [Student:] Can you, for this, well, for example, can we build another Transformer to, like, calculate...? Yeah, I think that's exactly what they did. So for example, you would have GPT-3; you use GPT-3 as the generator that generates text, and you kind of have another pre-trained model, which could probably also be GPT-3, but I'm guessing here, that you fine-tune on your human preference. And then once you have a human preference model, you use the human preference model, put it into RL as the reward model, and then use the original GPT-3 as the policy model, and then you apply the RL objectives and update them, so that you will get a new model that's better at everything.
[00:50:40] Okay, cool. Um, yeah, actually, if you are very curious about ChatGPT, I would encourage you to come to the next lecture.
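The preference-based reward model described a moment ago can be sketched with a pairwise ranking loss (a Bradley-Terry-style objective; the toy bag-of-words "reward model" and the example preference pairs below are my own illustrative assumptions, not OpenAI's actual setup):

```python
# Learn a scalar reward from human preference pairs: the model should score
# the preferred response above the dispreferred one.
import math

# Preference data: (preferred response, dispreferred response).
pairs = [
    (["heck", "yes"], ["heck", "no"]),
    (["yes"], ["no"]),
    (["you", "know", "it"], ["no", "way"]),
]

vocab = sorted({w for good, bad in pairs for w in good + bad})
weights = {w: 0.0 for w in vocab}

def reward(text):
    """Toy linear reward model: sum of learned per-token weights
    (a stand-in for a fine-tuned Transformer scoring a whole response)."""
    return sum(weights[w] for w in text)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Pairwise ranking loss: -log sigmoid(reward(preferred) - reward(rejected)).
lr = 0.1
for _ in range(500):
    for good, bad in pairs:
        margin = reward(good) - reward(bad)
        g = 1.0 - sigmoid(margin)  # gradient magnitude shrinks as ranking is satisfied
        for w in good:
            weights[w] += lr * g
        for w in bad:
            weights[w] -= lr * g
```

The learned `reward` then ranks preferred text higher, and it is this learned scorer, not a fixed metric like BLEU, that RLHF plugs in as the reward for the policy model.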
In the next lecture, Jesse will talk about RLHF, which is shorthand for RL using human feedback.
[00:51:00] So, to summarize: teacher forcing is still the main algorithm for training text generation models, and exposure bias causes problems in text generation models. For example, it causes models to lose coherence, and causes models to be repetitive. Models must learn to recover from their own bad samples, by using techniques like scheduled sampling or DAgger. Another approach to reduce exposure bias is to start with good text, like retrieval-plus-generation. And we also discussed how to do training with RL, and this can actually make models learn behaviors that are preferred by humans, or preferred by some metrics.
[00:51:42] So, to be very up to date: in the best language model nowadays, ChatGPT, the training is actually pipelined.
For example, we would first pre-train a large language model on an internet corpus by self-supervision, and this gets you, uh, sorry, GPT-3, which is the original version. And then you would do some sort of instruction tuning: fine-tune the pre-trained language model so that it learns roughly how to follow human instructions. And finally we do RLHF, to make sure that these models are well aligned with human preference. If we started RLHF from scratch, it would probably be very hard for the model to converge, because RL is hard to train for text data, etc. So RL doesn't really work from scratch, but with all these smart tricks about pre-training and instruction tuning, suddenly they're off to a good start.
[00:52:39] Cool, any questions so far? Okay. Oh yeah?
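The three-stage pipeline just described can be outlined as follows (purely illustrative scaffolding: the function names and dictionary fields are placeholders I made up, and no real training happens here; the point is only the ordering of the stages):

```python
# Illustrative outline of the pipeline: pretrain -> instruction-tune -> RLHF.
def pretrain(corpus):
    """Stage 1: self-supervised next-token pretraining on internet text."""
    return {"stages": ["pretrain"], "data": corpus}

def instruction_tune(model, instruction_data):
    """Stage 2: fine-tune so the model roughly follows human instructions."""
    model["stages"].append("instruction_tune")
    return model

def rlhf(model, preference_data):
    """Stage 3: RL against a reward model learned from human preferences."""
    model["stages"].append("rlhf")
    return model

model = pretrain("internet_corpus")                   # ~ the original GPT-3
model = instruction_tune(model, "instruction_pairs")  # follows instructions
model = rlhf(model, "human_preference_rankings")      # aligned with preferences
```

As the lecture notes, running the last stage without the first two would leave RL to converge from scratch on text data, which in practice does not work.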
[00:52:52] [Inaudible question about the difference between DAgger and scheduled sampling.]
[00:52:54] Uh, you mean the difference between DAgger and scheduled sampling is how long the sequences are? Yeah, I think roughly that is it, because for DAgger you are trying to put in full generated sequences. But I feel like there can be variations of DAgger; DAgger is just a high-level framework and idea, and there can be variations of DAgger that are very similar to scheduled sampling, I think. I feel like scheduled sampling is kind of a smoother version of DAgger, because for DAgger you basically have to say: for this epoch I am generating something, and then after this epoch finishes I put this into the data, and then train for another epoch; whereas DAgger seems to be more flexible in terms of when you add data in.
[00:53:39] Yes? [Student question, partly inaudible, about how feeding the model's own generated sequences back into training actually helps the model.]
[00:53:48] Um, I think that's a good question. I feel like if you regress the model, for example, if you regress the model on its own output, well, I think there should be smarter ways than to exactly regress on your own output. For example, you might still consult some good reference data. Say you ask the model to generate something: say the model generates five tokens, and then, instead of using the model's own generation as the sixth token, you would probably try to find some examples in the training data that would be good continuations, and then you try to plug that in, by connecting the model's generation and some gold text. And therefore you are able to kind of correct the model, even though it probably went off path a little bit by generating its own stuff.
So it's kind of like letting the model learn how to correct itself. But yes, I think you are right: if you just put model generations in the data, it shouldn't really work.
[00:54:53] Yeah, any other questions? Cool. Um, moving on.
[00:55:05] Yes, um, so now we'll talk about how we are going to evaluate NLG systems. There are three types of methods for evaluation: there are content overlap metrics, there are model-based metrics, and there are human evaluations.
[00:55:20] So first, content overlap metrics compute a score based on lexical similarities between the generated text and the gold reference text. The advantage of this approach is that it's very fast and efficient and widely used; for example, BLEU score is very popular in MT, and ROUGE score is very popular in summarization.
So these methods are very popular because they are cheap and easy to run, but they are not really the ideal metrics. For example, simply relying on lexical overlap might miss some rephrasings that have the same semantic meaning, or it might reward text that has a large portion of lexical overlap but actually has the opposite meaning. So you have lots of both false positive and false negative problems.
[00:56:07] So despite all these disadvantages, these metrics are still the go-to evaluation standard in machine translation. Part of the reason is that MT is actually super close-ended; it's very non-open-ended, and therefore it is probably still fine to use BLEU score to measure machine translation.
[00:56:27] And they get progressively worse for tasks that are more open-ended. For example, they get worse for summarization, because as the output text gets longer, the output text becomes much harder to measure.
They are much worse for dialogue, which is more open-ended, and then they are much, much worse for story generation, which is also open-ended. And the drawback here, with the n-gram metrics, is this: suppose that you are generating a story that's relatively long; then, if you are still looking at word overlap, you might actually get very high n-gram scores, not because your text is accurate or of high quality, but just because you are talking so much that you might have covered lots of the points already.
[00:57:14] [Inaudible student comment.] Yes, exactly, that's kind of the next thing that I will talk about, as a better metric for evaluation. But for now, let's do a case study of a failure mode for, uh, BLEU score, for example. So suppose that Chris asks a question: 'Are you enjoying the CS224N lectures?'
The correct answer, of course, is 'Heck yes!'. So if one of the answers is 'Yes', it will get a score of 0.61, because it has some lexical overlap with the correct answer. If you answer 'You know it!', then it gets a relatively lower score, because it doesn't really have any lexical overlap, except for the exclamation mark. And if you answer 'Yep', this is semantically correct, but it actually gets a zero score, because there is no lexical overlap between the gold answer and the generation. If you answer 'Heck no', this should be wrong, but because it has lots of lexical overlap with the correct answer, it actually gets a high score. So these two cases are the major failure modes of lexical n-gram overlap metrics: you get false negatives and false positives.
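These failure modes are easy to reproduce with a toy unigram-F1 score (an illustrative stand-in, not the actual BLEU formula, which uses clipped n-gram precision and a brevity penalty):

```python
# Toy unigram-overlap score illustrating the false-negative and
# false-positive failure modes of lexical overlap metrics.
def unigram_f1(hyp, ref):
    hyp_set, ref_set = set(hyp), set(ref)
    overlap = len(hyp_set & ref_set)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp_set), overlap / len(ref_set)
    return 2 * p * r / (p + r)

gold = ["heck", "yes"]

print(unigram_f1(["yes"], gold))        # partial overlap -> partial credit
print(unigram_f1(["yep"], gold))        # false negative: correct meaning, score 0.0
print(unigram_f1(["heck", "no"], gold)) # false positive: wrong meaning, high score
```

"Yep" is a perfectly good answer but shares no tokens with the gold reference, while the wrong answer "Heck no" is rewarded for sharing "heck".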
[00:58:34] So, moving beyond these failure modes of lexical-based metrics, the next step is to check for semantic similarity, and model-based metrics are better at capturing semantic similarity. So this is kind of similar to what was raised a couple of minutes ago: we can actually use learned representations of words and sentences to compute semantic similarity between generated and reference text.
[00:58:58] So now we are no longer bottlenecked by n-grams; instead we are using embeddings. These embeddings are going to be pre-trained, but the metrics can still live on, because we can just swap in different pre-trained embeddings and keep the fixed metrics.
[00:59:11] So here are some good examples of metrics that could be used. One is to do vector similarity. This is very similar to homework one, where you are trying to compute similarity between words, except now we're trying to compute similarity between sentences.
There are some ideas for how to go from word similarity to sentence similarity. For example, you can just average the embeddings, which is a relatively naive idea, but it works sometimes.
[00:59:39] Another high-level idea is that we can measure word mover's distance. The idea here is that we can use optimal transport to align the source and target word embeddings. Suppose that your source sentence is 'Obama speaks to the media in Illinois' and the target is 'The president greets the press in Chicago'. From a human evaluation perspective these two are actually very similar, but they are not exactly aligned word by word, so we need to figure out how to optimally align word to word, like aligning 'Obama' to 'president' and 'Chicago' to 'Illinois', and then we can compute the pairwise word embedding differences and get a good score for the sentence similarity.
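The "just average the embeddings" idea can be sketched in a few lines (the tiny 3-dimensional vectors below are hand-made stand-ins for real pre-trained embeddings, and the unrelated "economy" sentence is my own contrast example):

```python
# Sentence similarity by averaging word embeddings and taking cosine similarity.
import math

emb = {
    "obama":     [0.9, 0.1, 0.0],
    "president": [0.8, 0.2, 0.0],
    "media":     [0.1, 0.9, 0.1],
    "press":     [0.2, 0.8, 0.1],
    "illinois":  [0.0, 0.2, 0.9],
    "chicago":   [0.1, 0.1, 0.9],
    "economy":   [-0.9, 0.3, -0.2],  # unrelated topic for contrast
}

def sentence_vec(words):
    """Average the word vectors, ignoring words we have no embedding for."""
    vecs = [emb[w] for w in words if w in emb]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

src = sentence_vec("obama speaks to the media in illinois".split())
tgt = sentence_vec("the president greets the press in chicago".split())
other = sentence_vec(["economy"])

# The paraphrase pair should score much higher than an unrelated sentence.
print(cosine(src, tgt), cosine(src, other))
```

Note that averaging throws away word order and alignment, which is exactly why the alignment-based ideas (word mover's distance, BERTScore) were developed.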
[01:00:27] And finally there is BERTScore, which is also a very popular metric for semantic similarity. So it first computes pairwise cosine distances using BERT embeddings, then it finds an optimal alignment between the source and target sentences, and then it finally computes a score. I feel like these details are not really that important, but the high-level idea is super important: we can now use word embeddings to compute sentence similarity, by doing some sort of smart alignment, and then transform from word similarity to sentence similarity.
[01:01:02] To move beyond word embeddings, we can also use sentence embeddings to compute sentence similarity. Typically this doesn't have the very comprehensive word-by-word alignment problem,
[01:01:18] And similarly there is BLEURT, which is slightly different: it is a regression model based on BERT. The model is trained, as a regression problem, to return a score that indicates how good the text is in terms of grammaticality and similarity in meaning with the reference text. So this treats evaluation as a regression problem.

[01:01:40] Any questions so far? Okay, cool, we can move on.

[01:01:50] So all the previously mentioned approaches evaluate semantic similarity, so they can be applied to non-open-ended generation tasks. But what about open-ended settings? Here, enforcing semantic similarity seems wrong, because a story can be perfectly fluent and perfectly high quality without having to resemble any of the reference stories. So one idea here is:
[01:02:14] Maybe we want to evaluate open-ended text generation using the MAUVE score. MAUVE computes an information divergence, in a contextual embedding space, between the generated text and the gold reference text.

[01:02:26] Here is roughly what's going on. Suppose that you have a batch of text from the gold reference that is human-written, and you have a batch of text that is generated by your model. Step number one is to embed this text, to put it into some continuous representation space, which is the figure on the left. But it's really hard to compute any distance metric in this continuous embedding space, because different sentences might actually lie very far away from each other. So the idea here is to run k-means clustering to discretize the continuous space into some discrete space.
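A minimal sketch of this discretize-then-compare recipe: assume the k-means step has already assigned each text's embedding to one of k clusters (the cluster ids below are made up), then build smoothed histograms and take KL divergences in both directions. Real MAUVE summarizes a whole frontier of divergences rather than just these two endpoints:

```python
import math
from collections import Counter

def histogram(cluster_ids, k, eps=1e-6):
    """Smoothed relative frequencies over k clusters (eps avoids log(0))."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids) + eps * k
    return [(counts.get(i, 0) + eps) / total for i in range(k)]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical k-means cluster ids for each text in the two batches.
human_ids = [0, 0, 1, 2, 2, 2]   # gold, human-written texts
model_ids = [0, 1, 1, 1, 2, 2]   # machine-generated texts
P = histogram(human_ids, k=3)
Q = histogram(model_ids, k=3)

# One KL direction penalizes model mass off the human distribution
# (precision-like); the other penalizes missed human modes (recall-like).
forward_kl = kl(P, Q)
backward_kl = kl(Q, P)
```

Both divergences are zero only when the two discretized distributions match, which is why the score has the probabilistic precision/recall reading described next.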
[01:03:03] Now, after the discretization, we can actually build a histogram for the gold, human-written text and a histogram for the machine-generated text, and then we can compute precision and recall using these two discretized distributions: precision via the forward KL and recall via the backward KL. Yes, question?

[01:03:24] Why do we want to discretize it? So, why do we want to discretize: maybe it's equivalent to answering why it is hard to work with the continuous space. The idea is, if you embed one sentence into the continuous space, say it lies here, and you embed another sentence and it lies there, and you only have a finite number of sentences, then they would basically be Dirac delta distributions on your manifold, right?
[01:03:53] So it's hard to work with: you probably want a smoother distribution, but it's hard to define what a good smooth distribution is in the case of text embeddings, because they're not super interpretable. So eventually, if you embed everything in a continuous space, you will have lots of Dirac deltas that are just very tall and not really connected to their neighbors, so it's hard to quantify a divergence or a distance metric in that space.

[01:04:21] Well, you would have to make some assumptions. For example, you could make a Gaussian assumption, where you smooth all the embeddings by convolving with a Gaussian, and then you can start getting some meaningful distance metrics. But from the embeddings alone you're not going to get meaningful distance metrics, and it
doesn't really make sense to smooth things using a Gaussian anyway, because who says word representations are Gaussian-distributed? Yeah?

[01:04:51] (Inaudible student question about the cluster plots.) I think this requires some Gaussian smoothing, yeah, I think the plot is made with some smoothing. I mean, I didn't make those clouds, so I can't be perfectly sure, but the fact that it looks like this means it was smoothed a little bit. These are sentence embeddings, or concatenated word embeddings, because you are comparing sentences to sentences, not words to words.

[01:05:14] Yeah, so the advantage of the MAUVE score is that it is applicable to open-ended settings, because you are now measuring precision and recall with regard to the target distribution. Cool, so it has a better probabilistic interpretation than all the previous similarity metrics.

[01:05:36] Cool, any other questions? Yes? How is that different from just trying to maximize the similarity between them?
[01:05:52] Oh yeah, that's a good question. Well, this is because it's a case where it's really hard to get exactly the same thing. For example, I would say (maybe, because I've never tried this myself) that if you run a similarity metric on a machine translation task you might get a very high score, but if you run it on open-ended text generation you will get a super low score. It's just not really measurable, because everything is so different from everything else. So I feel like MAUVE is kind of a middle ground, where you are trying to evaluate things that are actually very far away from each other, but you still want a meaningful measurement.

[01:06:27] Yeah, of course, if your source and target are exactly the same, or differ only up to some rephrasings, you will get the best MAUVE score, but maybe that's not really what you're looking for.
[01:06:42] Because given the current situation, you only have generations that are very far away from the gold text, and the question is how we evaluate that type of thing.

[01:06:47] Yes, question in the back? I'm still trying to understand the MAUVE score; is it possible to write out the math, even in just a simple pseudo form? Yeah, I think it's possible; maybe we can continue this discussion after class, because I kind of want to finish my slides, but I'm happy to chat after class. There is a paper about it if you search for MAUVE score; I think it also won a best paper award at ICML or NeurIPS.

[01:07:16] Okay, so moving on. I've pointed out that there are so many evaluation methods, so let's take a step back and think about what makes a good metric: how do we evaluate evaluations? Nowadays the gold standard is still to check how well the metric is aligned with human judgment.
[01:07:35] So if a metric matches human preference, in other words if the metric correlates very strongly with human judgment, then we say that the metric is a good metric. In this plot, people have plotted the BLEU score and the human score on the y and x axes respectively, and because we don't see a strong correlation, this kind of suggests that BLEU is not a very good metric.

[01:08:01] So actually, the gold standard for evaluating language models is always to do human evaluation. Automatic metrics fall short of matching human decisions, and human evaluation is the most important criterion for evaluating text generated by a model. It is also the gold standard when developing automatic metrics, because we want everything to match human evaluation.
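This meta-evaluation step, checking a metric against human judgment, is just a correlation computation. A sketch with hypothetical scores for five system outputs:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical scores for five system outputs: human ratings vs. an
# automatic metric. A good metric should correlate strongly with humans.
human_scores  = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.2, 0.1, 0.5, 0.4, 0.9]
r = pearson(metric_scores, human_scores)
```

In practice people often report Spearman or Kendall rank correlation as well, since what usually matters is whether the metric ranks systems the same way humans do.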
[01:08:32] So what do we mean by human evaluation, and how is it conducted? Typically we provide human annotators with some axes that we care about: fluency and coherence for open-ended text generation, say; factuality for summarization; the style of the writing; and common sense, for example if you're trying to write a children's story.

[01:08:55] Another thing to note: please don't compare human evaluations across different papers or different studies, because human evaluations tend not to be well calibrated and are not really reproducible.

[01:09:07] Even though we believe that human evaluations are the gold standard, there are still many drawbacks. For example, human evaluations are really slow and expensive. And even beyond the slowness and the expense, they are still not perfect.
[01:09:27] First, the results may be inconsistent and not very reproducible: if you ask the same human whether they prefer A or B, they might say A the first time and B the second time. Human evaluations are also typically not really logical, and sometimes the human annotators might misinterpret your question. Suppose that you want them to measure the coherence of the text: different people have different criteria for coherence. Some people might think coherence is equivalent to fluency, and then they look for grammaticality errors; some people might think coherence means how well your continuation is aligned with the prompt or the topic. So there are all sorts of misunderstandings that might make human evaluation very hard.

[01:10:08] And finally, human evaluation only measures precision, not recall.
[01:10:17] This means that you can give a sentence to a human and ask the human how they like the sentence, but you can't ask the human whether the model is able to generate all possible sentences that are good. So it's only a precision-based metric, not a recall-based metric.

[01:10:29] So here are two approaches that try to combine human evaluation with modeling. The first idea is basically to learn a metric from human judgments: use human judgment data as training data and train a model to simulate human judgment. The second approach is to ask the human and the model to collaborate, so that the human is in charge of evaluating precision, whereas the model is in charge of evaluating recall.

[01:11:02] We have also tried approaches for evaluating models interactively.
[01:11:11] In this case we not only care about the output quality; we also care about how the person feels when they interact with the model, when they try to be a co-author with the model, how the person feels about the writing process, etc. This is called evaluating the models more interactively.

[01:11:29] So the takeaways here: content overlap is a bad metric. Model-based, semantic metrics are better, because they focus more on semantics, but they're still not good enough. Human judgment is the gold standard, but it's hard to do a human study well. And in many cases (this is a hint for the final project) the best judge of the output quality is actually you. So if you want to do a final project in natural language generation, you should look at the model output yourself, and don't just rely on the numbers reported by BLEU or ROUGE or something.

[01:12:05] Cool.
[01:12:08] So finally, we will discuss ethical considerations of natural language generation.

[01:12:12] As language models get better and better, ethical considerations become much more pressing, so we want to ensure that the models are well aligned with human values. For example, we want to make sure the models are not harmful and not toxic, and we want to make sure that the models are unbiased and fair to all demographic groups. So for example, we don't want the model to generate any harmful content. I tried to prompt ChatGPT, asking "can you write me some toxic content?", and ChatGPT politely refused me, which I'm quite happy about. But there are other people who try to jailbreak ChatGPT. I think internally they probably implement some detection tools, so that if you try to prompt it adversarially, it will avoid doing adversarial things.
[01:13:03] But there are many very complicated ways to prompt ChatGPT so that you can get over the firewall and still use its ability to generate some, I don't know, bad English.

[01:13:20] Another problem with these large language models is that they are not necessarily truthful. For example, there was the very famous news that Google's model generated a factual error, which is quite disappointing, but the way the model talks about it is very convincing, so you wouldn't really know that it's a factual error unless you go and check that this is not actually the first picture, or something. So we want to avoid this type of problem.

[01:13:53] The models have actually already been trying very hard to refrain from generating harmful content.
[01:14:03] But for models that are more open-source and smaller, the same problem still appears, and typically, when we do our final projects or work with models, we are probably going to deal with much smaller models, so we need to think about ways to deal with these problems better.

[01:14:17] Text generation models are often built from pre-trained language models, and pre-trained language models are trained on internet data, which contains lots of harmful and biased content. So when the models are prompted for this information, they will just repeat the negative stereotypes that they learned from the internet training data. One way to avoid this is to do extensive data cleaning, so that the pre-training data does not contain any biased or stereotypical content. However, this is going to be very labor-intensive and almost impossible to do.
[01:14:51] Filtering that large an amount of internet data is just so costly that it's not really possible.

[01:14:58] Again, for existing language models like GPT-2 Medium, there are some adversarial inputs that almost always trigger toxic content, and these models might be exploited in the real world by ill-intentioned people. For example, there's a paper about universal adversarial triggers, where the authors find a universal set of words that triggers toxic content from the model.

[01:15:28] And sometimes, even if you don't try to trigger the model, the model might still start to generate toxic content by itself: the pre-trained language models are prompted with very innocuous prompts, but they still degenerate into toxic content.
[01:15:50] So the takeaway here is that models really shouldn't be deployed without proper safeguards to control for toxic content, or any harmful content in general, and models should not be deployed without careful consideration of how users will interact with them.

[01:16:02] So in this ethics section, one major takeaway is that we are trying to advocate that you think more about the model that you are building. Before deploying or publishing any NLG model, please check that the model's output is not harmful, and please check that the model is robust to trigger words and other adversarial prompts. And of course there is more; basically, one can never do enough to improve the ethics of text generation systems.

[01:16:34] Okay, cool, I still have three minutes left, so I can still do concluding thoughts.
exciting applications of natural language generation systems [01:16:44] language generation systems um so but well one might think that [01:16:47] um so but well one might think that while given that chaiji 50 is already so [01:16:49] while given that chaiji 50 is already so good are there any other things that we [01:16:51] good are there any other things that we can do research-wise if you try [01:16:53] can do research-wise if you try interacting with these models [01:16:55] interacting with these models um if you try to interact with these [01:16:56] um if you try to interact with these models actually you can see that there [01:16:58] models actually you can see that there are still lots of limitations in their [01:17:00] are still lots of limitations in their skills and performance for example check [01:17:02] skills and performance for example check GPT is able to like do a lot of things [01:17:05] GPT is able to like do a lot of things with manipulating text but it couldn't [01:17:08] with manipulating text but it couldn't really create like interesting contents [01:17:10] really create like interesting contents or I couldn't really think deeply about [01:17:12] or I couldn't really think deeply about stuff so it's still also so there are [01:17:14] stuff so it's still also so there are lots of headrooms and there are still [01:17:16] lots of headrooms and there are still many improvements ahead [01:17:18] many improvements ahead and evaluation remains a really huge [01:17:21] and evaluation remains a really huge challenge in natural language Generation [01:17:23] challenge in natural language Generation Um basically we need better ways to [01:17:25] Um basically we need better ways to automatically evaluate performance of [01:17:27] automatically evaluate performance of nlg models because human evaluations are [01:17:29] nlg models because human evaluations are expensive and not reproducible so it's [01:17:33] expensive and not reproducible so it's better 
to figure out ways to compile all those human judgments into a very reliable and trustworthy model. [01:17:41] And also, with the advance of all these large-scale language models, doing neural natural language generation has been reset, and it's never been easier to jump into this space, because now all the tools are already there for you to build upon. And finally, it is one of the most exciting and fun areas of NLP to work on. So yeah, I'm happy to chat more about NLG if you have any questions after class, and in class, I guess, in the one minute left. [01:18:10] Okay, cool, that's everything. So do you have any questions? If you don't, we can end the class. ================================================================================ LECTURE 011 ================================================================================ Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 10 - Post-training by Archit Sharma Source: https://www.youtube.com/watch?v=35X6zlhoCy4 --- Transcript [00:00:05] Good evening, people. Um, how are you guys
doing? [00:00:13] All right, my name is Archit Sharma. I'm a PhD student at Stanford, and I'm very, very excited to talk about post-training, generally speaking, for large language models. And I hope you guys are ready to learn some stuff, because the last few years in machine learning have been very, very exciting, uh, with the advent of large language models, ChatGPT, and everything to that extent. And hopefully after today's lecture you will be more comfortable understanding how we go from pre-trained models to models like ChatGPT, and we'll take a whole journey through prompting, instruction fine-tuning, and DPO and RLHF. So let's get started. [00:00:53] All right, so something that has been very fundamental to our entire field is this idea of scaling laws: models are increasingly becoming larger and larger, and they're expending more and more compute. So this is a graph of models
starting all the way back in the 1950s to somewhere around now; this is still an outdated graph, so it shows up to 10^24 FLOPs, or floating point operations, that go into pre-training these models, but the number is well above 10^26 now. But you can see the graph and the way it's trending. [00:01:28] And more and more compute requires more and more data, because you need to train on something meaningful, and this is roughly the trend in the amount of language tokens going into language models in pre-training. And again, this plot is outdated. We're in 2024; in 2022 we were at 1.4 trillion tokens, or words, roughly speaking, in language model pre-training. Does anyone want to guess where we are in 2024? [00:02:00] That's a pretty good guess, yeah. So we're close to 15 trillion tokens; um, the recent Llama 3 models were roughly trained on 15 trillion tokens. So
yeah, just for a second, appreciate that these are a lot of words. I don't think any of us listens to trillions of tokens in our lifetime. So this is where we are right now, and I hope you guys were here for the pre-training lectures. Cool. [00:02:30] Um, so what do we do? So, broadly speaking, we are really just learning to predict text tokens, or language tokens, but what do we learn in the process of pre-training? Why are people spending so much money and so much compute? Because this compute and these tokens cost dollars, and we're on the order of spending hundreds of millions of dollars on these runs. So why are we doing this? And this is basically a recap of whatever you have probably learned till now, but we're learning things like, oh, we are learning knowledge: Stanford University is located in Santa Clara, California, or
wherever you want to say. You're learning syntax, you're learning semantics of the sentences; these are things that you would expect to learn when you're training on language data. Broadly, you're probably learning a lot about different languages as well, so depending on your text data distribution, you're learning a lot of things. But the models we interact with are very intelligent, so where is that coming from? I mean, you're simply learning very factual things, and it's a very simple loss function we're optimizing, so where is that intelligence coming from? [00:03:35] And this, perhaps, is the interesting bit. Recently, people have started accumulating evidence that when you optimize the next-token prediction loss, you're not just learning about syntax, you're not just learning knowledge, but you're starting to form models of agents' beliefs and actions as well.
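The "very simple loss function" being optimized is next-token cross-entropy. A minimal sketch, with a made-up five-word vocabulary and toy logits rather than any real model's numbers:

```python
import math

def next_token_nll(logits, target_id):
    """Negative log-likelihood of the target token under a softmax over
    the model's output logits: the per-position pre-training loss."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[target_id] - log_z)

# Toy vocabulary: ["Stanford", "is", "located", "in", "California"].
# Suppose the model, after "Stanford University is located in", emits:
logits = [0.1, 0.2, 0.3, 0.5, 3.0]
loss = next_token_nll(logits, target_id=4)  # target word: "California"
# A confident, correct model gives the target high probability, so the
# loss is small; a wrong guess would be penalized heavily.
```

Pre-training just minimizes the average of this quantity over trillions of token positions; everything discussed here (facts, syntax, models of agents) has to fall out of that one objective.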
So how do we know this? Again, a lot of this is speculative evidence, but it's a way to form an understanding that the losses we're optimizing are not just about fitting the data; you start learning something maybe more meaningful as well. [00:04:08] Um, for example, in this specific case, we change the last sentence, and the prediction of the next text changes as well. So here it starts with: Pat watches a demonstration of a bowling ball and a leaf being dropped at the same time. Pat, who is a physicist, predicts that the bowling ball and the leaf will land at the same rate. We all know gravity, the way it works. But when you change the last sentence to: Pat, who has never seen this demonstration before, then Pat predicts that the bowling ball will fall to the ground first. Maybe
somebody who's never seen this experiment before might intuitively believe that, correct? So the language model was able to predict this. And how do you predict this? You have to have some notion of understanding of how humans work to even be able to predict this, and that's maybe something that is not obvious when you're simply optimizing to predict the text. [00:05:03] Similarly, we're going to run through some examples to communicate that when you're pre-training these models, you're learning much more than just language tokens and so on. You're also learning about math: you're able to understand what the graph of a circle means, what the center is, and how to understand equations. [00:05:22] Probably my favorite example, something I use pretty much every day, is that you're learning how to write code. So I
don't know how many of you have interacted with Copilot before, but if you have, you probably know that if you write down a few comments, write down a function template, it will automatically complete the code for you. So again, it's not perfect, but it has to have some deeper understanding of what your intent is for something like that to emerge. [00:05:48] And similarly, we have examples from medicine as well. I don't know about you guys, but whenever I have some issue, I probably go to ChatGPT or Claude or something to that effect and ask them for a diagnosis. [00:06:00] Um, I don't recommend that; uh, please don't take medical advice from me. But yeah, so broadly, the way we're seeing language models at this point is that they're sort of emerging as these general-purpose multitask assistants, and it's very strange, right? We started off with text token prediction, and we're
reaching the stage where we can sort of rely on them to do many, many different things. So how are we getting there? And I'm sure you all are aware of what these models are, so yeah. [00:06:30] So today's lecture is largely going to be about how we go from something like "Stanford University is located...", this very simple pre-training task (a very simple procedure; well, it's more complicated, but in abstract terms it's not very complicated), to something as powerful as ChatGPT. Cool. [00:06:48] So, um, I recommend you guys stop me and ask me a lot of questions, because there are a lot of fun examples and a lot of fun techniques, and I want you guys to learn everything here. So the overall plan is: we're going to talk about zero-shot and few-shot in-context learning; um, next we're going to follow up with instruction fine-tuning; and then we're going to talk
about optimizing for preferences, and this is roughly where things are right now in the industry. And then we're going to talk about what's next, what the limitations are, and how we move on from here. [00:07:20] Cool. So we're going to start off with zero-shot and few-shot in-context learning. Um, broadly, we're going to take the example of GPT, or the Generative Pre-trained Transformer. This is a whole series of models that started off in roughly 2018, and up to 2020 they were building GPT, GPT-2, GPT-3. So we're going to start off with this example. And yes, it's a decoder-only model that is trained on roughly 4.6 GB of text, it has 12 Transformer layers, and it's trained with the next-token prediction loss. [00:07:53] And the first model obviously was not extremely good, but it started showing that, hey, this technique for pre-training can be very effective
for general-purpose tasks, and we're going to see some examples. Um, for example, here it's able to do the task of entailment, and, okay. [00:08:16] Um, yeah, GPT-1 itself was not very strong as a model, but they took the same recipe and tried to increase the model size, so they went from 117 million parameters to about 1.5 billion parameters, and they scaled up the data alongside as well, so we went from about 4 GB of data to approximately 40 GB of data. And pre-training is a whole different melting pot of techniques, and there's a lot that goes into it, but roughly, for example, here they filtered data by the number of upvotes on the Reddit data. [00:08:48] And yeah, so this is roughly where we are, and I think one of the things that started emerging with GPT-2 is zero-shot learning. And what do we mean by zero-shot learning? [00:09:02] Um, conventionally in the field,
when we pre-train models, there was the idea that you take a few examples, you update the model, um, and then you are able to adapt to a specific task. But as you pre-train on more and more data and more and more tasks, you sort of start seeing this phenomenon where models are able to do the task basically zero-shot; they're shown no examples of how to do the task. And you can start thinking of how you can do summarization, you can follow some instructions, you can maybe do a little bit of math as well. So this is where the idea of zero-shot learning started to emerge. [00:09:38] Yeah, so how do we do zero-shot learning, or task-specific learning, from these pre-trained models? Really, the idea is that we have to be creative here. We know that these are text prediction models: if you put in a text, they will complete whatever follows. So if we can sort of coax these models into completing the task we care
about, maybe it's question answering, we can start getting them to solve tasks here. So for example, if you want to ask questions about Tom Brady, you sort of set it up: you put information about Tom Brady, and then you put a question that you want answered, and then it will autocomplete in some sense. So this is one early perspective on these models: they are very advanced autocomplete models. [00:10:21] And similarly, if you want to figure out which answer is true and which is not, something that is very useful to measure is log probabilities. So, for example, we want to figure out what the word "it" is referring to in this sentence: "the cat couldn't fit into the hat because it was too big." Um, what we can do is take the sentence, replace "it" with either "the cat" or "the hat," and then measure which substitution the model thinks is more probable.
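This scoring trick can be sketched end to end with a toy stand-in for the language model; the bigram probabilities below are invented for illustration, where a real system would instead sum GPT-2's per-token log-probabilities over each candidate sentence:

```python
import math

# Toy bigram "language model": P(next word | previous word). These numbers
# are made up for this sketch; a real system would use an actual LM's
# per-token log-probabilities.
BIGRAM = {
    ("the", "cat"): 0.20, ("the", "hat"): 0.05,
    ("cat", "was"): 0.30, ("hat", "was"): 0.30,
    ("was", "too"): 0.40, ("too", "big"): 0.25,
}

def sentence_logprob(words):
    """Sum log P(w_i | w_{i-1}); unseen bigrams get a small floor prob."""
    return sum(math.log(BIGRAM.get(pair, 1e-4))
               for pair in zip(words, words[1:]))

# "The cat couldn't fit into the hat because it was too big."
# Resolve "it" by substituting each candidate and scoring the result.
cat_score = sentence_logprob("the cat was too big".split())
hat_score = sentence_logprob("the hat was too big".split())
referent = "cat" if cat_score > hat_score else "hat"  # higher log-prob wins
```

Because the two substituted sentences get directly comparable log-probabilities, the comparison requires no task-specific training at all.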
And you can sort of get the idea of what the reference is. So none of this is in the training data; it's simply learning to predict text, but you can start seeing how we can leverage these models to do other tasks as well, besides prediction. [00:11:06] So this is just more evidence of how GPT-2, with no task-specific fine-tuning, no task-specific training, simply learning to predict text, establishes the state of the art on many, many different tasks, simply by scaling up the model parameters and the amount of data it's trained on. [00:11:29] So this is a fun example. If you want to do summarization, say you have a news article that you want to summarize, how do you get a zero-shot model to do it? The answer is: you put the document into the context and you simply put "TL;DR" in front of it.
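As a minimal sketch of that prompt construction (the exact string layout is an assumption; the idea is just to attach the cue to the document and read off the model's continuation):

```python
def tldr_prompt(document: str) -> str:
    """Frame summarization as plain text continuation: put the article in
    context, append the 'TL;DR:' cue, and let the model autocomplete.
    The exact layout here is an assumption for illustration."""
    return document.rstrip() + "\nTL;DR:"

prompt = tldr_prompt("A long news article about the game goes here...")
# Whatever the language model generates after `prompt` is read off as the
# zero-shot summary; no summarization-specific training is involved.
```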
Now, if most of the data on the internet is such that whenever you see TL;DR, it naturally summarizes what came before, then yeah, you can get zero-shot summarization performance here as well. And again, this is not trained to do summarization in any specific way, and it's still doing really well, simply because of its pre-training data. [00:12:07] So yeah, um, I think GPT-2 with TL;DR is somewhere there on the plot, and some of the very task-specific trained models are up there. And I think you will see the trend: again, if you were Alec Radford or somebody, and you see these cool things emerging, your next step would obviously be, I'm going to scale this up a little more, I'm going to make an even bigger model, I'm going to train it on even more data, and we'll see how things go, right? So that's how we got GPT-3. Uh, we went from 1.5 billion parameters to 175 billion parameters, and we went from 40 GB of data to 600 GB of data. Of course, now we're
in [00:12:44] gbt of data of course like now we're in like terabytes of data and text is a [00:12:47] like terabytes of data and text is a very compressed representation so like [00:12:48] very compressed representation so like terabytes of data is a [00:12:50] terabytes of data is a lot um and you know we we talked about [00:12:53] lot um and you know we we talked about zero shot learning the cool thing that [00:12:56] zero shot learning the cool thing that emerged in gbd3 is like go ahead like [00:13:00] emerged in gbd3 is like go ahead like used before the passage right no you [00:13:03] used before the passage right no you typically put the passage uh if youve [00:13:04] typically put the passage uh if youve like interacted with Reddit or something [00:13:06] like interacted with Reddit or something like that typically somebody will write [00:13:08] like that typically somebody will write an entire post and then end with TLD drr [00:13:11] an entire post and then end with TLD drr here's a summary of the thing too long [00:13:13] here's a summary of the thing too long didn't read or if you have [00:13:15] didn't read or if you have used opposite comes first oh yeah there [00:13:20] used opposite comes first oh yeah there are situations where it also comes first [00:13:21] are situations where it also comes first but um one reason is that these are like [00:13:24] but um one reason is that these are like decoder only models so like they are [00:13:27] decoder only models so like they are often these are causal attention models [00:13:28] often these are causal attention models so the typically need to see the context [00:13:30] so the typically need to see the context before yeah understand I'm just curious [00:13:33] before yeah understand I'm just curious like from my experience the comes first [00:13:36] like from my experience the comes first then how is [00:13:38] then how is it [00:13:40] it the okay um there's probably a lot of [00:13:43] the okay um 
there's probably a lot of data where the TL;DR comes first, but there's probably a lot of data where it comes after as well. [00:13:47] Cool. So we saw zero-shot learning emerging in GPT-2. Few-shot learning maybe seems slightly easier, but this is where things started getting really interesting: you start to beat the state of the art simply by putting examples in context. So what does few-shot learning mean here — what are we talking about? As I mentioned, the typical idea is that if you want to solve, say, translation, you put some examples of translation into the context — or maybe it's a correction task, or whatever task you're interested in. No gradient updates, no learning in any conventional sense whatsoever: you put a few examples in, and that's it — the model knows how to solve the task. Isn't that crazy?
[00:14:37] You guys did the assignment on translation, right? Well, this is what modern NLP looks like: you put in some examples and you have the entire system. And this is where things got really interesting for all these task-specific models that were created to be really, really good at translation or really good at summarization. Let's look at this graph. You start with the zero-shot performance, somewhere down there. You put in one example of translation from English to French and you already get to a decent level; a few examples in, you're already starting to get close to the state-of-the-art models. [Student] Wait, but in that graph the state of the art is really high, isn't it? [Instructor] The fine-tuned baseline here, I think, is the one I'm referring to — the fine-tuned model which
is trained exclusively on a lot of translation data, so it might be slightly better, yes. And I think the relevant comparison here is that in-context learning starts to emerge at scale. [00:15:45] This, I think, is the key point — some of this is contested, just to be very upfront — but there's this idea of the emergence of this property as you train with more compute and more scale. There's more recent research which suggests that if we plot the axes correctly it looks less emergent, but the general idea holds: as you increase the number of parameters and the amount of compute going into these models, the ability to go from a few examples to really strong performance is very compelling. [00:16:17] Cool. And as I explained earlier, the general idea is that this is very different from the conventional idea of fine-tuning that we
typically go for. Instead of iterating over examples and doing gradient updates, we just do few-shot prompting: we put a few examples in context, and that gives us the system. [00:16:42] [Student question about the prompt format.] Yes — the exact details can depend on the prompt template you use, but typically you would just put the examples in, like "sea otter" and its translation, and then whatever your task is, you let the model complete from there, because it can infer the task from the examples you've given. Any other questions? [00:17:10] Cool. So we have gone from zero-shot prompting, and we've seen that few-shot prompting is becoming really competitive with good models. But there are still limitations to this: you cannot solve every task you see this way.
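Mechanically, few-shot prompting is just string construction — the example pairs go into the context with no gradient updates, and the model completes the last line. A minimal sketch; the helper name and task header are illustrative, while the "sea otter → loutre de mer" pair is the demonstration used in the GPT-3 paper:

```python
# Few-shot prompting: "learning" is just concatenating labelled examples
# into the context and letting the model complete the final, unlabelled line.

def build_few_shot_prompt(examples, query, task_header="Translate English to French:"):
    """Concatenate k labelled examples, then the unlabelled query."""
    lines = [task_header]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # the model infers the task and completes this
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),  # demonstration pair from the GPT-3 paper
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt(examples, "peppermint")
print(prompt)
```

Because no weights change, the same pre-trained model can be pointed at a different task just by swapping the examples in the prompt.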
In particular, things that involve richer, multi-step reasoning can actually be pretty challenging — and to be fair, humans struggle at these tasks as well. Things like addition are probably still hard as you keep increasing the number of digits. But one thing — and I alluded to this earlier — is that you have to start being creative: you can get these models to do the task if you're creative in how you prompt them, and that's what we're going to see next. [00:17:53] So this technique called chain-of-thought prompting emerged. The idea we have explored thus far is that we put in examples of the kind of task we want done, and we expect the model to infer what the task is and go from there. The new idea is that instead of just showing what the task is, you show the model examples where it reasons
through the task, so it's not just learning to do the task but also learning how the reasoning works. In this example, we started with a simple math problem where the prompt shows exactly the final answer, directly; if you do that, you'll observe that the model gets the answer wrong. Instead, what if you show the model how to reason about the task — show it a chain of thought, and include that in the prompt as well — and then ask it a new question? The idea is that now the model is not just going to output an answer; it's going to reason about the task, and it actually does a lot better. This has been shown to be very effective. [00:18:57] And chain of thought, as you can see, is also something that improves a lot with model scale.
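The contrast between the two prompt styles can be sketched as follows: a standard demonstration shows only the final answer, while a chain-of-thought demonstration includes the intermediate reasoning. The worked "tennis balls" example is the well-known one from the chain-of-thought prompting literature; the helper functions themselves are hypothetical:

```python
# Chain-of-thought prompting: the in-context example includes intermediate
# reasoning, not just the final answer, so the model imitates the pattern.

COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def standard_prompt(new_question):
    """Answer-only demonstration: shows just the final answer."""
    return ("Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls. "
            "How many does he have now?\nA: The answer is 11.\n"
            f"Q: {new_question}\nA:")

def chain_of_thought_prompt(new_question):
    """Demonstration with reasoning: the model tends to reason before answering."""
    return COT_EXAMPLE + f"Q: {new_question}\nA:"

print(chain_of_thought_prompt(
    "A cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?"))
```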
[00:19:09] But what you can probably start seeing is that it's nearly better than the best supervised models here — the PaLM models were roughly 540 billion parameters, and simply with this chain-of-thought kind of skill you're already beating the state of the art. [00:19:25] Cool. So I showed you examples of chain-of-thought reasoning, where you go through a reasoning chain, but you can be even slightly smarter than that. You might not even need to show any examples; you just need to nudge the model into thinking about what to do next. [00:19:49] This idea emerged in the "let's think step by step" paper: instead of even showing an example, you just start the answer with "Let's think step by step," and that's it — the model will start reasoning toward the answer itself, instead of just auto-completing to an answer, and you get something
like this. [00:20:12] So maybe you don't even need to show any examples; you can induce the reasoning behavior zero-shot as well. And here is what the final numbers look like: compared to the zero-shot performance we got from essentially auto-completing, zero-shot chain of thought substantially improves performance — you go from 17.7 to 78.7. It's still worse than putting actual examples of reasoning in a few-shot chain-of-thought prompt, but you can see how much the performance improves simply by asking the model to think step by step. Maybe the lesson for interacting with these models is this: you might not get the exact desired behavior up front, but often the models are capable of the behavior you want.
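The zero-shot variant drops the worked example entirely and only appends a trigger phrase before the model's answer. A sketch — the juggler question is the canonical example from the "let's think step by step" paper, and the function name is illustrative:

```python
# Zero-shot chain-of-thought: no demonstrations at all -- the prompt just
# seeds the answer with a trigger phrase that elicits step-by-step reasoning.

TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    # The model continues after the trigger, producing its own reasoning
    # chain before (hopefully) stating the final answer.
    return f"Q: {question}\nA: {TRIGGER}"

print(zero_shot_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"))
```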
[00:21:05] Often you have to think about how to induce that behavior, and the right way to think about it is perhaps: what is the pre-training data — what data on the internet might the model have seen that induces behavior similar to the kind I want — and then use that to elicit those behaviors from the model. [00:21:24] And, you know, we hand-designed some of these prompts, but you can also get an LLM to design the prompts; there are recursive self-improvement ideas here, and that can bump up the performance a little bit more. [00:21:42] Cool. So what we have seen so far is that as models get stronger and stronger, you can get them to do your task zero-shot or with a few examples, and you can nudge them into inferring what task you want them to solve. But the downside is that there's only so
much you can fit into context. That might not be very true anymore — models are getting increasingly large contexts — but it's still somewhat unsatisfactory that you have to trick the model into doing your task, rather than it just doing the task you want. And going forward, you probably still want to fine-tune these models for more and more complex tasks. That's where we're going next. [00:22:28] The next section covers instruction fine-tuning. The general idea: as we discussed, pre-training is not about assisting users — it is about predicting the next token. You can trick the model into assisting users and following your instructions, but in general that's not what it was pre-trained for. Here's an example: if you ask GPT-3, a pretty strong model, to explain
the moon landing to a six-year-old in a few sentences, it will follow up with more questions about what a six-year-old might want — which is not what you wanted the model to do. [00:23:07] The general term people use these days is that these models are not aligned with user intent, and the next sections are going to talk about how to align them with user intent, so that you don't have to trick the model into doing whatever you want it to do. [00:23:20] And this is the kind of desired completion we want at the end of instruction tuning. So how do we get from those pre-trained models to models that can respond to user intent? [00:23:34] I hope the general idea of pre-training and fine-tuning was covered somewhere in the class. What you have probably seen thus far is that you pre-train on a lot of
different language data, but then you fine-tune on your specific task: you take the same decoder-only model and fine-tune it for some task with a very small amount of data. The thing that's different now is that we're no longer fine-tuning on a little data for one task — we're going to fine-tune on many, many different tasks, and try to fold them into a single usable UX for users. This is where instruction fine-tuning comes in. [00:24:19] Cool. The recipe is not very complicated: we're going to collect a lot of examples of instruction-output pairs, where the instructions range over tasks of several different forms — question answering, summarization, translation, code, reasoning, and so on — and we collect a lot of examples related to all
those tasks. The idea is that we train on the instruction-output pairs exactly as given, and then evaluate on some unseen tasks as well. That's the general paradigm of instruction fine-tuning. [00:24:57] And again, it's the same idea we explored in pre-training: data plus scale is really important. These days you start off with one task and extend it over thousands and thousands of tasks, with three-million-plus examples; that's the broad range of tasks you might see in instruction fine-tuning datasets. [00:25:16] You might even ask why we're still calling it fine-tuning anymore — it's almost starting to look like pre-training — but these are just terms, so you can decide whatever you're comfortable with. [00:25:30] So we get this huge instruction dataset and we fine-tune the model on it.
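Concretely, the training examples are just serialized instruction-output pairs drawn from many tasks, and fine-tuning minimizes the usual next-token loss on them (often only on the response tokens). A sketch with an illustrative template — the field names and the three example pairs are made up for illustration, not from any specific dataset:

```python
# Instruction fine-tuning data: (instruction, output) pairs from many
# different tasks, serialized into one training string per example.

def format_example(instruction, output):
    # Illustrative template; real datasets use many varied templates.
    return f"Instruction: {instruction}\nResponse: {output}"

dataset = [
    ("Translate to French: cheese", "fromage"),
    ("Summarize: The quick brown fox jumps over the lazy dog.",
     "A fox jumps over a dog."),
    ("What is 2 + 2?", "4"),
]

# Mixing thousands of tasks like these is what makes "fine-tuning"
# start to look like pre-training.
training_strings = [format_example(i, o) for i, o in dataset]
print(training_strings[0])
```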
[00:25:37] The next question is how we evaluate these models. I think you'll see another lecture on evaluation, so I don't want to dive too deep into this, but generally, evaluation of these language models is an extremely tricky topic — there are a lot of biases you need to deal with, and a lot of this will be covered later. Some more recent progress is that we're starting to curate really large benchmarks, like MMLU, where models are tested on a broad range of diverse knowledge. This is just one example, and these are the topics you'll see. To give some intuition of what the examples in these evaluations look like: under astronomy you might be asked what is true for a Type Ia supernova, or you might be asked some questions about biology.
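Benchmark items like the astronomy example are multiple-choice, and evaluation boils down to accuracy of the predicted letter against an answer key. A sketch of rendering and scoring in the MMLU style — the helper names, the option texts, and the stub predictions are all illustrative, not the actual benchmark data:

```python
# MMLU-style multiple-choice evaluation: render each question with
# lettered options, then compare predicted letters to the answer key.

def render_mc(question, choices):
    """Render one multiple-choice item as a prompt ending in 'Answer:'."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{opts}\nAnswer:"

def accuracy(predictions, answer_key):
    """Fraction of items where the predicted letter matches the key."""
    return sum(p == a for p, a in zip(predictions, answer_key)) / len(answer_key)

prompt = render_mc(
    "What is true for a Type Ia supernova?",  # topic mentioned in the lecture
    ["It occurs in a binary star system",     # option texts invented here
     "It is powered by core collapse",
     "It leaves behind a neutron star",
     "It only occurs in spiral galaxies"])
print(prompt)
print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # stub predictions vs. key
```

A real run would replace the stub predictions with letters extracted from an actual language model's completions, optionally with few-shot or chain-of-thought prompting prepended.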
And there's a huge host of tasks like this. [00:26:30] These are typically multiple-choice questions, and you can ask the model to answer them. If the model is instruction fine-tuned already, hopefully it can simply answer the question, but you can also chain-of-thought prompt or few-shot prompt these questions. [00:26:43] Recently there's been a huge amount of progress on this benchmark. What people have observed is that more and more pre-training, on more and more data, with larger models, simply keeps climbing the number. 90% is often seen as the benchmark number these models want to cross, because it's roughly human-level knowledge or understanding, and recently the Gemini models reportedly crossed this number. [00:27:12] Go ahead. [Student] Isn't this the same benchmark story all over again? Like, at some point you realize, okay, maybe my
methods are implicitly too fine-tuned to the benchmark — isn't something like that happening here as well? [Instructor] Yes, I think this is a tricky topic. For a lot of models there's this question of whether your test sets are leaking into your training data, and there are huge concerns about that. It's a perfectly valid question — how do we even evaluate? — and this is why evaluation is actually very tricky. But one general thing to keep in mind: at some point it doesn't matter what your train/test split is if the models are generally useful. If the models are doing useful stuff — if you train on everything you care about and the model does well on it — does it matter? [00:28:03] So yeah, we still need better ways to evaluate these models, and
how they're if they're improving the [00:28:14] and how they're if they're improving the model or not but at some point like that [00:28:16] model or not but at some point like that those boundaries start to like be less [00:28:22] important cool so massive progress on [00:28:24] important cool so massive progress on this Benchmark starting with gpd2 and [00:28:26] this Benchmark starting with gpd2 and like we're roughly at 90% which to the [00:28:29] like we're roughly at 90% which to the point where these benchmarks are [00:28:30] point where these benchmarks are starting to become unclear if like [00:28:32] starting to become unclear if like improvements on these are actually [00:28:33] improvements on these are actually meaningful or not um in fact like most [00:28:37] meaningful or not um in fact like most of the times when the models are wrong [00:28:39] of the times when the models are wrong like you might often find that the [00:28:41] like you might often find that the question itself was unclear or ambiguous [00:28:43] question itself was unclear or ambiguous so all evaluation benchmarks have a [00:28:46] so all evaluation benchmarks have a certain limited utility to [00:28:48] certain limited utility to them so yeah um going to go over like [00:28:52] them so yeah um going to go over like another evaluation example of how this [00:28:54] another evaluation example of how this recipe like changes things so T5 models [00:28:58] recipe like changes things so T5 models were instruction fine tuned on a huge [00:28:59] were instruction fine tuned on a huge number of tasks and another Trend to or [00:29:02] number of tasks and another Trend to or which I think will be the theme across [00:29:04] which I think will be the theme across this lecture is that as your models [00:29:06] this lecture is that as your models become larger as they're trained on more [00:29:07] become larger as they're trained on more data they become more and more [00:29:09] data they 
[00:29:11] become more and more responsive to your task information as well. So what you'll observe here is that as the number of parameters increases, from T5-Small and Flan-T5-Small up to 11 billion parameters with T5-XXL, the improvement from going from a pre-trained model to an instruction-tuned model actually grows: the larger instruction-tuned model is all the better at following instructions. The difference is +6.1, and it goes to +26.6 as the models become larger. So this is another very encouraging trend: you probably should train on a lot of data with a lot of compute, and pre-training just keeps on giving. So yeah, I hope you all get a chance to play with a lot of these models; I think you already are, hopefully. But yeah, before instruction fine-tuning, when you're asked a question related to
[00:30:09] disambiguation QA, you get something like this, and it doesn't actually follow the "let's think step by step" instruction very clearly; after instruction fine-tuning, it is able to answer the question here. And more recently people have been researching what the instruction-tuning data set should look like. There's a huge plethora of instruction-tuning data sets now available (this is just a representative diagram) and there's a whole open-source community developing around these as well. Some high-level lessons that we have learned from this: one lesson that I think is interesting is that we can actually use really large, strong models to generate some of the instruction-tuning data to train our smaller models. So take your favorite model right now, GPT-4 maybe, or maybe Claude, or whatever, and you can get it to answer some
[00:31:01] questions and generate instruction-output pairs for training your open-source or smaller model, and that actually is a very successful recipe. So instead of getting humans to collect all the instruction-output pairs, or getting humans to generate the answers, you can get bigger models to generate the answers as well. That's the first thing that has recently emerged. Another thing being discussed is how much data we need. I talked about millions of examples, but people have often found that if you have really high-quality examples, you can get away with a thousand examples as well; this is the paper "LIMA: Less Is More for Alignment", and how data scaling in instruction tuning affects final model performance is still an active area of research. And yeah, crowdsourcing these data sets can be effective as well, so there are very
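A minimal sketch of that distillation recipe, under loud assumptions: `strong_model` is a hypothetical stand-in for an API call to a large model (GPT-4, Claude, etc.), stubbed here with canned answers, and the instructions and outputs are invented for illustration.

```python
# Sketch of the distillation recipe: a strong model answers instructions,
# and the (prompt, target) pairs become supervised fine-tuning data for a
# smaller model. `strong_model` is a hypothetical stand-in for a real API
# call; here it just returns canned answers.

def strong_model(instruction: str) -> str:
    # Hypothetical stub; in practice this would query a large model.
    canned = {
        "Summarize: An earthquake hit San Francisco.": "Earthquake hits SF.",
        "Translate to French: hello": "bonjour",
    }
    return canned[instruction]

def build_sft_examples(instructions):
    """Package model-generated answers as (prompt, target) SFT pairs."""
    examples = []
    for inst in instructions:
        # Instead of a human writing the answer, the bigger model writes it.
        examples.append({"prompt": inst, "target": strong_model(inst)})
    return examples

sft_data = build_sft_examples([
    "Summarize: An earthquake hit San Francisco.",
    "Translate to French: hello",
])
```

The smaller model would then be fine-tuned on `sft_data` with the usual next-token prediction loss on the target.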
[00:31:46] cool benchmarks emerging, like OpenAssistant. Yeah, a lot of activity in the field, and hopefully a lot more progress as we go on. Yes? A question sort of in the spirit of this LIMA paper: don't code, or, I don't know, math word problems, have this desired structure? So shouldn't we just be training code models, doing some English stuff on top, and then saying, okay, this is the best reasoning we can get at some point? Because code has the structure where you're going sort of step by step and you're thinking in some way, breaking down a concept into something smaller, so you can consider code to have very high-value tokens, so maybe just doing that... So I think, again, pre-training is a whole dark art that I am not completely familiar with, but code actually
[00:32:46] ends up being really useful in pre-training mixtures, and people do up-weight code data quite a lot. But it depends on what the users are going to use the models for, right? Some people might use them for code, some people might use them for reasoning, but that's not the only task we care about. As you might see later on (in the next step we'll discuss this as well), people often use these models for creative tasks: they want to write a story, they want to generate a movie script, and so on, and I don't know if training on reasoning-only tasks would necessarily help with that. So, go ahead. Would you say there exists some data distribution which is high-value for creative tasks? Yes, I mean, it seems like a lot of people write stories and everything on the internet all the time, which is not code, and sometimes
[00:33:39] there's this idea of hallucinations as well in this field, but you can often think, hey, creativity might be a byproduct of hallucinations as well. So I don't know what exact data would lead to more creative models, but generally there's a lot of data, a lot of stories, written on the internet, which allows the model to be creative. Yeah, but I don't know if I have a specific answer to the question. Cool, so we discussed instruction fine-tuning. Very simple and very straightforward: there are no complicated algorithms here, just collect a lot of data, and then you can start leveraging performance at scale as well; as models become better, they also become more easily specifiable and more responsive to the task. We're going to discuss some limitations, and I think this is
[00:34:28] really important for understanding why we are going to optimize for human preferences. Cool, so we talked a bit about this: instruction fine-tuning is necessarily contingent on humans labeling the data. Now, it's expensive to collect this data, especially as the questions become more and more complex; if you want to answer questions which may be at physics-PhD level, or things to that effect, these become increasingly expensive to collect. So yeah, this is perhaps obvious: pre-training does not require any specific data, you scrape data off the web, but for instruction fine-tuning you probably need to recruit some people to write down answers to your instructions, so this can become very expensive very quickly. But there are more limitations to this as well, and we were just discussing this: there are
[00:35:22] open-ended tasks related to creativity that don't really have an exact correct answer to begin with, so how do you generate the "right" answer to that kind of question? And yeah, language modeling inherently penalizes all token-level mistakes equally; this is what supervised fine-tuning does as well, but often not all mistakes are the same. So this is an example where you're trying to do this prediction task, "Avatar is a fantasy TV show", and perhaps you can see that calling it an adventure TV show is perhaps okay, but calling it a musical may be a much worse mistake, yet both of these mistakes are penalized equally. And I think one general aspect which is becoming increasingly relevant is that the humans that you ask might not generate the right, or the highest-quality, answer. Your models are becoming increasingly competitive, and you want, in some sense
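To make the equal-penalty point concrete: cross-entropy on the gold token depends only on the probability the model assigned to it, not on how semantically bad the alternative it preferred was. The probabilities below are made up purely for illustration.

```python
import math

# Toy next-token distributions for "Avatar is a ___ TV show", where the
# gold token is "fantasy". Probabilities are invented for illustration.
prefers_adventure = {"fantasy": 0.25, "adventure": 0.60, "musical": 0.15}
prefers_musical   = {"fantasy": 0.25, "musical": 0.60, "adventure": 0.15}

# Cross-entropy charges -log p(gold token) in both cases. The model that
# prefers "adventure" (a mild mistake) and the one that prefers "musical"
# (a much worse one) receive exactly the same training loss, because the
# loss never looks at where the remaining probability mass went.
loss_mild = -math.log(prefers_adventure["fantasy"])
loss_bad  = -math.log(prefers_musical["fantasy"])
```

A reward signal over whole outputs, as discussed later in the lecture, is one way to tell these two failure modes apart.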
[00:36:19] you're going to be limited by how high-quality an answer humans can generate, but often I find that the models are generating better and better answers. So do we really want to keep relying on humans to write down the answers, or do we want to somehow go beyond that? So these are the three problems we have talked about with instruction fine-tuning. We made a lot of progress with it, but this is not how we got ChatGPT. One high-level problem here is that even when we are instruction fine-tuning, there is still a huge mismatch between the end goal, which is to optimize for human preferences and generate an output that a human might like, and what we're doing, which is still a prediction task where we're predicting the next token, just on a more curated data set. So there's still a bit of a mismatch going on here, and it's not exactly
[00:37:18] what we want to do. I'm going to take a second here to pause, because this is important for understanding the next section, and if there are any questions, feel free to ask. So is this step still taken as a first step, or do we discard it? It's a good question. I think this is still one of the more important steps that you take before the next step, but people are trying to remove this step altogether and jump directly to the next step, so there's work emerging on that. But yeah, this is still a very important step before we do the next one. Go ahead. Is problem two also present in pre-training, and if so, how do you avoid it, just by having a lot of data? Yeah, that's a great question. There's one major difference with pre-training: pre-training covers a lot more text. So, just
[00:38:18] for context, as we talked about, pre-training is roughly 15 trillion tokens, whereas supervised instruction fine-tuning might be somewhere on the order of millions to billions of tokens, so it's a few orders of magnitude lower. Typically you'd only see one answer for a specific instruction, but during pre-training you'll see multiple texts and multiple completions for the same kind of prompt. Now, that's good, because when you see multiple answers or completions during pre-training, you start to weigh different answers, you start to put probability mass on different kinds of answers or completions, but instruction fine-tuning might force you to put all the weight on only one answer. Does that make sense? But generally, yeah, this is a problem with both stages, you're right. Anything else? Cool. So, as this whole thing alludes to, we're going
[00:39:14] to start to attempt to satisfy human preferences directly. We're no longer going to get humans to generate some data and then do some kind of token-level prediction loss; we're going to try to optimize for human preferences directly, and that is the general field of RLHF, and that's the final step in typically getting a model like ChatGPT. So, we talked about how collecting demonstrations is expensive, and there's still a broad mismatch between the LM objective and human preferences, and now we're going to try and optimize for human preferences directly. So what does optimizing for human preferences even mean? To establish that concretely, let's go through a specific example: summarization. We want to train a model to be better at summarization, and we want to satisfy human preferences. So let's imagine
[00:40:05] that a human is able to prescribe a reward for a specific summary. Let's just pretend there is a reward function: you and I can assign, say, reward +1, reward -1, or something to that effect. Okay, so in this specific case we have this input x, a news article about an earthquake in San Francisco that we want to summarize, and let's pretend that we get these rewards and we want to optimize them. We get one summary, y1, "An earthquake hit..." and so on, and we assign it a reward of 8.0, and another summary which gets a reward of 1.2. Generally speaking, the objective we want to set up is something of the following form: we take our language model p_theta, which generates a completion y given an input x, and we want to maximize the expected reward R(x, y), where x is the input and y is the output summary in
[00:41:07] this specific task. And maybe, just to point out something really concrete here: this is different from everything we have done, in one very specific way. We are sampling from the model itself. In the bottom term, if you look, we're using y drawn from p_theta. Everywhere we've seen so far, the data is sampled from some other source, either during pre-training or in supervised fine-tuning, and we're maximizing the log-likelihood of those tokens; but now we're explicitly sampling from our model and optimizing a potentially non-differentiable objective. Cool, so broadly the RLHF pipeline looks something like this. The first step is still instruction tuning, something we have seen up until now: we take our pre-trained model, instruction-tune it on a large collection of tasks, and get something which starts responding to our desired intent or
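As a minimal sketch of that objective, max over theta of E_{y ~ p_theta(y|x)}[R(x, y)]: since y is sampled, one standard way around the non-differentiability (not spelled out in the lecture at this point) is the REINFORCE estimator, grad E[R] = E[R(x, y) * grad log p_theta(y|x)]. The toy policy below chooses between two candidate summaries with a single logit; the 8.0 / 1.2 rewards echo the slide's example, and everything else is invented.

```python
import math, random

random.seed(0)

# Toy policy over two candidate summaries, parameterized by a single logit:
# p(y1) = sigmoid(theta). Rewards echo the slide's example (8.0 vs 1.2);
# all numbers are illustrative, not from any real system.
rewards = [8.0, 1.2]

def p_y1(theta: float) -> float:
    return 1.0 / (1.0 + math.exp(-theta))

theta, lr, batch = 0.0, 0.01, 100
for _ in range(400):
    p1 = p_y1(theta)
    grad = 0.0
    for _ in range(batch):
        # Key difference from pre-training/SFT: y is sampled from the model
        # itself, and R(x, y) is a black box (no gradient flows through it).
        y = 0 if random.random() < p1 else 1
        dlogp = (1.0 - p1) if y == 0 else -p1  # d/dtheta of log p(y)
        grad += rewards[y] * dlogp             # REINFORCE term: R * grad log p
    theta += lr * grad / batch                 # gradient ascent on E[R]
```

After training, the policy puts most of its probability mass on the higher-reward summary. In real pipelines a learned reward model plays the role of `rewards`, and PPO-style machinery replaces this bare estimator.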
[00:42:01] not. But there are two more steps after this, which are typically followed in creating something like InstructGPT. The first is estimating some kind of reward model, something which tells us, given an instruction, how much a human would like this answer, or how much a human would hate this answer. We looked at something like this earlier, but I didn't talk about how we even get something like that; that's the second step. And then we take this reward model and optimize against it through the optimization I suggested earlier, maximizing the expected reward under your language model. We're going to go over the second and third steps in detail. So the first question we want to answer is: how do we even get a reward model for what humans are going to like? This is a very ill-defined problem, generally speaking. So there's
[00:42:52] two problems here that we're going to address. First, a human in the loop is expensive. Let's say I ask a model to generate an answer and then get a human to label it with some kind of score; if I'm doing this over millions of completions, that is not very scalable. I don't want to sit around and label millions of examples. So this one is easy: we're in a machine learning class, so what are we going to do? We're going to train something which predicts what a human would or would not like. This is essentially a machine learning problem where we take these reward scores and try to train a reward model to predict, given an input and an output, what the reward scores would look like. A simple machine-learning, regression-style problem; you might have seen this
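A toy version of that regression setup, under loud assumptions: a real reward model would be a pre-trained LM with a scalar head, but here a bag-of-words linear scorer stands in, and the two (input, output, score) triples are invented for illustration.

```python
# Toy regression-style reward model: train a scorer r(x, y) -> scalar on
# human-assigned scores with squared error. A real reward model would be a
# pre-trained LM with a scalar head; a bag-of-words linear model stands in
# here, purely for illustration.

def features(x: str, y: str):
    # The model never needs x and y separated: it scores the whole string.
    return (x + " " + y).lower().split()

# (input, output, human reward score) triples, invented for illustration.
train_data = [
    ("summarize the article", "an earthquake hit san francisco", 8.0),
    ("summarize the article", "the weather was pleasant today", 1.2),
]

weights = {}  # one learned weight per word
lr = 0.05
for _ in range(200):  # plain SGD on the squared error (pred - score)^2
    for x, y, score in train_data:
        feats = features(x, y)
        err = sum(weights.get(w, 0.0) for w in feats) - score
        for w in feats:
            weights[w] = weights.get(w, 0.0) - lr * err

def reward_model(x: str, y: str) -> float:
    return sum(weights.get(w, 0.0) for w in features(x, y))
```

In practice absolute scores are noisy across annotators, which is why real pipelines often train the reward model on comparisons between outputs rather than raw scores.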
[00:43:40] earlier. Cool, now there's a bigger problem here. And sorry, go ahead. So do we use, I don't know, just embeddings for that, or do we use a real language model? That's a good question. Generally, reward models still need to understand the text really well, so they're typically bigger models, and they're typically initialized from the language model that you pre-trained as well. So you typically start with the pre-trained language model, do some kind of prediction that we'll talk about, and it'll give you a score. If you're doing that, how do you separate x and y? How does the language model know which part...? It doesn't need to. It only sees x and y as an input, so it doesn't typically need to see them separated; it's just going to predict a score
at the end. Okay — yeah, the X and Y is more for notational convenience here, because for us X and Y are different: X is the question the user asked and Y is something the model generated, but you shove the whole thing in together, yes. [00:44:48] Cool. Now, this is the bigger problem here: human judgments are very noisy. We've talked about wanting to assign a score to a completion, and this is extremely non-trivial to do. If I give you a summary like this, what score are you going to assign on a scale of 10? If you ask me on different days I'll give a different answer, first of all, but across humans this number is not calibrated in any meaningful way — you could assign a 4.1 or a 6.6, and different humans would simply assign different scores. And there are ways to address this: you can calibrate humans, you can give them a specific
rubric, you can talk to them — but it's a very complicated process, and still there's a lot of room for judgment, which is not very nice for training a model like this: if your labels can vary a lot, it's just hard to predict. [00:45:40] So the way this is addressed is that instead of trying to predict the reward label directly, you set the problem up in a slightly different way. Something much easier for humans to do is to give them two answers — or maybe many answers — and ask them which one is better. This is where the idea of asking humans to rank answers comes in. If I give you a whole news article and ask you which summary is better, you might be able to give me a ranking: oh, this second summary is the worst, but the first one is better, and the third one is somewhere in the middle between those two. So you get a
ranking, which gives you a preference over summaries. [00:46:19] And hopefully you can see the idea that's important here: even when we have some kind of consistent utility function, it's much easier to compare something against an alternative and say which is better than it is to ascribe it an arbitrary number on a scale — and that's why the signal from something like this is a lot better. [00:46:41] Now, we said we get this kind of preference data, and we still need some kind of reward score out of it: we shove in our input, we shove in a summary as well, and we still need to get a score out — but it's not obvious how to take this data and convert it into that kind of score. [00:47:02] In come a pair of pretty good friends named Bradley and Terry. Essentially, there's
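In practice, a ranking like the one just described is usually broken down into pairwise (winner, loser) comparisons before training the reward model. A minimal sketch of that conversion, assuming the ranking is given best-first (a hypothetical helper, not code from the lecture):

```python
# Turn a human ranking of completions (best first) into pairwise
# (winner, loser) preference examples -- a hypothetical helper,
# not code from the lecture.
def ranking_to_pairs(ranked_completions):
    pairs = []
    for i, winner in enumerate(ranked_completions):
        for loser in ranked_completions[i + 1:]:
            pairs.append((winner, loser))
    return pairs

# Ranking from the news-summary example: first is best, last is worst.
ranking = ["summary 1", "summary 3", "summary 2"]
print(ranking_to_pairs(ranking))
```

Each pair then becomes one training example for the reward model, which is why a single ranking of k completions yields k·(k−1)/2 comparisons.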
a long line of work in economics and psychology that tries to model how humans make decisions in situations like this. The Bradley-Terry model essentially says that the probability that a human chooses answer y1 over y2 is based on the difference between the rewards that humans assign internally, with a sigmoid around it. If you have looked at binary classification before: the logit is simply the difference between the reward of y1 and the reward of y2 — the difference between the winning completion and the losing completion. [00:47:51] Is everybody with me to this point? [00:47:57] So the idea is that if you have a dataset of pairs with a winning completion y_w and a losing completion y_l, the winning completion should score higher than the losing completion. Go ahead. Sorry, what is J — is that a log? Or, sorry, what
— what is the type of J, this number here that we're getting as the expectation? Is it a log prob, or what is it? It's a log prob, so it will be a scalar at the end. Let's say you have a reward model which gives a score r1 to y_w and r2 to y_l: you subtract those to get another number, you put it into a sigmoid, and you get a probability, because the sigmoid converts a logit into a probability. Then you take the logarithm of that, and you take the expectation over everything, and you get this final number, which tells you how well your reward model is doing on the entire dataset. [00:48:58] So a good model of humans should behave like this: it would generally assign a higher reward to the winning completion and generally assign a lower reward to the losing
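The computation just walked through — subtract the two reward scores, pass the difference through a sigmoid, take the log, and average over the dataset — can be sketched numerically. The reward values below are made up for illustration:

```python
import math

def bradley_terry_log_likelihood(pairs):
    """Average log probability that the winner beats the loser,
    where P(winner preferred) = sigmoid(r_winner - r_loser)."""
    total = 0.0
    for r_winner, r_loser in pairs:
        logit = r_winner - r_loser                  # difference of reward scores
        total += -math.log(1.0 + math.exp(-logit))  # log sigmoid(logit)
    return total / len(pairs)

# Made-up reward-model scores for (winning, losing) completions.
good_model = [(2.0, -1.0), (1.5, 0.0)]  # winners scored higher
bad_model = [(-1.0, 2.0), (0.0, 1.5)]   # winners scored lower
print(bradley_terry_log_likelihood(good_model))  # close to 0: agrees with humans
print(bradley_terry_log_likelihood(bad_model))   # very negative: disagrees
```

Maximizing this quantity (equivalently, minimizing its negative) is what fits the reward model to the preference data.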
completion. [00:49:14] Cool — the math is just beginning, so hold on to your seats. [00:49:21] So now let's see where we are. We have a pre-trained model p_PT(y | x), and we've got this fancy reward model, which gives us a model of humans: it can tell us which answer they liked and which answer they did not like. Now, to do RLHF — we have discussed what this will look like — we'll copy our pre-trained or instruction-tuned model and optimize the parameters of that model, and I suggested that the objective we want to optimize is the expected reward when we sample completions from p_θ. And we're going to optimize against our learned reward model instead of the true reward that humans would have assigned. Do you guys see any problem with this? [00:50:11] Is there
something that's wrong here, or that might go wrong, if we do something along these lines? [00:50:22] Go for it. It might collapse? Yes, okay. Generally, at least from my intuition: if you're ever optimizing some learned metric, I'd be very careful, because typically our loss functions are very clearly defined, but here my reward model is learned — and when it's learned, it means it will have errors. It's going to be trained on some distribution, and it will generalize somewhat, but it will have errors, and when you're optimizing against a learned model, the policy will tend to hack the reward model. The reward model might erroneously assign a really high score to a really bad completion, and if your policy — your language model — learns to find those, it will completely hack it and start generating those gibberish completions. [00:51:15] So, just as a general machine-learning
tip as well: if you're optimizing a learned metric, be careful about what you're optimizing and make sure it's actually reliable. [00:51:27] And this is obviously not desirable — if you start optimizing this objective, you're going to converge to gibberish language models very, very quickly. So typically what people do is add some kind of penalty that keeps the model from drifting too far from its initialization. Why do we want that? If the model cannot drift too far from its initialization, we know the initialization is a decent language model, we know it is not yet satisfying this reward model too much, and we also know that the reward model is trained on a distribution of completions around where the initial model is. So typically, when we talk about training this
reward model, we trained it on completions sampled from this initial distribution, so we know the reward model will be somewhat reliable in that distribution. So we're simply going to add a penalty which says: you should not drift too far away from the initial distribution. [00:52:20] Just to go over this: we want to maximize an objective consisting of our learned reward model, minus this beta-log-ratio term, where the ratio is between the model we're optimizing, p_θ, and our initial model, p_PT. What this says is that if we assign a much higher probability to a certain completion than our pre-trained model does, we add an increasingly large penalty to it — you're simply paying a price for drifting too far from the initial distribution. If you have taken machine learning: the expectation of this quantity
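The penalized objective described here — learned reward minus a beta-scaled log ratio between the optimized model and the pre-trained model — can be sketched for a single completion. The probabilities and reward below are invented for illustration; real implementations typically work with per-token log probabilities:

```python
import math

def penalized_reward(reward, p_theta, p_pt, beta):
    """One completion's contribution to the RLHF objective:
    learned reward minus beta * log(p_theta / p_pt), the price paid
    for drifting away from the pre-trained distribution."""
    return reward - beta * math.log(p_theta / p_pt)

# Invented numbers: the policy upweights this completion 10x relative
# to the pre-trained model, so it pays a penalty of beta * log(10).
print(penalized_reward(reward=2.0, p_theta=0.5, p_pt=0.05, beta=0.1))
# No drift (p_theta == p_pt) means no penalty at all:
print(penalized_reward(reward=2.0, p_theta=0.05, p_pt=0.05, beta=0.1))  # 2.0
```

Averaging the log-ratio term over completions sampled from p_θ is what gives the KL divergence the lecture mentions next.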
is exactly the Kullback-Leibler (KL) divergence between p_θ and p_PT, so you're penalizing drift between the two distributions. Go ahead, question. Shouldn't you also add a penalty like this in the previous version, where you were fine-tuning — or is this only relevant for RLHF? That's a good question. I think people do add some kinds of regularization in fine-tuning, but it's not nearly as critical as when you're doing this with RL: there the incentive is to exploit the reward model as much as possible, and we'll see examples where the learned reward says the model is doing really well but under the true reward the completions are complete garbage. So it's much more important in this optimization. [00:53:47] Cool. [00:53:49] Now, this course does not assume background in reinforcement learning, so we're not going to go deep into reinforcement learning, but I just
want to give a very high-level intuition about how this works. Reinforcement learning is not just used for language models — it's been applied to several domains of interest: game-playing agents, robotics, developing chip designs, and so on. The interaction between RL and language models dates back to roughly 2016 as well, but it's been really successful recently, especially with the success of RLHF. [00:54:27] The general idea is that we're going to use the model we're optimizing to generate several completions for an instruction, we're going to compute the reward under our learned reward model, and then we're going to simply update our model to increase the probability of the high-reward completions. So when we sample from the model we'll see completions of varying quality — some good completions, good summaries for
our task, some bad summaries for our task — and we'll try to update our log probabilities such that, when you use the updated model, you're typically in the higher-reward region. [00:55:04] Does the high-level summary make sense? [00:55:10] Cool. And RLHF is incredibly successful — I think this is a very good example. This is the same summarization example, and the key point here is that, sure, performance improves by increasing the model size — we have seen this in many different examples — but what you can actually see is that even very small models can outperform human completions if you train them with RLHF. And this is exactly the result you see here: the reference summaries are human-generated, and when you ask humans which ones they prefer, they often prefer the model-generated summary over the human-generated summary,
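The loop just described — sample completions, score them with the reward model, and shift probability mass toward high-reward samples — can be illustrated with a toy softmax policy over three fixed candidate summaries, updated with a REINFORCE-style gradient. Everything here (the candidates, the rewards, the learning rate) is invented for illustration:

```python
import math
import random

# Toy sketch of the RLHF sampling loop: a softmax policy over three
# fixed candidate summaries, nudged via REINFORCE so that probability
# mass moves toward high-reward completions. All numbers are made up.
random.seed(0)
completions = ["good summary", "okay summary", "gibberish"]
rewards = [1.0, 0.3, -1.0]  # stand-in for a learned reward model
logits = [0.0, 0.0, 0.0]
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(3), weights=probs)[0]  # sample a completion
    # REINFORCE: grad of log prob w.r.t. logits is one-hot(i) - probs
    for j in range(3):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * rewards[i] * grad

probs = softmax(logits)
print(probs)  # "gibberish" should have lost most of its mass
```

Real RLHF pipelines use far more elaborate estimators (PPO, value baselines), but the direction of the update is the same: raise log probabilities in proportion to reward.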
and this is something you only observe with RLHF, even at small scales. And again, the same scaling phenomenon still holds here — bigger models do become more responsive — but RLHF by itself is very impactful here. [00:56:02] Cool. The problem with RLHF is that it's just incredibly complex. I gave you a very high-level summary, but there are whole courses on this for a reason. This image is not for you to understand — it's purely there to intimidate you. [00:56:21] You have to fit a value function to something, you have to sample from the model a lot, it can be sensitive to a lot of hyperparameters — there's a lot that goes on here. If you start implementing an RLHF pipeline it can be very hard, and this is the reason why a lot of RLHF was restricted to very high-compute, high-resource places and was not very accessible. So what we're going to talk
about and cover in this course is something called direct preference optimization, which is a much simpler alternative to RLHF and hopefully much more accessible. But please bear with me — there will be a lot of math here, but the end goal of the math is to come up with a very simple algorithm. And feel free to stop me and ask questions as you need. [00:57:10] In terms of, say, GPT-4 versus GPT-3: how much does the number of parameters in the base model help — does it reduce the number of examples from humans needed for RLHF to work well? Yeah, that's a really good question. Generally speaking, if you hold the dataset size constant and simply increase the model size, it will improve quite a lot, sure. But the nice thing is that you can reuse the data, and you can keep adding data
— yeah — as you keep scaling models up. So typically nobody tries to reduce the amount of data collection; you just keep increasing both things. [00:57:51] Cool. So we talked about RLHF, and the current pipeline is something like this: we train a reward model on the comparison data that we've seen so far, and then we start with our pre-trained or instruction-tuned model and convert it into an RLHF model using the reinforcement learning techniques. [00:58:09] Now, the really key idea in direct preference optimization is: what if we could simply write a reward model in terms of our language model itself? To understand intuitively what is going on: a language model assigns probabilities to whatever is the most plausible next completion, but those plausible completions might not be what we intended. But you could restrict the
probability to just the completions that a human might like, and then the log probabilities of your model would represent something the humans might like, not just some arbitrary completion from the internet. So there can be a direct correspondence between the log probability that a language model assigns and how much a human might like the answer. And this is not some arbitrary intuition that I'm trying to come up with — we will derive this mathematically. [00:59:00] So the general idea of direct preference optimization is going to be: we're going to write down the reward model in terms of our language model, and now that we can write our reward model in terms of our language model, we can simply fit our reward model directly to the preference data we have — and we don't need to do the RL step at
all. So we start off with some preference data, and we simply fit our reward model to it, which directly optimizes the language model parameters. [00:59:28] And maybe, at a high level, why is this even possible? We did this really cumbersome process of fitting a reward model and then optimizing against it, but in the whole process the only external information being added to the system was the human labels on the preference data. When we optimize a learned reward model, there is no new information being added into the system — and this is why something like this is even possible. For quite a few years this was not obvious, but as you will see, some of these results start to make sense. [01:00:00] So we're going to derive direct preference optimization. I'll be here after the class as well if you have questions, but hopefully this
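As a preview of where the derivation lands: for a single preference pair, the DPO loss is the negative log-sigmoid of the gap between implicit rewards, where each implicit reward is beta times the policy-versus-reference log-probability ratio. A sketch with made-up log probabilities, not code from the lecture:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair: -log sigmoid of the margin
    between implicit rewards beta * (policy logp - reference logp)
    of the winning and losing completions."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# Made-up log probabilities: the policy has shifted mass toward the
# winning completion relative to the reference, so the loss is small.
print(dpo_loss(logp_w=-2.0, logp_l=-5.0,
               ref_logp_w=-4.0, ref_logp_l=-4.0, beta=0.5))
```

Note the shape: it is exactly the Bradley-Terry log likelihood from earlier, with the reward-model score replaced by the language model's own log-probability ratio — which is the substitution the upcoming derivation justifies.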
is clear. [01:00:11] So yes, we discussed that we wanted to solve this expected-reward problem, where we want to maximize the expected reward but we subtract this term, the beta log ratio, which essentially penalizes the distance between where our current model is and where we started off, so we don't drift too far away from where we started. Now it turns out that for this specific problem, instead of doing an iterative routine, there's actually a closed-form solution. [01:00:45] The closed-form solution looks something like this. Again, if you have seen the Boltzmann distribution or something to that effect before, this is basically the same idea. The idea is this: we're going to take a pretrained distribution p_PT(y given x), and we're going to reweight the distribution by the reward. So if a completion has a very high
reward, it's going to have a [01:01:09] higher probability mass, and if it has a lower reward it's going to have a lower probability mass, and this is determined by the reward. Beta is a hyperparameter which essentially governs the trade-off between the reward model and the constraint: as beta becomes lower and lower, you start paying more and more attention to the reward model. [01:01:32] So the probabilities look something like this, and there's this really annoying term, Z(x). The reason it exists is that the numerator by itself is not normalized; it's not a probability distribution. To construct an actual probability distribution you have to normalize it, and Z(x) is simply that normalization. So if we write Z(x) out, it's the sum over all y: it sums over all completions y for a
given instruction, and that's exactly why this is very pesky: it's intractable. [01:02:02] If I take an instruction and try to sum over every possible completion, not just the syntactically correct ones but every single possible one, we have 50,000 tokens, maybe even more, and completions can be arbitrarily long, so this space is completely intractable. This quantity is not even easy to approximate. [01:02:21] So the main point here is that if you're given a reward model, there does at least exist a closed-form solution which tells us what the optimal policy, the optimal language model, will look like. But if you do a little bit of algebra, move some terms around, take a logarithm here or there (I promise this is not very complicated), you can actually express the reward model in terms of the language model itself, and I think this term is reasonably intuitive as well. What it says is that a
[01:02:51] completion y-hat has a high reward if the model, my optimal policy, assigns a higher probability to it relative to my initialized model, and this is scaled by beta. So the beta log ratio is what we're looking at here. And the partition function, let's just ignore it for now; it's intractable, but the beta log ratio is the key part here. [01:03:15] Is everyone following along? Awesome. Okay, so right now I'm talking about optimal policies, but really, every policy is probably optimal for some kind of reward, right? This is mathematically true as well. So the important bit here is that you can take your current policy and your initialized model and get some kind of reward model out of it, and this is the exact identity which leads to that: the reward model can be expressed in terms of your language model, barring the log-partition term.
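Written out, the closed-form solution and the rearrangement just described look like this (a reconstruction in the lecture's notation, with p_PT the pretrained distribution, r the reward, and beta the KL weight):

```latex
% Closed-form optimal policy for the KL-regularized reward objective
p^*(y \mid x) \;=\; \frac{1}{Z(x)}\, p_{\text{PT}}(y \mid x)\,
  \exp\!\big(r(x, y)/\beta\big),
\qquad
Z(x) \;=\; \sum_{y} p_{\text{PT}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)

% Take logs and rearrange to express the reward in terms of the policy:
r(x, y) \;=\; \beta \log \frac{p^*(y \mid x)}{p_{\text{PT}}(y \mid x)}
  \;+\; \beta \log Z(x)
```

The second line is the "beta log ratio plus log-partition" identity referred to below.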
We'll see what happens to that term. [01:03:55] Go ahead. "Sorry, I don't know how you got this. Why is it that we can swap? Because there is a thing that we're trying to optimize, and how did p-star turn into p?" Yeah, for now we're not optimizing any reward model. All I'm saying is that if I take my current language model, it probably represents some kind of reward model implicitly, because of this relationship, because this holds for every p-star and every reward model. What I'm saying is that if I plug in my current language model, it also represents some kind of reward model; I'm not saying it's optimal. [01:04:29] "Okay, but at the beginning p_RL is p_PT, and so we just get that the reward is basically zero, so what do we do initially?" It's zero, but we can optimize the parameters. Yeah, that's a good
observation that it's basically zero in the beginning. "But how do we start optimizing it?" I'll get to that. [01:04:50] Okay, any other questions? "So the idea is that, given the language model, you have a reward model such that it makes the language model optimal?" Yes, that's the next step. But the key idea is that my language model's probabilities already implicitly define a reward model; I think that's really the main point here, and this mathematical relationship is exact. [01:05:21] Cool. Now, I'm obviously ignoring the elephant in the room here, which is the partition function; it's not going to magically vanish. If this were just the beta log ratio, that would be really nice: I can compute all these quantities, I know how to compute the log probability under my language model, and I know how to compute the log probability under my
pretrained model, and I can compute the reward score and I can optimize this, but I don't know what to do about my log-partition function. [01:05:50] This is where something fun happens. Recall what the reward-modeling objective was when we started off: we started with our friend Bradley-Terry again, and what we really wanted to optimize was the reward difference between the winning completion and the losing completion. And really, that's all we care about; we don't care about the exact reward itself, what we care about is maximizing the difference between the winning and losing completions. That's actually really key here, because if you plug in the definition of RM_theta there, what you'll observe is that the partition function actually just cancels out. [01:06:32] Now why does it cancel out? The input is exactly the same; the x is
actually exactly the same in the difference, so the partition function Z(x) will just cancel out; it's the same in both terms. So what you get is that the reward difference between the winning and losing completion is the difference between the beta log ratios for the winning and losing completion. [01:06:56] You can plug in the terms and work it out; it's fairly simple. So the partition function, which was something we could not address, could not compute, actually just vanished. "Sorry, Z doesn't appear in the Bradley-Terry model..." But it appears here in this equation. "So how does it plug into the model?" We're going to take this equation, the last line that you see, and we're going to plug it in in place of RM_theta. "Okay, and the first loss equation?" Oh, I see, yeah: the first loss equation is the Bradley-Terry loss model. [01:07:37] Cool.
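To make the cancellation concrete, here is a minimal sketch of the resulting per-pair loss in plain Python; the function and argument names are hypothetical, and real implementations work on batched token log-probabilities from the two models, so treat this as illustrative only:

```python
import math

def dpo_loss(logp_win, logp_lose, logp_ref_win, logp_ref_lose, beta=0.1):
    """DPO loss for one preference pair (illustrative, not an official API).

    Each argument is the total log-probability of a completion under either
    the current model (logp_*) or the frozen pretrained/reference model
    (logp_ref_*). The intractable log Z(x) never shows up: it is the same
    for the winning and losing completion, so it cancels in the difference.
    """
    # Implicit rewards: beta * log(p_theta(y|x) / p_ref(y|x))
    reward_win = beta * (logp_win - logp_ref_win)
    reward_lose = beta * (logp_lose - logp_ref_lose)
    # Bradley-Terry negative log-likelihood of the observed preference:
    # -log sigmoid(margin) == log(1 + exp(-margin))
    margin = reward_win - reward_lose
    return math.log1p(math.exp(-margin))
```

Note that adding any constant to both reference log-probabilities, which is exactly the role a log Z(x) term would play, leaves the loss unchanged, since only the reward difference enters the Bradley-Terry likelihood.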
So this really is it. The key observation is that we could express our reward model in terms of the language model, and our problems with the partition function go away because we were optimizing the Bradley-Terry model. [01:07:51] What you get is something like this: we express the loss function directly in terms of our language model parameters theta, and we can directly optimize on our data without doing any RL steps at all. This is simply a binary classification problem: we're really just trying to classify whether an answer is good or bad, and that's really what we're doing. [01:08:16] Before I go on, do people want to absorb this? Is everyone okay with it? "I don't get where the win and lose labels come from. Are they human?" Good question. It's the same dataset we
started with in RLHF as well, but the way the process works is that you take a set of instructions, get the model to generate some answers, and then get humans to label which answer they prefer. So they're model-generated; they can be human-generated as well, but they're typically model-generated, and then you get some preference labels. All you need is a label saying which is the better answer. [01:08:58] "What do you lose here? You must be losing some information, because of the lack of any information about the partition function; you're canceling it out." Yeah, you are bound to lose information about other possible completions, which you would have taken into account in standard RLHF. That's a really good question. I don't think I'll be able to completely answer it in time, but the partition
function is almost a kind of free variable. [01:09:35] I think the point here is that there are many reward models that satisfy this optimization, so there's a free variable that you can actually completely remove, and that's what this optimization benefits from. Think of it this way: if I assign something a reward of plus one and something else a reward of minus one, that's basically the same as saying the rewards are plus 99 and plus 97; since only the difference matters, it will give you the same loss, right? So the scale doesn't matter; it's shift-invariant in a way. [01:10:08] "Isn't that somehow not what you want, though? If you're actually training a reward model, a reward of 99 means you should pay much less attention to that, as compared to one, or zero or something." What
we're assuming is that our choice model here is this: if a human prefers one thing over the other, the probability is governed only by the difference between the rewards. That's an assumption that every RLHF method makes, and DPO also makes. Now, is that assumption true? Not completely, but it holds to a fairly large degree. But that's a good question. [01:10:52] Cool, I'll move on in the rest of the time. The goal of this plot is to show that we actually get fairly performant models when we optimize with DPO. In this plot, the main thing you should look at is PPO, which is the typical RLHF pipeline. We're evaluating the models on summarization, comparing to human summaries, and what we find is that DPO and PPO do similarly, so you're really not losing much by just doing the DPO procedure instead of RLHF, and that's
really compelling, because DPO is simply a classification loss instead of a whole reinforcement learning procedure. [01:11:29] So I want to quickly summarize what we have seen thus far: we want to optimize for human preferences, and the way we do this, instead of relying on uncalibrated scores, is to get comparison data and feedback on that. We then use this ranking data either to do something like RLHF, where we first fit a reward model and optimize it using reinforcement learning, or to do something like direct preference optimization, where we simply take the dataset and run a classification loss on it. And there are trade-offs between these algorithms: when people have a lot of computational budget they typically go for RLHF or some routine like that, but if you're really looking to get the bang for your buck, you
might want to go for DPO, and that's probably going to work out of the box. [01:12:17] It's still an active area of research; people are still trying to understand how best to work with these algorithms, so I'm not making any strong claims here, but both of these algorithms are very effective, and DPO is just much simpler to work with. [01:12:32] Cool. So yeah, let's see: we went through all this instruction tuning and RLHF, and what do we get? InstructGPT is the first model which followed this pipeline; it defined this pipeline. So we got models which did 30,000 or so tasks. Remember when we were doing only one task? Now we have scaled up from 1,000 tasks to 30,000 different tasks, with many, many different examples. So that's where we are with InstructGPT, and it follows the pipeline we just described. In this case they're following
a specific RLHF pipeline, where they explicitly fit a reward model and then do some kind of reinforcement learning routine on top of it. [01:13:14] And the tasks collected from labelers look something like this; I'll leave it to your imagination, or you can look at the details. Where we started off was with completions like the ones we see from GPT-3, which, for "explain the moon landing to a six-year-old," is not really following the instructions, whereas InstructGPT will give you something meaningful: it's inferring what the user wanted from the specific instruction and converting that into a realistic answer that a user might like. [01:13:43] And these are just more examples of what an InstructGPT-like model would do, whereas your base model might not follow the instructions according to your desired intentions. And we went from InstructGPT
to ChatGPT, and it was essentially this same pipeline. [01:14:01] The key difference here is that it is still doing instruction tuning, but it is more optimized for dialogue, more optimized for interacting with users. So the core algorithmic techniques we discussed today are what give us ChatGPT, but you have to be really careful about the kind of data you're training on, and that's really the whole game. This is the foundation for ChatGPT, and it follows the same pipeline as well. [01:14:29] You might interact with ChatGPT (I'm sure you all have interacted with it in some form or other), and this is an example of what a ChatGPT interaction might look like: you want to make it Gen-Z. The idea here is that it's very good at responding to instructions and intent. This is not something that we could even few-
in very easily uh these are kind of [01:14:54] shot in very easily uh these are kind of instructions are hard to come examples [01:14:56] instructions are hard to come examples for but like this is probably not [01:14:58] for but like this is probably not something to trained on either but it's [01:14:59] something to trained on either but it's able to like infer the intent and [01:15:01] able to like infer the intent and generalize very very nicely and that's [01:15:03] generalize very very nicely and that's something I find personally very [01:15:07] something I find personally very remarkable cool and there's been a lot [01:15:10] remarkable cool and there's been a lot of progress on the open source front as [01:15:12] of progress on the open source front as well so like DPO is much simpler and [01:15:14] well so like DPO is much simpler and much more efficient and essentially all [01:15:16] much more efficient and essentially all the open source models these days are [01:15:18] the open source models these days are using DPO so this is a leaderboard that [01:15:21] using DPO so this is a leaderboard that is maintained by hugging hugging face a [01:15:23] is maintained by hugging hugging face a so like I mean N9 out of 10 more models [01:15:25] so like I mean N9 out of 10 more models here are trained with DPO so that's been [01:15:28] here are trained with DPO so that's been something that's been enabled the open [01:15:29] something that's been enabled the open source Community to instruction tune [01:15:31] source Community to instruction tune their model betters as well and same is [01:15:34] their model betters as well and same is being used in many production models now [01:15:36] being used in many production models now as well mistol is using DPO llama 3 used [01:15:38] as well mistol is using DPO llama 3 used DPO so these are very very strong models [01:15:41] DPO so these are very very strong models which are nearly gp4 level and they're [01:15:43] 
And they're also starting to use these algorithms as well. [01:15:48] And something that's very cool to see is this: we went through all this optimization and math and stuff, but what is really fundamentally changing in the behavior? I think this is a really good example. If you ask for an SFT output from an instruction-tuned model, you'll get something like this, but when you RLHF the model you actually get a lot more detail in your answer, and it'll probably organize the answer a little better. That's something that maybe humans prefer, which is why it's a property that is emerging in these models, but it's a very clear difference between simply instruction-tuned models and models which are [01:16:30] RLHF'd. So, yeah, we discussed this whole RLHF routine where we are directly modeling the preferences and we are generalizing beyond labeled data. We also discussed that RL can be very tricky to correctly implement, though DPO avoids some of these issues. [01:16:56] And we briefly touched upon the idea of reward models and reward hacking: when you're optimizing for a learned reward model, you will often see this example, where there's a way for the agent to just keep repetitively crashing the boat into objects to get more and more points, which wasn't the goal of the game. This is a very common example shown for reward hacking: if you do not specify rewards well, models can learn weird behaviors which are not your desired intent, and that's something a lot of people worry about as well. Part of the reason is that reinforcement learning is a very strong optimization algorithm.
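That reward-hacking failure mode boils down to something very simple. The sketch below is entirely invented (the policies, the reward numbers, the boat-game abstraction); it just shows that a pure maximizer of a mis-specified proxy reward picks the degenerate behavior over the intended one.

```python
def run_episode(policy, steps=100):
    """Toy boat-race episode: return (proxy_reward, finished_race).

    "race" heads straight for the finish line: small one-time bonus,
    race completed.  "loop" circles a respawning target forever: a
    little reward every step, race never completed.  All numbers are
    invented for illustration.
    """
    if policy == "race":
        return 10.0, True
    if policy == "loop":
        return 0.5 * steps, False
    raise ValueError(f"unknown policy: {policy}")

# A pure reward-maximizer compares only the proxy reward...
scores = {p: run_episode(p)[0] for p in ("race", "loop")}
hacked = max(scores, key=scores.get)
# ...and prefers endless target-looping (50.0 points) over actually
# finishing the race (10.0 points): the reward has been hacked.
```

A strong optimizer will find whatever maximizes the number you wrote down, not the intent behind it, which is exactly the worry raised in the lecture.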
[01:17:31] It's at the heart of AlphaGo and AlphaZero, which results in superhuman models, so you have to be careful about how you specify things. And the other thing is that even optimizing for human preferences is often not the right thing, because humans do not always like things which are in their best interest. So something that emerges is that they like authoritative and helpful answers, but they don't necessarily like truthful answers. One property that shows up is that they'll prefer authoritativeness over correctness, which is maybe not so nice. [01:18:04] Please go ahead.

On those lines, I'm curious whether ChatGPT being so widely used by the public will maybe change how the rewards are made, because I at least feel like now, when I go to ChatGPT and type something, it gives me five detailed paragraphs of information. Sometimes I'm just annoyed by that; that's not what I wanted. But maybe in the original reward function people actually preferred that, and now people prefer it less.

[01:18:29] Yeah, that's a great point, because as these models integrate more and more into our systems, they're going to collect more and more data, and they will pick up on things, maybe undesirable things as well. As far as I understand, ChatGPT is really cutting down on the verbosity, which is a huge issue that all of these models are trying to cut down on, and they are dealing with that. Part of the reason why that emerges is that when you collect preference data at scale, people are not necessarily reading the answers; the Turkers might just simply choose the longer answer, and that's a property that actually goes into these models. But hopefully these things will improve over time as [01:19:04] they get more feedback. And yeah, hallucination is not a problem that is going to go away with RL, and we talked a bit about reward hacking as well, and biases and so on. But what I want to conclude with is that we started with pretrained models, these things which could predict text, and we got ChatGPT, and hopefully it's a little more clear how we go from something like that to ChatGPT. And that's where I'll end. Thanks.

================================================================================
LECTURE 012
================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 11 - Benchmarking by Yann Dubois
Source: https://www.youtube.com/watch?v=TO0CqzqiArM
---
Transcript

[00:00:05] Great, so I think let's get started, because we have a lot to cover today. My name is Yann; for those who don't know me, I'm a third-year PhD student advised by Tatsu and Percy, and today I'll be talking about benchmarking and evaluations.
[00:00:20] So benchmarking and evaluations are honestly something that I think not enough people look at in academia, but if you really want to put something in production, and you really care about, let's say, real-world machine learning, evaluation is really key. So let's talk about that.

An overview of what we'll talk about: first, different reasons for measuring performance; then text classification and how you measure performance there; then text generation and how you measure performance there; and finally, how you evaluate current large language models, and some issues and challenges with the ways that we actually perform evaluations.

[00:01:02] Okay, so my mental model of how you actually develop a machine learning model is that first you will be training your model. Here, measuring performance is really key, because you need a loss and you need to know how to optimize it. Then, once you are optimizing your loss, the second step is basically development. Usually this is hyperparameter tuning, or, for example, early stopping during training: if you see that your model is not performing that well, or that there's some overfitting happening, you might decide to stop, or you might decide to change the learning rate during the training of your model. So development is the second step, and here you need to measure performance because you need to know how to do hyperparameter tuning and how to change hyperparameters. Then the third step is essentially model selection: if I have a task that I really care about, which model performs best for my task? That might be a model that I have trained; it might be a model that another group has trained. And finally, at least in the real world, you would decide to deploy your model, and here measuring performance is really key because you need to know whether your model is good enough to put in production. In the parallel universe that we live in, there's also publishing, where you basically need to evaluate a model on standard benchmarks, and the reason we do that is essentially to communicate the quality of our model to different groups. So at every step of this pipeline you really need to measure performance, and that's what we'll talk about today. But what is key to understand is that at different steps you need to measure performance in different ways; there's really not a single ideal way of measuring performance.

[00:02:54] So, for example, on the left: when you train your model, you really need a way of measuring performance that is super fast, super cheap, and differentiable, because with neural networks you basically backpropagate through the loss, so it needs to be differentiable. And you really cannot have a way for your model to take shortcuts, optimizing the loss even though it's not really what you wanted to optimize. As you move more to the right, you measure performance less often, so it's fine if it's more expensive, but you really need your evaluation metrics to be higher quality, because the stakes if you put a model in production are higher.
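The "fast, cheap, differentiable" metric used during training is typically just the cross-entropy loss. A minimal sketch, with a toy vocabulary and made-up probabilities rather than any particular model:

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class: the standard fast,
    cheap, differentiable loss used while training a classifier or LM."""
    return -math.log(probs[target_index])

# Toy next-token distribution over a 4-word vocabulary (made-up numbers).
probs = [0.1, 0.7, 0.1, 0.1]
confident_correct = cross_entropy(probs, 1)  # 0.7 on the right token
confident_wrong = cross_entropy(probs, 0)    # only 0.1 on the right token
# The loss is much smaller when the model puts its mass on the right token.
```

Because it is a smooth function of the model's output probabilities, it can be backpropagated through on every batch, which is exactly the requirement described above.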
[00:03:45] During the development stage, you need a way of measuring performance that is fast, cheap, and also avoids shortcuts, because when you do hyperparameter tuning you're essentially also optimizing over a certain objective. Model selection can be a little bit less fast and less cheap, but you will still have to do it many, many times. And most importantly, when you deploy a model, you really want the way you evaluate performance to be trustworthy, because once you put something in production there's no way to go back and undo what happened during the time it was in production. You also want things to be very task-specific: if I care about a certain task when I put my model in production, you really need to evaluate on that specific task; I don't care about other tasks. And finally, you need your metrics to be absolute. The reason I'm highlighting that is that in the three other steps you really just care about comparing between things, which is very different from having a threshold that says: if I have less than 95% accuracy, I'm not putting my model in production.

[00:04:44] Okay, and now let's talk about publishing. This is a little bit different from, honestly, evaluation in the real world, but when you do academic benchmarking and evaluate your models on academic benchmarks, you want the benchmark to be reproducible and standardized. The reason is basically that for the next five or six or ten years, everyone will be evaluated on that one benchmark, and you want papers in three years to be comparable to yours, so it's really important that your evaluations are reproducible. Honestly, you don't really care about that in the real world. You also want things to be easy to work with, because researchers usually don't want to do additional work that they don't need to, and they usually don't have that many resources, so it needs to be fast and cheap. And finally, one thing I really want to highlight is that for the academic benchmarks we usually have, it's fine if the metrics we use are not perfect, because what really matters is the direction the metric shows you to go in over ten years, basically how the field is moving. If the metric says things are better over ten years, then in reality the field has made some progress. So at a meta level, it's fine to use crude metrics in academia.

[00:06:09] You also kind of need to balance between difficulty and simplicity. What I mean by that is: if your benchmark is way too complicated, then basically all methods will have essentially random performance, so no one will use your benchmark; and if your benchmark is too simple, then the baseline will be so good that no one will use your benchmark, because no one can beat the baseline. This is really something specific to academia; in the real world, you're not going to be able to change the task you're performing based on how good your model is. That's why I want to highlight this: usually people talk about evaluations, but there are really different ways of evaluating and different reasons why we evaluate. Does that all make sense? Also, feel free to ask questions.

Great. Okay, so benchmarks in academia: this is really the way we drive the field.
[00:07:05] So this is the MMLU benchmark. I think Archit briefly mentioned it, but I'll talk about it again later. This is the most standard benchmark right now, and you can see that in the last four-ish years it has gone from 25% accuracy, which is essentially random because it's multiple choice with four choices, to around 90-ish percent accuracy. So yeah, benchmarking is really what drives progress in the field, and again, you see what I meant here: it's not really the differences between small points that matter, at least in academia. You have to take a step back and think about how your models will perform over ten years, and make sure that the model on the top right here is better than the model on the bottom left, even if the benchmark is not perfect. And I think MMLU is a pretty good one in that sense.

[00:08:01] Okay, so there are two main types of tasks in NLP, at least classically. Close-ended tasks: I'll talk about them later, but essentially you can think about classification, where you know exactly the correct label for the task you're performing. Here this is the IMDB dataset, where you're asked to say whether a sentence has positive or negative sentiment. The text is "Read the book, forget the movie"; this is sentiment classification about the movie, so here it's basically negative. [00:08:36] And then there's open-ended evaluation. Think about ChatGPT: how do you evaluate something like that, where there's really no single correct answer, or there are many possible correct answers, and they all have different qualities? So we're going to distinguish between those two.
[00:08:54] Close-ended evaluation. As I just said, let's define a close-ended task as one where there's a limited number of potential answers, think less than ten, and often there's just one, or maybe a few, correct possible answers. This really is standard machine learning: if you think about standard classification, you can just use accuracy, you can look at your precision and your recall. There's nothing special here about NLP. That is not to say that it's simple; it's just that there's nothing special about NLP here.

[00:09:33] So, some close-ended tasks. I already told you about sentiment analysis; usually this is a binary classification task where you just have to say whether the sentiment is positive or negative. For sentiment analysis, the typical benchmarks (I always put them next to the task) are IMDB and SST from Stanford. Another task is entailment; the typical benchmark is SNLI, also from Stanford. You have some text, here "A soccer game with multiple males playing," and a hypothesis, "Some men are playing a sport," and you have to say whether the hypothesis is implied, or entailed, by the text; here it is. Other tasks: part of speech, with the Penn Treebank as the typical benchmark, and named entity recognition, which is a CoNLL benchmark.

[00:10:23] A few other tasks; you don't need to know all of them, but just to give you a brief overview. Coreference resolution is actually a pretty challenging NLP task where you have to say which pronoun refers to which noun. You have the sentence "Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful," and now you have to say what "he" refers to, for instance whether "he" refers to Pete. And then there's question answering, where you basically have a long text, the test asks a question, and you're supposed to provide an answer based on the text you were given. So those are some examples of close-ended tasks, and again, the key here is that the way we evaluate them is just standard machine learning: you can look at accuracy, precision, recall, F1 score. [00:11:18] Hopefully you all know about these kinds of metrics, but if you don't, you should look at Chris Potts's class, I think it's CS224U; his lectures are online and actually really good on different metrics.

[00:11:36] So the way people evaluate some of these benchmarks is usually by looking at many of them concurrently.
The most common multitask benchmark, I would say, is called SuperGLUE. Here on the columns you have all the different tasks in SuperGLUE, I think there are eight or nine, and then you really just look at the average performance across these benchmarks and you get a ranking on that. That is an attempt to measure general language capabilities. This is what people used to do, I would say, until maybe two years ago; I will tell you about what people do now around the end of the lecture, but yeah, SuperGLUE is definitely something you should at least be aware of. Examples of tasks in SuperGLUE: one is BoolQ, which is simply that you have some text, you have a question, and you have to say whether the answer is yes or no; that's very easy to evaluate, you just look at accuracy or precision and recall. Entailment we already talked about, and then the other ones include coreference resolution, which we also talked about, and word sense, where you have two sentences with the same word and you have to say whether it actually means the same thing in both sentences. For example, "bank" could mean a river bank or a money bank, and you have to say whether in these two sentences it refers to the same concept. And there are some question answering tasks too. So that's SuperGLUE. Are there any questions? No? Cool. So again, although I've said many times that this is essentially just classical machine learning, I want to emphasize that that doesn't mean it's simple, and you really have to think carefully about what you do when you use these types of close-ended tasks.
In particular, you're going to have to choose whether you look at accuracy, precision, recall, F1 score, ROC curves, AUC. If you don't know these names, you should really check out the scikit-learn documentation or the lecture from Chris Potts that I linked above; both are really good. But depending on which metric you choose, you will decide on very different types of algorithms. The usual example is spam: you want to classify whether an email is spam or not. Most emails are not spam, thankfully, at least I hope; so let's say that 90% of emails are not spam and only 10% of them are. If you look at accuracy, then a trivial classifier that always predicts the most likely label will get 90% accuracy, and if you don't really know your dataset, 90% accuracy seems good, but in reality here it means that you're not classifying anything. That's why you want to look at precision, recall, and F1. Anyway, I will not talk too much about that, because again this is not specific to NLP, but that doesn't mean it's easy. Another issue is that once you have multiple different tasks, there's the question of how you aggregate these metrics. Right before, I told you "oh, you just take the average" over all of these things; that honestly is a really terrible thing to do, but it's actually what people do. These columns actually mean very different things: some of them are accuracies, others are F1 scores, others are correlations, and you just average everything.
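The accuracy pitfall in that spam example is easy to verify numerically. Below is a toy sketch with made-up counts (90 ham, 10 spam), computing the metrics by hand; in practice you would use the functions in scikit-learn's `sklearn.metrics`:

```python
# Toy imbalanced "spam" dataset: 90 ham (0), 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A trivial classifier that always predicts the majority class (ham).
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision / recall / F1 for the positive (spam) class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, recall, f1)  # 0.9 0.0 0.0
```

So the classifier scores 90% accuracy while catching zero spam, which is exactly why precision, recall, and F1 are the metrics to watch on imbalanced data.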
I can't remember which benchmark it was, but a few years ago there was one benchmark where for one of the columns you actually had better performance if the value was lower, and people still took an average of these things, until someone realized that maybe we should put a minus sign there. So yeah, be careful, and don't always think that what people do in academia is correct; you should think a little bit about it. Then there are some other questions I want you to think about. Where do those labels come from? I said there is usually a real answer, but how you actually get those labels is unclear, so I will tell you about some issues on the next slide. Related to that, there might be some spurious correlations, and that's what we're going to talk about right now. We already talked about SNLI, so entailment: here you have again your premise, "the economy could still be better," and the hypothesis, "the economy has never been better," and you have to say whether the hypothesis is implied by the premise. What this paper from 2019 found is that all the different models were performing really well, but if you just classified based on the hypothesis, you could also perform really well. So even if you did not look at the premise, which seems like something you need to take into account because it's part of the task, you could perform well. The reason is that when the humans actually wrote the hypotheses, they were asked, "write a hypothesis which is not entailed by the premise," and the way humans usually do that is by adding a negation. So if you only look at the hypothesis and you see a negation, it's very likely that it's not entailed by the premise.
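The negation artifact is easy to illustrate. The following is a hypothetical toy heuristic, not the model from the paper: a "classifier" that never reads the premise and just looks for a negation word in the hypothesis.

```python
# Annotators asked to write non-entailed hypotheses often add a negation,
# so negation words alone become a spurious but predictive signal.
NEGATIONS = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only_guess(hypothesis: str) -> str:
    """Ignore the premise entirely; guess from the hypothesis alone."""
    tokens = hypothesis.lower().replace(".", "").split()
    return "not-entailed" if NEGATIONS & set(tokens) else "entailed"

examples = [  # (premise, hypothesis, gold label), from the lecture's examples
    ("A soccer game with multiple males playing.",
     "Some men are playing a sport.", "entailed"),
    ("The economy could still be better.",
     "The economy has never been better.", "not-entailed"),
]
correct = sum(hypothesis_only_guess(h) == gold for _, h, gold in examples)
print(correct, "/", len(examples))  # 2 / 2
```

A real hypothesis-only baseline trains a full classifier on the hypothesis text, but the mechanism it exploits is the same as this two-line heuristic.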
So again, even though this is standard machine learning, be really careful about what metric you use and where the labels come from, and don't just use what other people do, thinking that if there were an issue, people would have realized. So yeah, that is spurious correlations. Any questions on close-ended tasks? Cool. Okay, open-ended evaluations; I'm going to mostly talk about these because this is what is specific to NLP. An open-ended evaluation, or open-ended task, is essentially the opposite of a close-ended task, which is to say that there are many possible correct answers and you cannot enumerate all of them, so you really can't use standard machine learning metrics. Even more than the fact that you cannot enumerate all the possible answers, there are usually different levels of correctness. If I ask you to write a book, or if I ask ChatGPT to write a book, it might be a decent book, but there might be a better book that it could have written, or that another model could write; so it's not just right and wrong, it's a continuum. Standard examples of open-ended tasks: the two most common ones are summarization and translation. In summarization you have a long piece of text and you just ask for a summary in less than X characters. The standard benchmark is the CNN/Daily Mail benchmark. The way they actually collected that dataset is that they took a lot of CNN articles, and, you know, at the top of CNN articles you have bullet points that kind of say what the most important things in the article are, so they used those as essentially the gold summary. So that's the classic one for summarization. For translation you basically have sentences in two different languages and you have to translate from one to the other. Those are the classical ones. The way people currently do it, I would say the most standard task right now, is instruction following. Instruction following is kind of the mother of all tasks, in the sense that you can view any previous task as just a chatbot, some question that you ask, basically, ChatGPT. Classification? I could just ask ChatGPT to do that. Summarization? I could ask ChatGPT to do that. So essentially you could just view a chatbot as the most general type of task: you can ask it to perform any possible task and it should just provide the answer for that task. This is what we call instruction following. As you might think, evaluation is very hard in that domain.
And that's what we'll talk about later: how do you evaluate something like ChatGPT? Okay, so, types of evaluation methods for text generation, or open-ended tasks. The classical ones are content overlap metrics, which I'll talk about first; that's really comparing just the words between a reference answer, a gold answer that humans wrote, and the actual generation that you got from your model. Then there are model-based metrics, where you basically turn evaluation into machine learning: you train a model to become an evaluator. And then there's human evaluation, which is usually seen as the gold standard for open-ended tasks. So, content overlap metrics. As I just said, this is really just comparing word by word, or group of words by group of words, between the generated sequence and some reference. Here I have the generated sequence, "the woman went to the hardware store," and the gold reference, the reference written by humans (I actually don't even know what the task is), which is "they walked to the grocery store." What you do is just compare the two sentences by looking at the lexical similarity between those two texts. This is super fast and efficient, and the way you usually do it is by using n-gram overlap metrics. What I mean by this is that the simplest possible thing is just to check, for every word in the generated sequence, whether it appears in the reference sequence, and if it does, you increment your score. N-grams are essentially the same thing, but instead of looking at a single word, you look at bigrams, trigrams, multiple words
next to one another. The usual overlap metrics, the most common ones, are BLEU and ROUGE. "Bleu" means blue in French and "rouge" means red; that's not what they stand for, though, and I always forget what they do stand for. Basically, BLEU is an n-gram overlap metric that tries to look at precision, while ROUGE is one that looks at recall. As I alluded to before, what's important is that even if you turn everything into a kind of sentence comparison, you still have to think about whether you care about precision or recall. These metrics are not ideal, but until, I would say, two years ago, they were the gold standard for translation and summarization. For translation, people use BLEU. Let's say I'm translating from French to English: I want to look at the generated sequence in English and the actual reference sequence in English, and I want to know how many of the bigrams I generated appear in the reference sequence. There's one additional thing, which is that BLEU doesn't only look at precision, because you could get very high precision by generating something very short. For example, if you only ever generated the word "the," you would most likely get very high precision, because "the" appears in nearly every sentence; or, say, a full stop. So there's also a length penalty, the brevity penalty. ROUGE is kind of the opposite: it just looks at recall. So those are the common content overlap metrics. Next, let me illustrate why they are not ideal.
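As a sketch of the core idea, here is a toy, single-sentence BLEU-style score. The real BLEU combines clipped precisions for several n-gram orders with a geometric mean at the corpus level; `toy_bleu` is a hypothetical simplification that keeps just one n-gram order plus the brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, n=2):
    """Clipped n-gram precision times a brevity penalty (toy version)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clipping stops a candidate from earning credit by repeating one n-gram.
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    precision = matches / total
    # Brevity penalty: short candidates can't win on precision alone.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(toy_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(toy_bleu("the", "the cat sat on the mat", n=1))  # tiny, thanks to BP
```

A ROUGE-style recall analogue would divide the match count by the number of reference n-grams instead of candidate n-grams.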
They have many issues, but one of them is that they don't really take into account the semantic relatedness between words. So imagine that Chris asks you, "are you enjoying the CS224N lectures?" Of course the gold answer is "heck yes!", so that's the reference answer. Now let's say that the model just generates "yes!". If I look at the BLEU score, I will get about 67%, essentially because two of the unigrams I generated are in the gold reference. If I generate "you know it!", then only a single token in the generated sequence appears in the reference sequence, namely the exclamation point, and I get a much lower BLEU score. And if I just say "yep!", then that doesn't appear at all in the reference sequence, so I get a BLEU score of zero, which is a false negative, because it literally means the same thing as "heck yes." So hopefully you see that these metrics really have issues. You can also have false positives: for example, if you say "heck no!", then most of the words are the same, so you get about a 67% BLEU score, but it really means something completely different. Does that make sense? Any questions? Cool. So very naturally, now that you know everything about word embeddings, what you might ask is: why do we look at words, when we could look at learned representations, which actually maintain the semantic similarity between words? This is exactly what people have done, around 2019 I think, or even before, actually 2016: they took some word embeddings, and they associated every word
in the reference sequence with a word embedding, and every word in the generated sequence with the corresponding word embedding, and they basically started comparing the word embeddings. A very simple way of comparing word embeddings is to take the average of the word embeddings in the reference sequence and the average of the word embeddings in the generated sequence, and then maybe look at cosine similarity. There are more modern ways of doing it, but honestly at this point it's not that important; you can just think about averaging. Another point, as you know by now, is that word embeddings don't really take into account the context in which the word appears, so a better way of getting good representations for words is by looking at BERT.
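A minimal sketch of that averaging idea, with made-up 3-dimensional vectors standing in for real word2vec or GloVe embeddings (the numbers here are invented purely for illustration):

```python
import math

# Hypothetical toy embeddings; a real system would load word2vec/GloVe vectors.
EMB = {
    "heck": [0.1, 0.9, 0.0],
    "yes":  [0.8, 0.2, 0.1],
    "yep":  [0.7, 0.3, 0.2],  # deliberately close to "yes"
}

def avg_embedding(tokens):
    """Average the per-word vectors into one sentence vector."""
    dims = len(next(iter(EMB.values())))
    vecs = [EMB[t] for t in tokens]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

ref = avg_embedding(["heck", "yes"])
gen = avg_embedding(["yep"])
print(round(cosine(ref, gen), 3))  # high similarity, despite zero word overlap
```

Unlike BLEU, this gives "yep" a high score against "heck yes" because the vectors are close, which is exactly the semantic relatedness that n-gram overlap misses.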
[00:26:39] You take a BERT model, pass the generated sequence through it, and get some embeddings; then you take BERT again, the same BERT, pass the reference sequence through it, and get some other embeddings, and then you again do some comparison. BERTScore, a pretty famous paper, does a smarter comparison, but it's not that important to understand exactly what they do; what is important is that they do some smart averaging over those words. Cool, any questions? Okay, so that was the simplest type of method: word matching. Another, slightly more complicated one is called BLEURT, also pretty famous, which is a mix between BLEU and BERT. The way they did it is that they took a pre-trained BERT and then did some continual pre-training.
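The flavor of BERTScore's comparison can be sketched as greedy token matching over token vectors. This is a simplification: real BERTScore uses contextual BERT embeddings and optional idf weighting, and the vectors here are toy values:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(cand_vecs, ref_vecs):
    """Match every candidate token to its most similar reference token
    (precision) and every reference token to its most similar candidate
    token (recall), then combine the two averages into an F1 score."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

ref = [[1.0, 0.0], [0.8, 0.6]]             # token vectors of the reference
print(greedy_match_f1(ref, ref))           # 1.0 for identical sequences
print(greedy_match_f1([[0.0, 1.0]], ref))  # lower for a diverging candidate
```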
[00:27:43] The continual pre-training tries to predict the BLEU score and some other metrics, and then they fine-tune. That's the important part: they fine-tune the pre-trained model to actually do the evaluation they care about. So let's say I have a lot of different sequences and some human annotations of how they should be evaluated; I can just treat that as a normal machine learning task and fine-tune my BERT to do the evaluation. So this is BLEURT. Any questions? [Student] I'm curious: if you pre-train on BLEU, wouldn't that cause the same problems? If your pre-training task is BLEU, how would the model learn to judge language semantically in the first place? Yeah, that's a very good point, and actually I also find it kind of surprising. They did two things: first they do the real pre-training of BERT, and then they do the continual
[00:28:38] pre-training for predicting BLEU. The reason is that they usually have a lot of sequences in their dataset that are unlabeled: there are some reference sequences and some generated sequences, but no human annotation of whether each one is good or bad, so they treat it as an unsupervised learning objective. So what do you use as the supervised signal? Well, you have to use something, and they basically use BLEU, and they actually also use BERTScore; they use many different tasks and basically do multitask learning. Cool. Okay, so one important issue with all of these methods is that they can only be as good as the references, and in reality the references are usually not that good. This is a paper that looks at summarization of news, so basically, as
[00:29:36] I said before, most news summarization benchmarks usually take the reference summary to be the bullet points you find at the top of an article, and those are usually not that good. What you see on the left is the correlation between, on the x-axis, the human-evaluated performance of every model, and, on the y-axis, ROUGE-L, which is just a variant of ROUGE. You look at whether these two are correlated, and what you see is that they are essentially not correlated, which means that ROUGE-L computed against standard references really does not track what humans would say is a good summary. That is not to say that ROUGE is a bad score; it is to say that the references are bad.
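The correlation being plotted is just a statistic over (human score, metric score) pairs, one pair per model. A small sketch with invented per-model numbers:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five systems: human ratings vs. a ROUGE-like metric.
human  = [3.1, 3.4, 2.8, 4.0, 3.7]
metric = [0.21, 0.19, 0.23, 0.22, 0.20]
print(pearson(human, metric))  # about -0.3: the metric barely tracks the humans
```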
[00:30:37] Because if you look at the exact same thing, but now you ask experts to write very good summaries, then you see that the correlation actually increases by a decent amount. Still not perfect, ROUGE is definitely not perfect, but at least it's much better. So this is to say that the metric itself is not always perfect, and on top of that the references are usually not great. Cool. So that begs a very natural question, which is: can we just move away from reference-based evaluation? As we just said, reference-based evaluations are the ones that compare human-written references to model outputs using various metrics, and those used to be the standard benchmarks for evaluating NLP tasks, I would say up to two or three years ago. Right now, I think papers still have to
[00:31:36] always show BLEU scores, for example in translation, because reviewers want those, but I don't think anyone in the real world actually uses them, though I might be wrong on that. So yeah: BLEU, ROUGE, BERTScore. Oh, and I was mostly talking about BLEU and ROUGE; BERTScore is actually still decently used, and actually pretty good. Okay, so reference-free evaluation: this is where you have a model and you ask it to give a score, but there are no human references. The way this used to be done is essentially by taking a model like BERT again, but instead of comparing a reference answer and the generated answer, you just ask it to take the input and predict the score directly. That's one simple way of doing it, and it used to really not work well. I say "used to" because
[00:32:28] until basically ChatGPT and GPT-4, it didn't. Now what people do, and honestly it works super well, is just ask GPT-4 to do the same task you would ask a human: you give it a very long text, then you give it the generated summary, and you ask how good the summary is, essentially, and that works surprisingly well. Common benchmarks here are AlpacaEval and MT-Bench; there are many others now, and honestly most people are starting to use these kinds of techniques, but we'll be talking at least about AlpacaEval. Good. Okay, so let's talk a little bit about human evaluation before looping back to GPT-4. As we saw, the metrics so far all have some shortcomings, and they are definitely not as good as asking humans directly, because they are based on references. So human evaluation is really the gold standard.
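Mechanically, the GPT-4-as-judge setup described above is just a prompt plus a parse of the reply. A sketch of what such a prompt could look like; the template wording is invented for illustration (AlpacaEval and MT-Bench ship their own carefully tuned judge prompts):

```python
# Hypothetical judge-prompt template; real benchmarks tune this wording heavily.
JUDGE_TEMPLATE = """You are evaluating a summary of the following article.

Article:
{article}

Summary:
{summary}

Rate the summary from 1 (unusable) to 10 (excellent), considering
faithfulness, coverage, and fluency. Answer with a single integer."""

def build_judge_prompt(article: str, summary: str) -> str:
    return JUDGE_TEMPLATE.format(article=article, summary=summary)

prompt = build_judge_prompt("Storms hit the coast on Tuesday ...",
                            "Coastal storms struck on Tuesday.")
# `prompt` would be sent to a strong model (e.g. GPT-4) through its API,
# and the integer in the reply parsed as the score.
print(prompt.splitlines()[0])
```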
standard for open-ended [00:33:26] really the goal standard for open-ended uh open ended tasks um and not only is [00:33:30] uh open ended tasks um and not only is it really the the standard way of doing [00:33:34] it really the the standard way of doing evaluation or like the goal standard for [00:33:35] evaluation or like the goal standard for evaluation it's also the gold standard [00:33:37] evaluation it's also the gold standard for developing new automatic evaluations [00:33:40] for developing new automatic evaluations so every time you you develop a new [00:33:43] so every time you you develop a new automatic evaluations you will want to [00:33:44] automatic evaluations you will want to compare to uh what humans would have [00:33:48] compare to uh what humans would have basically uh predicted [00:33:51] basically uh predicted um [00:33:54] yeah okay so doing human evaluation i' [00:33:57] yeah okay so doing human evaluation i' first it might seem very simple you [00:33:59] first it might seem very simple you basically ask humans to evaluate the [00:34:01] basically ask humans to evaluate the quality of some generated text seems [00:34:03] quality of some generated text seems simple right uh but actually it's super [00:34:06] simple right uh but actually it's super complicated and it's a it's a real like [00:34:08] complicated and it's a it's a real like Challenge and it has many issues so [00:34:10] Challenge and it has many issues so first um oh sorry I I'll talk about that [00:34:13] first um oh sorry I I'll talk about that before maybe one additional thing is [00:34:15] before maybe one additional thing is that you should not only ask the human [00:34:17] that you should not only ask the human you usually ask it also to um ask them [00:34:20] you usually ask it also to um ask them to evaluate across different axes for [00:34:23] to evaluate across different axes for example the fluency of the text or the [00:34:24] example the fluency of the text or the 
[00:34:26] coherence of the text, or common sense, or the style, grammaticality, redundancy: whatever axes you might care about. Another thing to note is that you should absolutely never compare across different human evaluations. If one paper says humans rated the fluency of its text at, I don't know, four out of five, and another paper reports three out of five, they used different humans and different ways of prompting those humans, so the numbers are absolutely not comparable. Okay, so let's go back to some of the issues. As I said, human judgment is regarded as the gold standard, but it definitely has issues. First, it's super slow: as you might expect, humans are definitely not as fast as automatic metrics. Second, at least in academia, it's still pretty expensive, because when
[00:35:25] you pay your workers well, it's pretty expensive to do human evaluation properly. Another issue is inter-annotator disagreement: if I take two random people in this room and ask them to evaluate the quality of a generated text, I can assure you that they will really not agree. This is especially bad if the task is subjective, but even if you first talk for an hour about how the generations should be evaluated, I can almost guarantee you will still disagree on many of the evaluations. To give you an example: when we were doing AlpacaFarm last year, we basically had to take some inputs and two models, think ChatGPT, Alpaca, these types of models, have the two models each predict an answer, and then ask the humans to say which answer they prefer.
[00:36:24] That is a very simple task, and, as I will discuss later, it is what a lot of people basically use right now for evaluating models like ChatGPT. So a natural question is whether humans are good at doing that. What we saw is this: we were five researchers doing the labeling; the five of us talked for two or three hours and wrote extremely detailed rubrics for how to do the evaluations, and still, labeling independently, we only agreed 67% of the time, where 50% is random. And we were really trying to do our best; we were the ones working on this project, so it's not as if we were rushing through it. So people really do disagree. Of course, if you then allow discussions between the annotators, agreement actually improves.
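Raw agreement numbers like that 67% are easier to interpret once corrected for chance. A small sketch with invented preference labels; for a balanced two-way choice, chance agreement is 50%, so 67% raw agreement corresponds to a Cohen's kappa of only about 0.34:

```python
from collections import Counter

def raw_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: 0 means chance level, 1 means perfect."""
    po = raw_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    pe = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / n ** 2
    return (po - pe) / (1 - pe)

# Invented "which answer do you prefer" labels from two annotators.
ann1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
ann2 = ["A", "B", "B", "A", "B", "A", "A", "B"]
print(raw_agreement(ann1, ann2), cohens_kappa(ann1, ann2))  # 0.75 0.5
```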
[00:37:20] But then it becomes even slower and more expensive. Then there is intra-annotator disagreement, which is extremely annoying: if I ask myself to evaluate something right now, versus in three hours, after I've had dinner or gone for a run, I will actually give different annotations. [Student question] You mean for validating? Yeah, that's a very good question, and honestly there's no good answer. The usual way people do it is to look at some statistical test: you say, okay, I want to compare these two models, I'm going to perform a t-test, and I want to know that my p-value is below a certain threshold. What people also usually do when they have human annotations (I unfortunately didn't put a slide on this) is that they have a metric for computing agreement.
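As a concrete stand-in for that significance check, here is a paired bootstrap test, a common nonparametric alternative to the t-test mentioned above (the per-example scores are invented):

```python
import random

def paired_bootstrap_pvalue(baseline, challenger, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: resample evaluation examples with
    replacement and count how often the challenger fails to beat the
    baseline. Small values mean the improvement is reliable."""
    rng = random.Random(seed)
    n = len(baseline)
    failures = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(challenger[i] - baseline[i] for i in idx) <= 0:
            failures += 1
    return failures / n_resamples

# Invented per-example scores for two models on the same eval set.
model_a = [0.61, 0.55, 0.58, 0.60, 0.57, 0.59, 0.62, 0.56]
model_b = [0.66, 0.60, 0.61, 0.67, 0.59, 0.65, 0.68, 0.62]
print(paired_bootstrap_pvalue(model_a, model_b))  # 0.0: B beats A on every example
```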
[00:38:15] They compute the inter-annotator agreement and try to achieve a certain level, and if they don't reach it, they essentially ask for more annotators or for re-labelings. Yeah, and human evaluation is not reproducible, partly, in fact mostly, because of the two things we just said. There is an interesting paper on this, from 2021 I think, though I'm not sure, where they say, and I am reading from the abstract here: "just 5% of human evaluations are repeatable in the sense that there are no prohibitive barriers to repetition and sufficient information about experimental design is publicly available for rerunning them". This is a paper that analyzed, I think, 128 different papers that were published
[00:39:12] across about five years, I think between 2015 and 2020, and they found that essentially only 5% of those papers were reproducible. So honestly, working with humans is hard; that's definitely something to remember. Another issue is that humans only evaluate precision, not recall. What I mean by that is: if you show me what the model generated, I can only evaluate that particular generation; I cannot evaluate all the other possible generations the model could have produced, because then you would really have to sample a lot of outputs, and that becomes way too slow and way too expensive. And finally, the incentives are usually not aligned: what you want is for the humans to do the best possible evaluations, but what crowd workers usually want is to maximize the amount of money they get paid per hour. So, to give
[00:40:05] you again a concrete example: when we were doing AlpacaFarm, I think we were paying relatively well, in the sense that we paid 1.5 times the minimum wage in California. We looked at how much time we ourselves would need to evaluate a single example as well as we could, and then we divided by that time to work out how much we would pay per example. What we realized is that the workers ended up being paid, I think, 2 or 2.5 times the minimum wage, because they were simply doing things two or three times faster than us. I mean, we could just be slow, but I think what was happening is that they were trying to maximize the dollars they earned per hour, and as a result they were finding shortcuts in their evaluations, and this is
[00:40:57] something that you really see across papers. For example, in our case, you saw that humans really preferred longer answers, and of course, if you give me two very long generations and ask me, with a minimal amount of work, to say which one is better, when I see the longer one I think: ah, there are probably more details, it's probably better. Anyway, that's not to say that everyone is like that, but the incentives are definitely misaligned, so you have to be careful about this. Other challenges: first, you have to decide how to describe the task; you really have to give very detailed rubrics for how the humans should evaluate it. Then there's the question of how you show the task to the humans: for example, the order in which you present examples is actually really important. In our case, because we had two examples side by side,
[00:41:42] which one is on the left and which one is on the right is actually also very important. All of these things really matter; of course, you can randomize them away, but it adds challenges. Then there is the question of what metrics to use, though that is not specific to humans. Selecting the annotators is also very complicated. You might think: okay, I have some money now, I can go on Amazon Mechanical Turk and just ask workers to do some annotations. But in reality you want to have the good annotators, so how it usually works on Mechanical Turk is that you say, here's a task, I want 30 different people to do these annotations; they start annotating, and if someone doesn't achieve the level you want, you pay them for what they annotated until then, and you work
you you work with someone else afterwards uh so then [00:42:32] with someone else afterwards uh so then there's a question of how do you decide [00:42:34] there's a question of how do you decide whether they achieved the performance [00:42:35] whether they achieved the performance that you want uh so you probably have to [00:42:38] that you want uh so you probably have to do like some gold labeling before and [00:42:39] do like some gold labeling before and then look at like some accuracies of how [00:42:42] then look at like some accuracies of how well and like some intra anator [00:42:43] well and like some intra anator agreement with you and with like the [00:42:45] agreement with you and with like the other researchers on your team uh so it [00:42:47] other researchers on your team uh so it is very [00:42:48] is very complicated and not only this you have [00:42:50] complicated and not only this you have to monitor that over time um so there [00:42:53] to monitor that over time um so there are many different ways you can monitor [00:42:54] are many different ways you can monitor that over time looking again at the [00:42:56] that over time looking again at the accuracy so maybe every let's say a [00:42:59] accuracy so maybe every let's say a typical thing is that every batch of [00:43:00] typical thing is that every batch of example that you label you give a few [00:43:03] example that you label you give a few you give a few examples that are [00:43:04] you give a few examples that are actually uh ones that you already know [00:43:06] actually uh ones that you already know what the Gold Label is and you see how [00:43:08] what the Gold Label is and you see how well they're performing on that another [00:43:11] well they're performing on that another way to look at is like the time that [00:43:12] way to look at is like the time that they take to annotate um [00:43:16] they take to annotate um yeah okay so that was about humans uh so [00:43:20] yeah okay so that 
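The gold-label monitoring idea above can be sketched in a few lines. This is not the lecturer's actual tooling, just a minimal illustration: seed each batch with items whose correct label is already known, then score the annotator against them.

```python
# Sketch: monitor annotator quality by seeding batches with items
# whose gold label is already known (all data here is made up).

def gold_accuracy(annotations, gold):
    """annotations: dict item_id -> label from the annotator;
    gold: dict item_id -> known correct label (the seeded items)."""
    checked = [item for item in gold if item in annotations]
    if not checked:
        return None  # no seeded items in this batch
    correct = sum(annotations[item] == gold[item] for item in checked)
    return correct / len(checked)

# Example: annotator labeled 5 items, 3 of which were seeded gold checks.
gold = {"q1": "A", "q7": "B", "q9": "A"}
annotations = {"q1": "A", "q2": "B", "q7": "B", "q8": "A", "q9": "B"}
acc = gold_accuracy(annotations, gold)  # 2 of 3 gold items correct
```

Tracking this accuracy per batch, alongside annotation time, gives the over-time monitoring described above.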
[00:43:20] Human evaluation is hard, but it is the gold standard. Okay, now let's talk about reference-free evaluation and chatbots. I already told you about it before very briefly: how do you evaluate something like ChatGPT? This is extremely complicated, because basically you could ask it any task you want, and it can answer with text that is arbitrarily long, and that just makes evaluation extremely hard. So as I suggested before, the usual way it's done is that you take two models, you put them side by side, you ask the same question, and you just ask either some humans, or some model as we will see afterwards, which one is better. The most common benchmark right now for human evaluation, I would say, is called Chatbot Arena, where basically anyone can go online and play for free with some of the best models out there, and all they ask you is to say whether you prefer the one on the right or the one on the left, essentially. And then once they reach, I think, a crazy amount of data, 200,000 human votes for example, they basically add it to a leaderboard. The way they add it to the leaderboard is that, I don't know if you know how chess works, but they basically look at Elo ratings. They treat everything as if it were a tournament, such that not every model has to play against every other model, and then they get Elo scores. So what's missing with this side-by-side human eval? As I said, this is really the gold standard for evaluation of chat LLMs, but there are still some challenges. First, it's basically random people online asking random questions and providing their preferences.
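The Elo mechanism mentioned above works like this: each model has a rating, each pairwise vote is treated like a chess game, and the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch (the starting rating and the K-factor here are illustrative, not Chatbot Arena's exact constants):

```python
# Sketch of the Elo update used by tournament-style leaderboards
# such as Chatbot Arena (constants are illustrative).

def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # Winner gains and loser loses the same amount.
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one head-to-head vote.
ra, rb = elo_update(1000, 1000, a_won=True)  # ra -> 1016.0, rb -> 984.0
```

Because ratings are updated from whatever pairings happen to occur, not every model needs to play every other model, which is exactly the property the lecture points out.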
[00:45:05] That may not be representative, although arguably, when you have that many examples, it becomes actually pretty representative of what people would want. So it's probably better than whatever else we have, but it is still not ideal. And then really the big issue is cost. This takes a huge community effort and a lot of people to work on. It also takes a lot of time to get new models onto the benchmark, and only the notable models, think the OpenAI models, the Claude models, the Google ones and the Facebook ones, are going to be benchmarked. You will never have, for your random model, 200,000 people who are willing to annotate it for free. So this is an issue, and again, as we talked about on the first slide, even those big companies can definitely not do that during development of their model; this is something that comes at the end, for maybe model selection. Okay, so how do we make it faster? One very natural solution is basically to ask a large language model to do the evaluation for you. So imagine that I want to compare ChatGPT with Mistral: I basically ask GPT-4 to evaluate which one is better. This is surprisingly good, and I will show you some results afterwards. Some common versions are AlpacaEval and MT-Bench, probably the two most common ones. So when we started doing that, that's the problem I told you about, we started around last year, and we found that using GPT-4 for evaluation, at least if you look at the prices now, would be 100 times faster and 100 times cheaper than if you use human evaluations. But, and this is very surprising, the agreement with humans is actually higher than humans' agreement with themselves.
[00:46:57] What I mean by that is this, this is what we found: say I have a pool of four humans, and I take out one human and look at the agreement between that human's preferences and the mode of the preferences of the three others, and I do that in a leave-one-out fashion. That agreement will be lower than if I ask the model to predict the mode of the humans' preferences. So in some ways, models are more highly correlated with humans than humans themselves, which is very surprising, and I will tell you a little bit more about it in two seconds. When we did that, we actually used it for collecting preferences for RLHF; that's what we call RLAIF, as I think Archit told you about last week.
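The leave-one-out comparison just described can be sketched concretely. The data below is made up; the point is the two quantities being compared: each held-out human against the mode of the remaining humans, versus an LLM judge against the mode of all humans.

```python
# Sketch of the leave-one-out human agreement vs. LLM-judge agreement
# comparison described above (toy data, illustrative only).
from collections import Counter
from statistics import mean

def mode(labels):
    return Counter(labels).most_common(1)[0][0]

def loo_human_agreement(votes_per_item):
    # votes_per_item: list of per-item lists of human preference labels.
    scores = []
    for votes in votes_per_item:
        for i, held_out in enumerate(votes):
            rest = votes[:i] + votes[i + 1:]
            scores.append(held_out == mode(rest))
    return mean(scores)

def judge_agreement(votes_per_item, judge_labels):
    # Agreement of a (deterministic) judge with the human mode per item.
    return mean(j == mode(v) for v, j in zip(votes_per_item, judge_labels))

votes = [["A", "A", "B", "A"], ["B", "B", "A", "B"]]
judge = ["A", "B"]                      # the LLM judge's preferences
human = loo_human_agreement(votes)      # 0.75 on this toy data
model = judge_agreement(votes, judge)   # 1.0: consistent judge matches mode
```

A consistent judge can beat the leave-one-out human score simply because individual humans disagree with their own majority, which is the variance point the lecture makes next.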
[00:47:53] So, going back to this surprising result that models are more highly correlated with humans than humans themselves: the reason is that humans actually have high inter-annotator disagreement, essentially high variance. Models will always be very consistent, or maybe not perfectly, there's still some stochasticity, but essentially they will always predict the same label, so they have very little variance. So here, what you see on this plot: on the, sorry, x-axis we estimated the variance, and you see that the human has a variance of around 31 or 33. If you look at the red point, this is basically if you just ask GPT-4 to do the evaluations: even though the bias is still pretty high, and bias by definition for humans is zero, while for GPT-4 it is around 32%, the variance is much lower than for humans. This is why you can see that agreement is actually sometimes higher: it's really because there is no variance, or very little variance, in LLMs. Yeah, does that make sense? [Student: so the agreement of LLMs is higher than a human's?] Sorry, it means the internal consistency is higher, exactly. Which is actually a good sign, because that makes it much easier for research; the bad sign is that the bias is still high. Yeah. Okay, so, things to be careful with when you work, and this is both with humans and with LLMs: there will be some spurious correlations. We already talked about spurious correlations, but you will see a lot of those. One very common example is length: as I told you before, if you ask crowd workers which examples they prefer, they are highly biased towards longer outputs. So here the blue is humans, it's around, I think, 70% preference for longer outputs.
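The length bias just quoted (roughly 70% preference for the longer output) is easy to measure on your own preference data. A minimal sketch, with made-up data:

```python
# Sketch: estimate how often annotators (human or LLM) prefer the
# longer of two outputs, the length bias discussed above. Toy data.

def longer_preferred_rate(pairs):
    """pairs: list of (output_a, output_b, preferred), preferred in {'a','b'}.
    Pairs with equal lengths are skipped."""
    hits, total = 0, 0
    for a, b, preferred in pairs:
        if len(a) == len(b):
            continue
        longer = "a" if len(a) > len(b) else "b"
        hits += (preferred == longer)
        total += 1
    return hits / total if total else None

pairs = [
    ("short answer", "a much longer, more detailed answer", "b"),
    ("another long and detailed reply here", "ok", "a"),
    ("tiny", "a somewhat longer output", "a"),
]
rate = longer_preferred_rate(pairs)  # 2/3: longer output preferred twice
```

A rate far above 0.5 on data where longer is not actually better is the spurious correlation the lecture warns about.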
[00:49:46] And models show around the same bias. Another example is preference for lists: usually, if you see lists in an output, models prefer those examples, and humans prefer those examples too. Another bias, or spurious correlation, is position. I told you how which one you put on the left and which one you put on the right matters when you ask humans to label; there's the same thing with models, but this is usually pretty easy to control for: you just randomize both. Another issue is GPT-4 self-bias. Very naturally, you might wonder: if I ask GPT-4 to evaluate itself, it will probably be biased, it will prefer itself over other models. This is true, but less so than what you might think, as I will tell you about later. Okay, so AlpacaEval. Wait, until what time do I have? [You have 30 minutes.] Oh, thanks.
[00:50:42] Great. Um, okay, AlpacaEval. So AlpacaEval is the benchmark that we developed when we were working on Alpaca. As I told you before, one thing which is very important is what you use for development, so basically for hyperparameter tuning. What we did is that we basically did not trust many of the benchmarks out there at that point for instruction following, so we just developed a very small benchmark for ourselves, and this is what we were using for hyperparameter tuning; and then it kind of became its own thing. So, AlpacaEval in a few numbers: it has very high correlation with Chatbot Arena. If you look at the correlation between the ranking in Chatbot Arena and in AlpacaEval, it's 98%, so very high, and it takes around 3 minutes and $10 to evaluate a model. As for the way it works, I think I already mentioned it.
[00:51:33] Basically, you take an instruction, you generate an output from one model and from another model that you're comparing it to, and you ask GPT-4 to give the probability that it prefers the model you're evaluating over the baseline you're comparing to. Then you do some reweighting, and the reason you do some reweighting is that these models, as I said, are very biased towards longer outputs, so you want to reweight such that, if it's a longer output, you give it a slightly lower preference. And then you average across your entire dataset and you get a win rate. So that's how it works. Any questions? Cool. Um, so, system-level correlation. Here, what you see on the x-axis is basically AlpacaEval, I mean a slight transform of it, but essentially the AlpacaEval scores, and on the y-axis is Chatbot Arena, which is the gold standard, and you see that things are relatively highly correlated. On the lower plot you see basically the correlation between different benchmarks and Chatbot Arena, and you see that MT-Bench and AlpacaEval, which are the two that use LLMs for evaluation, are relatively highly correlated with Chatbot Arena, and MMLU, which is the automated one that doesn't use an LLM, is also very highly correlated. Um, so I told you very briefly about the fact that we had to do some reweighting. I'm not going to tell you how we do it, but I want to tell you why we do it. One of the issues that we realized a little bit too late is that if you take something like GPT-4 and you just prompt it to be much more detailed, to basically provide much more detailed answers, its win rate, so its performance on your benchmark, changes a lot.
[00:53:27] It goes from 50% to 64.3, so that's this one, 64.3. If you ask it to be more concise, it decreases to 22.9, and that really doesn't fit our mental model of what benchmarks should be doing: if I just tweak the prompt a little bit, I don't want my model to completely change its ranking. So that's why we have to do some reweighting, and you see that after the reweighting, the performance after you ask the model to be more verbose is very close to the performance without any prompt tuning. Cool. So, I told you slightly, or very briefly, before about self-bias. I do want to say that I'm pretty surprised about this result, but actually self-bias exists and is not as high as you might think. So here you see on the rows the different models that you're evaluating, and on the columns you see who is evaluating, which model you are using for evaluation. And you actually see that regardless of the model that you evaluate with, the ranking will be the same. So even though it's true that, if I look at Mistral evaluated by Mistral, it gives itself a much higher accuracy, it still prefers Claude and GPT-4. So it's not as bad as what you might think; it's still bad, though. Cool. Okay, so that leads me to talking about current evaluation of LLMs. I would say there are three main ways that people currently evaluate LLMs. The first one is perplexity, which is essentially just looking at training losses or validation losses. The second one is basically averaging over everything, which is actually surprisingly more common than what you might think. And the third one is this Arena-like style, where you basically have side-by-side comparisons between models.
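The first of those three, perplexity, is just the exponential of the average per-token negative log-likelihood on held-out text, i.e. a monotone transform of the validation loss. A minimal sketch:

```python
# Perplexity as the exponential of the average per-token
# negative log-likelihood under the model.
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities the model assigned
    to each token of the held-out text."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token has perplexity 4.
lps = [math.log(0.25)] * 10
ppl = perplexity(lps)  # 4.0
```

This is why pretrained-model releases can report it directly from their evaluation loss, while for fine-tuned chat models the predicted likelihoods are, as noted below, poorly calibrated for this purpose.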
[00:55:30] You either use humans or use models to do the evaluation. And usually how it works is that for pretrained models, say when Llama 4 comes out or when GPT-5 comes out, they basically mostly show perplexity and averages over everything; and the fine-tuned models usually tend to show averages over everything and Arena-like performance under Arena-like setups. The reason why is that for models that are fine-tuned, the log-likelihood that they predict is not calibrated for your dataset. So what do I mean by averaging over everything? I would say the two most common benchmarks that basically look at everything are HELM and the Hugging Face Open LLM Leaderboard. It's really just a collection of a lot of different automatically evaluated benchmarks, and you evaluate across all of them. So what are some of the common benchmarks that we use? One is measuring math performance, so GSM8K, that's a pretty common one, basically grade-school math. MMLU is multiple-choice question answering on, like, math, science, history. LegalBench is on the legal side. Then you have MedQA, and I believe this is for HELM: MedQA is on medical licensing exams. So you basically ask many, many different questions that you can automatically evaluate, and you hope that by taking averages it will say how well your model performs. That's kind of the newer version of SuperGLUE, I would say. One dataset, or one benchmark, that I want to highlight, which is probably the most widely used and the one that people believe the most, is MMLU: Massive Multitask Language Understanding.
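Scoring an MMLU-style benchmark is mechanically simple, since every question is multiple choice: it reduces to exact-match accuracy over predicted option letters. A sketch (in practice, reliably extracting a clean letter from the model's free-form output is the harder part, which is one reason it's "more complicated than you might think"):

```python
# Sketch of MMLU-style scoring: exact-match accuracy over the
# predicted option letters of multiple-choice questions.

def multiple_choice_accuracy(predictions, answers):
    """predictions, answers: equal-length lists of option letters."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

preds = ["A", "C", "B", "D"]
golds = ["A", "B", "B", "D"]
acc = multiple_choice_accuracy(preds, golds)  # 0.75
```

The benchmark then averages this accuracy across its 57 task categories to produce the single headline number.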
understanding um so this is I I think maybe Archard [00:57:22] this is I I think maybe Archard mentioned it last week but this is [00:57:25] mentioned it last week but this is basically um uh multiple choice uh [00:57:29] basically um uh multiple choice uh questions on 57 different tasks you so [00:57:32] questions on 57 different tasks you so you have tasks like formal logic [00:57:34] you have tasks like formal logic conceptual physics econometrics and and [00:57:36] conceptual physics econometrics and and uh these type of tasks so here's an [00:57:38] uh these type of tasks so here's an example um what is true for type 1 a [00:57:42] example um what is true for type 1 a supernova uh this type occurs in binary [00:57:45] supernova uh this type occurs in binary system this type occurs in young [00:57:46] system this type occurs in young galaxies and you basically have to say [00:57:48] galaxies and you basically have to say which answer so that seems very simp I [00:57:50] which answer so that seems very simp I mean the task is not simple but the way [00:57:52] mean the task is not simple but the way you evaluate seems simple uh and then [00:57:54] you evaluate seems simple uh and then like high school biology in a population [00:57:55] like high school biology in a population of Gira an environmental and then you [00:57:58] of Gira an environmental and then you this is an example of directional [00:58:00] this is an example of directional selection um so that seems simple but [00:58:03] selection um so that seems simple but actually it's it's also more complicated [00:58:05] actually it's it's also more complicated than what you might think [00:58:08] than what you might think um and I think I will tell [00:58:11] um and I think I will tell you okay I will tell you about it later [00:58:14] you okay I will tell you about it later um but that's that's mo one of the most [00:58:16] um but that's that's mo one of the most common probably the most common [00:58:18] 
And it's what people actually look at: for example, when Mark Zuckerberg said that Llama 3 was out, he talked about MMLU scores, which I find kind of crazy, but yeah.

Other capabilities that people look at: coding. Coding is a very common one that people evaluate on, for a few different reasons. One, because if a model performs well on code, it usually also performs well on reasoning, which is actually pretty cool, so it's highly correlated with things that people care about. Two, a lot of us are coders, so we like to have better models to help us code. And three, it's actually pretty easy to evaluate, because you can write test cases: you basically ask the model to generate code, or functions to do something, and then you just run the tests and see whether it succeeds or not.

Yes? [Student:] "Sorry, going back to the previous evaluations: some of them were short answer. Multiple choice makes sense, but if it's short-answer QA, how would you say something is correct as an automatic metric, thinking specifically of the top one?" Huh, I actually don't know; I should check, sorry. So I don't know specifically for this one, but HotpotQA and BeerQA are other QA datasets, and they look at F1, and then they also have an exact match, which is pretty punitive, because if you say "President Reagan" and the answer is "President Ronald Reagan," it will penalize you. But anyway, they use an exact match on that. Cool, thanks. Okay, so we were at coding.
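The exact-match versus F1 distinction just mentioned can be sketched with the standard SQuAD-style token-level metrics. A minimal illustration (the normalization here is deliberately simplified; real evaluation scripts also strip articles and punctuation):

```python
# SQuAD-style short-answer metrics: exact match (EM) is all-or-nothing,
# while token-level F1 gives partial credit for overlapping words.
from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# EM punishes the missing middle name; F1 still gives partial credit.
print(exact_match("President Reagan", "President Ronald Reagan"))  # 0.0
print(token_f1("President Reagan", "President Ronald Reagan"))     # 0.8
```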
Another one that people are starting to look at is agents. I think Shikhar is going to give a lecture on it, so I'm not going to talk too much about it, but one cool thing that LLMs can do right now is basically call APIs and then take actions in the real world, essentially, or take control of your computer. (You should not give it control of your computer.) So a natural question is how you evaluate these types of things, and this is a real challenge. The biggest challenge is that if, for example, I really wanted to evaluate how good a model is at coding, or how good it is at doing things in my terminal, I would need to give it access to my terminal, and I really don't want to give my LLM access to my terminal. So you really need sandbox environments. For the specific case of the terminal it's pretty easy to sandbox, but once you want to evaluate a model that, I don't know, pings people on Slack or writes things in your emails, then you have to write an entire sandbox environment for all the applications that you want your LLMs to have access to. So this is actually really complicated, and something that people really have to deal with in the real world. At least we have to, because right now it's still not in production.

Okay, the last part, or the penultimate one: perplexities. One thing which is very surprising, at least the first time you see it, is that the performance you get during pre-training is extremely highly correlated with performance on basically any downstream task, at least for the current types of LLMs. What I mean by this is that your training performance, just predicting the next word, is extremely highly correlated with downstream performance.
So on this plot, the x-axis is essentially perplexity, and the y-axis is just the average over many different tasks. What you see is that models that do well on perplexity will actually have high average scores. As a result, a lot of people, while they're developing, end up just looking at perplexities, and they trust it enough that they don't do the downstream evaluations. I would not recommend doing that, but if you have to have something quick and dirty, it usually works pretty well. One thing to be careful with, though, is that perplexities are not comparable across different datasets, so you really have to be careful about which perplexities you're looking at. And two, it will depend on the tokenizer: if you take, say, Llama 3 and compare it to Gemini, even on the same dataset it's going to give different scores, and it's not comparable.

Yes? [Student question, inaudible.] The easy answer, I mean it's not the only answer, but the easy answer, is that if the vocabulary changes, the size of the vocabulary changes, then clearly the upper bound is different. [Student follow-up about sequences, partly inaudible.] But I'm not talking about that. I'm talking about the fact that, just think about it: if you have a vocabulary size of one, then I always have to predict the same thing. Basically, your entropy is upper-bounded by the log of the cardinality of your vocabulary, so you're going to depend on that.

Cool.
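To see why the tokenizer matters: perplexity is the exponentiated average negative log-likelihood per token, and a maximally uncertain model over a vocabulary of size |V| has perplexity exactly |V|, so the achievable range itself depends on the vocabulary. A toy sketch (the vocabulary sizes are illustrative, not any particular model's):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A maximally uncertain model puts probability 1/|V| on every token, so its
# entropy is log|V| and its perplexity is exactly |V|: the scale of the
# number is tied to the tokenizer's vocabulary size.
for vocab_size in (32_000, 128_000):
    uniform = [math.log(1 / vocab_size)] * 10
    print(round(perplexity(uniform)))  # 32000, then 128000
```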
And the last one is the Arena. As I already told you, you basically compare different models, you make them fight against each other essentially, and you have Elo ratings at the end. A more general way of saying it: you really just let the users decide. And that works pretty well too.

Okay, issues and challenges with current evaluations. First, consistency issues. If you look at question answering, sorry, multiple-choice questions: you see on the top left and top right that if you just change A/B/C/D to random symbols, the generations you get are actually going to be different, and then the rankings between different models will be different. So even things that are very simple, like multiple choice, like selecting out of four choices, will be very dependent on exactly how you format these choices.
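Going back to the Arena for a second: the Elo-style pairwise updates it is built on can be sketched as follows (K = 32 and the 400-point scale are the usual chess conventions; the real leaderboard fits ratings more carefully):

```python
# Minimal Elo updates of the kind Arena-style leaderboards are based on:
# each pairwise "battle" moves the winner up and the loser down, by more
# when the result was an upset.

def elo_update(r_winner, r_loser, k=32.0):
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))  # P(winner wins)
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

a, b = 1000.0, 1000.0
for _ in range(3):  # model A wins three user votes in a row
    a, b = elo_update(a, b)
print(a > b)  # True: A has pulled ahead of B
```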
And one real example, which is what I was alluding to before, is MMLU. MMLU seems really simple to evaluate: you just ask which one of the four choices the model prefers. But actually, for a very long time, I think for nearly one year, there were three main implementations of MMLU, and people were comparing between those three having no idea that they gave different scores. The two main differences were: one, people used different prompts, and that clearly will give different answers; but two, they were using different ways of sampling to get the actual most likely prediction. One of them, for example, said: I have the four choices, and to get the most likely answer (let's say the correct answer is D) I will just look at the most likely answer out of A, B, C, D, even if, say, "zygote" was another answer that had a higher likelihood. I will not look at it, because I will basically do constrained decoding, and if I do constrained decoding here, I will say that the correct answer is D. But if I actually just look at the most likely token overall, I will not get the correct answer. So those were two different implementations. And a third implementation, which seems really different, is that instead of generating the correct token, which is basically the letter A, B, C, or D, you look at the likelihood, after this question, that the model would generate each full answer. So you look at the log-likelihood, or essentially the perplexity, of predicting each answer text. And that gives very different answers.
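To make the three scoring schemes concrete, here is an illustrative sketch on invented log-probabilities (the numbers and the "zygote" distractor token are made up for the example; real implementations score actual model outputs):

```python
# Three ways to score the same MMLU question, on toy (invented) numbers.
# Suppose the model's next-token log-probs after the prompt are:
next_token_logprobs = {"A": -3.0, "B": -2.5, "C": -4.0, "D": -1.8, " zygote": -1.2}

# 1) Constrained decoding: argmax over the four letters only.
constrained = max("ABCD", key=lambda t: next_token_logprobs[t])

# 2) Unconstrained: argmax over the whole vocabulary; here " zygote" wins,
#    so the model is marked wrong even though D beats A, B, and C.
unconstrained = max(next_token_logprobs, key=next_token_logprobs.get)

# 3) Answer-text likelihood: score the log-likelihood of each full answer
#    string given the question (toy numbers again) and pick the highest.
answer_text_logprobs = {"A": -12.0, "B": -9.5, "C": -11.0, "D": -10.2}
by_likelihood = max(answer_text_logprobs, key=answer_text_logprobs.get)

print(constrained, unconstrained, by_likelihood)  # three different "answers"
```

On these toy numbers the three implementations return D, " zygote", and B respectively: same model, three different scores.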
So if you look at the top right, you see that Llama 65B's MMLU score on HELM was 63.7, and with the original implementation 63.6, but on the Harness, which is the one that Hugging Face actually uses, it was 48.8. So that's a huge difference. [Student:] "What are HELM, Harness, and original? Do they match these three schemes?" Yeah, I can't remember which one does which, but each of them does something different. Actually, that's not true anymore: the middle column changed what they were doing, so they started matching the other two, but at that time they didn't. I'm not sure which one, but my guess would be that they did the last one; I'm not sure, though. Okay, questions? Cool.

Another issue: contamination. So here you have Horace He; if you don't follow him on Twitter, you should. He basically said that he was looking at code benchmarks, and on pre-2021 Codeforces questions GPT-4 was getting 10 out of 10, but on problems after 2021, more recent problems, it was getting zero out of 10, which seems very, very strange. So that strongly points to the fact that it was contaminated: the model was probably pre-trained on that data, or the Codeforces dataset was probably in the pre-training data. And of course, if you essentially do training on your test set, then you're going to perform really well. Susan Zhang, also someone to follow, said something similar about Phi-1.5, which is a model from Microsoft.

So what is challenging here is that with closed models there are actually two things that are challenging. One is that these are pre-trained on so much data that, even if we had access to the data, it would be hard to actually know whether they were pre-trained on your test set.
But two, those are all closed-source models, so you don't even have access to the dataset, and you have no idea if they were pre-trained on that data.

Overfitting issues: that's related, but can be slightly different. Here you see how much time it took for standard datasets to achieve, in quotes, "human-level performance," and what you see is that for the recent ones, where you really have this pre-training, in less than six months you reach human-level performance. We don't really know if it's because of contamination, or if it's simply that a lot of people are developing and doing hyperparameter tuning on these test sets. We don't know why, but it's clearly an issue with overfitting.

So how do you alleviate that? One: you can have private test sets. There's a paper from, I think, two weeks ago that presented GSM1k, which is the same thing as the GSM8K we saw before, the math dataset, but they basically regenerate, or resample, or recollect the dataset. Then they look at how well different models perform on both GSM1k and GSM8K, and what you see is that, at least for the open-source models, they perform much worse on the new dataset than on the one that people were able to tune on. This is not true, though, for Claude and GPT-4.

Another one is DynaBench, or just dynamic test sets. Ideally, every X number of days you would have new instructions, or new inputs to the models, and your dataset would basically be dynamic. That's essentially also what Chatbot Arena does, so that definitely helps.

Another way of alleviating contamination is to try to estimate, or to look at, whether the models were actually trained on your test set. One very simple way of doing it, which I think actually works relatively well, is just looking at the probability of different answers: if your model is really sure about a certain answer, then it was probably trained on that answer. Another one, which is also really cool, is looking at the order of your test set: if a model was trained or pre-trained on the test set, then most likely it thinks that example two comes after example one. So if you switch example one and example two and you see drops in log-likelihood, then most likely the model was actually pre-trained on that dataset.
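The order-swap check can be sketched like this. The `seq_logprob` function is a toy stand-in for a model's sequence log-likelihood, rigged here to have "memorized" the test set in its original order; with a real model you would score the concatenated examples both ways:

```python
# Toy sketch of the order-swap contamination test: if a model's likelihood
# drops sharply when two adjacent test examples are swapped, it has probably
# seen the test set in its original order during pre-training.

MEMORIZED = "example_1 example_2 example_3"  # pretend pre-training saw this

def seq_logprob(text):
    # Hypothetical stand-in for a model's sequence log-likelihood: the
    # "memorized" order is very likely, any other order much less so.
    return -1.0 if text == MEMORIZED else -20.0

original = "example_1 example_2 example_3"
swapped = "example_2 example_1 example_3"  # swap examples 1 and 2

drop = seq_logprob(original) - seq_logprob(swapped)
print(drop > 5.0)  # True: a large drop suggests contamination
```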
Cool, any questions here? Okay. So another issue is that there's really a monoculture of NLP benchmarking. What I mean by this is mostly the fact that we all just look at English. This is a paper from 2021 or 2022, I think: they look at ACL 2021, which is probably the most common conference in NLP, and at the best papers, the oral papers. They saw that out of the 461 papers, 70% of them only look at English and 40% of them only look at accuracy, so essentially just performance. There are very few papers that look at multilinguality, or even at efficiency, interpretability, or fairness. And there's a similar paper that analyzes another conference from 2008, with essentially the same finding, so unfortunately it doesn't seem to improve over time.

The thing is, there are actually a lot of benchmarks for multilinguality. I'll just highlight a few here: MEGA, GlobalBench, XTREME. Those have at least 30 or 40 languages and many, many different tasks. So it's not that we don't have the benchmarks; it's that there are, unfortunately, no incentives in academia to actually evaluate on those benchmarks. So if you have the chance, use those benchmarks.

Another issue is that we reduce everything to a single metric. I already told you that the way we aggregate metrics is usually kind of broken in some of these super-benchmarks, but also, we only look at performance, and in the real world we really care about computational efficiency too; we also care about biases and about many other aspects, and most of these benchmarks don't consider those.
same weight um so this is definitely unfair [01:13:13] weight um so this is definitely unfair for like minoritized groups but more [01:13:15] for like minoritized groups but more than this I think if for example if you [01:13:19] than this I think if for example if you think about like um agents where maybe [01:13:21] think about like um agents where maybe one example will be like how well it [01:13:23] one example will be like how well it performs on um [01:13:26] performs on um I don't know writing codee that will [01:13:27] I don't know writing codee that will actually be put in production versus [01:13:29] actually be put in production versus just like [01:13:31] just like uh like answering your daily question [01:13:34] uh like answering your daily question about like where I don't know where to [01:13:36] about like where I don't know where to buy the best burger um like the value [01:13:40] buy the best burger um like the value that you will get out of these examples [01:13:41] that you will get out of these examples are very different and we right now when [01:13:43] are very different and we right now when we evaluate stuff we don't actually [01:13:44] we evaluate stuff we don't actually consider that so that's I think a real a [01:13:46] consider that so that's I think a real a real issue um and also we basically we [01:13:49] real issue um and also we basically we don't take into account that different [01:13:50] don't take into account that different people have different [01:13:52] people have different preferences um so a few outs one [01:13:56] preferences um so a few outs one considering computational efficiency so [01:13:58] considering computational efficiency so ml puff has a great Benchmark uh where [01:14:00] ml puff has a great Benchmark uh where basically instead of trying to maximize [01:14:02] basically instead of trying to maximize the performance on a certain Benchmark [01:14:05] the performance on a certain Benchmark they say I want to 
achieve that performance in the least amount of time. So now you basically consider both accuracy and speed, either for training or for inference. [01:14:18] For biases, DiscrimEval is a good dataset from Anthropic, where basically they have some templates: they ask questions like whether someone should keep their insurance or not, and they have templates where they change the race or the gender of the person in the template, and they see how the decisions made by the model change. And, unfortunately but unsurprisingly, you will see that some groups are discriminated against much more than others. [01:14:58] There are other biases in our evaluations too. I already told you a bit about the multilingual issues, but honestly, this issue with English is much more prevalent than you would think.
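A DiscrimEval-style probe can be sketched like this; the template wording and the `model_decide` hook are hypothetical stand-ins for illustration, not the actual dataset or API:

```python
from itertools import product

# Hypothetical decision template in the spirit of the templates described above.
TEMPLATE = ("Should the company renew the insurance policy of a "
            "{race} {gender} customer? Answer yes or no.")

def probe_decisions(model_decide, races, genders):
    """Ask the same question for every demographic variant and collect the decisions."""
    return {(r, g): model_decide(TEMPLATE.format(race=r, gender=g))
            for r, g in product(races, genders)}

def yes_rate(decisions):
    """Fraction of variants answered 'yes'; an unbiased model should be flat across groups."""
    return sum(d == "yes" for d in decisions.values()) / len(decisions)

# Usage with a constant stand-in model; a real evaluation would call an LLM here.
decisions = probe_decisions(lambda prompt: "yes", ["A", "B"], ["woman", "man"])
print(yes_rate(decisions))  # 1.0 for this constant stand-in
```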
Take BLEU and ROUGE: they really assume that you basically have access to words, that you know how to tokenize and how to get words. I used to work with Thai and Vietnamese: with Vietnamese you have spaces in between words, and in Thai you have no spaces between words, so you have no idea how to run BLEU or ROUGE. Really, it's much more than just the data: all our algorithms are really focused on English, or at least Western languages. [01:15:35] Then there are biased LLM-based evaluations. One thing I told you about is that it's really cool that you can now use essentially GPT-4 for doing labeling, but that also means that, given that GPT-4 is very consistent, if it has some biases, then most of the NLP community will have these biases, scaled up, essentially.
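The whitespace assumption mentioned a moment ago is easy to demonstrate with a toy unigram-precision score (a simplified stand-in for BLEU/ROUGE, not their real implementations):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision over whitespace tokens: the assumption BLEU and ROUGE bake in."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(n, ref[tok]) for tok, n in cand.items())
    return matched / max(sum(cand.values()), 1)

# Space-separated text gets sensible partial credit...
print(unigram_precision("toi an pho", "toi an com"))  # 2/3: two of three tokens match

# ...but unsegmented text (as in Thai) is one giant "word": all or nothing.
print(unigram_precision("Ieatnoodles", "Ieatrice"))   # 0.0
```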
[01:16:06] One benchmark tries to look at whose opinions LLMs reflect by default. This is actually pretty cool work that looks at the output distribution of LLMs on public opinion surveys, just trying to understand which groups' opinions LLMs reflect. They find that when you only do pre-training, the models actually do relatively well: they are not too optimized toward a single group. But after fine-tuning, this is in red, you basically see that the models really start being optimized for certain preferences, which is unsurprising, because that's how we actually train the models. And typically these models mostly answer as if they were from, I mean, white and Southeast Asian respondents. I think that observation is actually pretty interesting; I think it's probably because a
lot of these models, the human data that was used for supervised fine-tuning and for RLHF was actually labeled by people in Southeast Asia, which would explain why these models have these types of views, and usually by highly educated people too. [01:17:26] Okay, so this is the main challenge, the challenge of all challenges. We saw that there are many challenges in evaluation, at least in academic benchmarking, but the biggest one is that there are really no incentives for us to move to anything else. There's an actually pretty interesting paper that looks at machine translation papers, many papers from 2019 to 2020 in machine translation, and they found that 82% of the papers evaluated only BLEU scores. And as we said, BLEU scores have many, many issues, and we know that there are many better metrics, but still, people are not
incentivized to look at anything else; actually, reviewers will usually ask you to show performance on BLEU scores. So it's not even just that you're incentivized not to look at something else: you're also incentivized to continue. And it kind of makes sense, because you want to be able to compare to methods from two or three years ago, but it also means that it's hard for the academic field to change to other benchmarks. This is really specific to academia, though: in reality, if you know that your metric is bad, just switch. [01:18:37] Okay, evaluation takeaways. First, I mentioned that there are different types of evaluation, and different desired properties for the different types of evaluation. Then I talked about close-ended tasks and how you evaluate those: the fact that it's basically standard machine learning,
but that you have to think carefully, even though it's standard machine learning, about how you evaluate them. Then there are open-ended tasks, where you typically look at content overlap metrics, so things like BLEU and ROUGE, and BERTScore. And then you have chatbot evaluations, which are extremely difficult, but which people have started doing using essentially LLM-based evaluations. Then we talked about challenges: one of them being consistency, another contamination, and the third one biases. [01:19:32] In reality, honestly, the best evaluation is just to check your outputs. I think too many people just believe numbers; in reality, never just believe numbers. I remember when we initially did Alpaca: we kind of believed our AlpacaEval, but once we actually played with it, that's when we were like, okay, this thing is actually, I mean, at that
time, good; now it would be a pretty bad model, but at that time we were like, okay, this thing is actually pretty good, we should do something about it, even though on maybe standard academic benchmarks it was pretty bad. So yeah, don't rely on numbers. And I'm happy, what time is it, to take any other questions that you may have. [01:20:18] Yes? Question: there's this whole issue of bias which we're really trying to deal with, but we're sweeping under the rug here. So if we have a problem where we're dealing with a very specialized domain, and we try to run reference-free evals using, let's say, GPT-4: is it considered bad practice to be checking a subset of these GPT-4 evals, ranking them ourselves, and then inserting ourselves and our bias
into this process, by actually looking at many, many data points? So, just to make sure I understand your question: you're saying that if we try to look at the answers ourselves, we might be incorporating some biases there? Yes, but we should look at the answers to make sure that GPT-4 isn't being biased when it looks at the answers; there's this tension here, and I don't know what the, because in a controlled scientific experiment you would blind yourself to looking at these answers. How do you deal with this? Yeah, that's a good question; I actually don't quite know. But one thing: I actually feel less concerned about the biases of a single person. My issue with the GPT-4 biases is that they are the same across every model, so things really scale up and it becomes a monoculture, and I think that's much worse than
if everyone incorporates a little bit of the biases that they have, in their own direction. I'm not saying that's the best answer, but I think it's slightly better than just going with whatever they have. Yeah? Following up on that: how does one avoid a situation where one is trying to solve a problem with a model, and one evaluates it with GPT-4, and then one starts to look at it and say, okay, is this good, and then one goes, okay, this is great, and everyone else in the world, and GPT-4, thinks it's a terrible, terrible model, and it's just some academic pressuring themselves into publishing something that doesn't actually work? How does the field structurally avoid situations like that? Well, I think that's one reason why they
want standardized benchmarks, and why every reviewer actually wants standardized benchmarks: because at least, even though everyone knows the benchmarks are wrong, they understand how they are wrong. So I think that's one perspective. Another thing, which doesn't completely answer your question but which I think could be a potential solution: the way I view GPT-4 is as something that is really good at performing what I want it to perform. Right now the thing is, I'm not very specific about what I want it to perform, and as a result it will basically come in with its own biases that come from its pre-training data or fine-tuning data. A potentially better way of doing it is that I could write exactly what I want. Right now, when we do the prompting to GPT-4, I basically ask a simple question, like: how good is the summary, out of five?
But a much better way would probably be writing a very detailed rubric of everything that has to be in the answer for it to be a good answer. If you think about it, this is exactly what professors do when they evaluate for a class: they basically say, okay, Yan is a TA, but I cannot trust him blindly, so what I will do is write a very detailed rubric, and I trust that he can apply that rubric. I think that's also how we should be thinking about GPT-4, and this is not how we currently do it. [01:24:13] Any other questions?

================================================================================ LECTURE 013 ================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 12 - Efficient Training, Shikhar Murty
Source: https://www.youtube.com/watch?v=UVX7SYGCKkA
---
Transcript

[00:00:06] Okay, cool, let's just get started. Welcome, everyone, to lecture 12. So far we've learned a lot about how we convert words into vectors, and how we convert sentences
into vectors, and how we basically take actions in the real world using that, like classifying documents. We learned about Transformers, and we learned about pre-training. Today is going to be a little bit different: I'm going to be talking about how you can train large models on GPUs, and a few basics about how these ML systems work. It has nothing to do with natural language at all, but hopefully it's going to be useful for final projects. [00:00:51] I'm going to spend some time on mixed-precision training, some time on multi-GPU training with DDP and FSDP, and hopefully by the end of the lecture these terms will make sense, and some time on parameter-efficient fine-tuning. But before we get into the lecture, just some announcements. Proposal grades are going to be coming out shortly, hopefully by
the end of the day. Thank you so much for all the hard work; I know it's getting a little bit crammed with a lot of deadlines for assignment 4 and the project proposal, so thank you so much for all your hard work. The other thing is the project milestone: the details should be out shortly, if not already out, on the website. It's worth 5% of the overall grade, it's due 12 days from now, and it's a maximum of two pages. Really, the way to think about the milestone is to use it as a forcing function to get work done for your final project. [00:01:57] With that out of the way, let's jump into the material. I'm going to start by thinking about how parameters and gradients, and generally numbers, are represented in computers, and I promise it's going to be relevant to deep learning pretty soon. So let's start with floating point. How many
people here are familiar with this cartoon depiction of fp32? Okay, so some of you. So let's recap how floating-point numbers are represented in computers. First, fp32: that's 32 bits, so the memory requirement is 4 bytes. If you're thinking about neural networks, then for every single neural net parameter you need four bytes of GPU memory. The way to convert this cartoon into a real number is something like this: the first bit there is the sign, then the stuff in green represents the range, and the stuff in blue represents the precision. So for fp32, you can represent a pretty large range, and it's fairly precise.
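That cartoon can be made concrete: an fp32 value is 1 sign bit, 8 exponent bits (the green range field), and 23 fraction bits (the blue precision field), which plain Python can pull apart with `struct`:

```python
import struct

def fp32_fields(x):
    """Split a float's IEEE fp32 encoding into its (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit: the sign
    exponent = (bits >> 23) & 0xFF  # 8 bits: the "green" range field, biased by 127
    fraction = bits & 0x7FFFFF      # 23 bits: the "blue" precision field
    return sign, exponent, fraction

print(fp32_fields(1.0))   # (0, 127, 0): biased exponent 127 encodes 2**0
print(fp32_fields(-2.0))  # (1, 128, 0): sign bit set, exponent one higher
```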
The larger the green part is, the more numbers you can represent, which means smaller numbers and also larger numbers; and the more stuff in blue we have, the greater the precision in representing actual numbers. [00:03:27] Another popular data type, which takes half the memory of fp32, is fp16. The way we reduce memory is that we reduce the stuff in green, so there's less dynamic range, and also the stuff in blue, which means less precision. But the good thing is that we save memory: we slash the memory requirement in half. [00:03:59] So let's think of a scenario where you're trying to train a big neural network, and your model parameters and gradients are represented in fp32. You start training and suddenly you get an out-of-memory CUDA error. So, just based on what we've
seen so far, one possible solution is to cast everything into fp16, and if you do that, you reduce memory usage by half. [00:04:26] So let's work through some possible problems with doing something like that. Like I said, because there's less stuff in green, there's going to be less range, and that means a lot of very small numbers will get converted to zero, and a lot of really large numbers will overflow. And there's also less precision, because you have fewer bits in blue, which means you're going to get rounding errors; for example, 1 + 0.0001 gets rounded back to 1 in half precision. I have a little screenshot of how you can test various properties of data types. Basically, the things to look at are the epsilon and the smallest normal. The epsilon is the smallest number such that, if you add it to one, you don't lose any
precision; if you add a number smaller than epsilon to one, it just gets rounded back down to one. And the smallest normal is the smallest normal number that can be represented in fp16; anything smaller than that goes straight to zero. [00:05:33] For neural network training, if a lot of small numbers get rounded down to zero, that's actually not good. Here's a diagram that I took from an Nvidia blog post, showing some gradients during the course of training: more than half of these gradients would literally just get set to zero in fp16, which is kind of a problem, and that has to do with the range of fp16. The second problem is precision: we have less precision, so our updates are not going to be precise.
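These properties can be checked without a GPU: Python's `struct` supports the IEEE half-precision format, so a round-trip through it shows the epsilon and underflow behavior just described (in PyTorch, `torch.finfo(torch.float16)` reports the same numbers):

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE fp16, i.e. what casting a value to half does."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

eps = 2.0 ** -10            # fp16 epsilon: the gap between 1.0 and the next fp16 value
print(to_fp16(1.0 + 1e-4))  # 1.0: an increment below eps/2 is rounded away
print(to_fp16(1.0 + eps))   # 1.0009765625: exactly representable, no loss
print(to_fp16(1e-8))        # 0.0: far below fp16's smallest number, flushed to zero
```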
[00:06:21] Okay, so here's one possible solution: we're going to use fp16, but we're also going to use fp32. That's the high-level idea. What we're going to do is maintain a copy of the model in fp32, and let's call those the master weights. Then you get a little bit of data and run a forward pass, and you run that forward pass by converting from fp32 into fp16. Then you run a backward pass and get your gradient in fp16; so everything so far has happened in fp16. Then you take your gradients, upcast them into fp32, and update your master weights, and once you've updated your master weights, you copy them into the fp16 version of the neural network. So this seems like a reasonable scheme: I'm using fp16 on my GPU, but I have the full 32-bit precision also lying around somewhere, so I can have more precise updates.
[00:07:21] Okay, can someone tell me why this is still problematic? Any guesses? Yeah? "One issue would be that it's really slow, because you have to copy the 32-bit versions from the GPU into some..." Yeah, so that's a good point. You can often overlap I/O with forward and backward passes, so practically this is not a problem, but potentially, if your network is very, very small, this would be a problem. Yeah? "Gradients are usually fairly small, like individual gradients are usually fairly small, and when you copy the fp16-computed gradients into fp32 you may be sending your network somewhere you don't want it to go." So yeah, that's pretty much the right answer. So let's go back to this diagram that we had; this shows gradients in the backward pass.
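To make the master-weights argument concrete, here is a toy sketch with a hypothetical one-weight model, using `struct` round-trips to stand in for fp16 storage and a plain Python float to stand in for the fp32 master copy; the 4e-4 update size is invented for illustration.

```python
import struct

def fp16(x: float) -> float:
    """Simulate storing a value in IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

update = 0.0004           # hypothetical step, smaller than half of fp16's epsilon near 1.0
master = 1.0              # high-precision master copy (Python float stands in for fp32)
fp16_only = fp16(1.0)     # the same weight kept only in fp16

for _ in range(100):
    master += update                      # accumulates precisely
    fp16_only = fp16(fp16_only + update)  # rounds straight back to 1.0 every step

print(round(master, 6))   # 1.04  -> all 100 updates survived
print(fp16_only)          # 1.0   -> every update was lost
```

This is exactly the failure the master-weights scheme fixes: updates happen in high precision, and only the result is cast back down for the next forward pass.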
[00:08:19] I said that we're going to compute all our gradients in fp16, and what's going to happen? Most of them will just get converted to zero, which is something that we really would like to avoid. So here's a possible solution: you get your batch of data, you run your forward pass in fp16, you get your loss, and you scale the loss by some large value, let's say 100, let's say 1,000, and then you compute gradients. Now you've scaled your gradient by a large number, so everything that we had on the left-hand side of this red line just gets shifted to the right, and hopefully there's less stuff that will get rounded down to zero.
[00:09:12] Okay, so then you compute your gradient in fp16, copy it into fp32, divide it by the scaling factor, and then update your master weights. So this will solve both of the problems that we talked about, and this is basically what we call mixed-precision training. And it's relatively simple to implement this in PyTorch: all you have to do is instantiate this GradScaler object, and then within the context of this autocast you run your forward and backward passes, and then scale down your gradient and update your model parameters. But this seems a little complex: we have to deal with scaling the loss and then scaling it back down. What if you multiplied it by 10,000 and that leads to NaNs? Then you have to update your scaler.
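Here is the underflow-and-rescue mechanic in isolation, again simulating fp16 storage with `struct`; the gradient value and the 1024 scale factor are made-up illustrative numbers, not PyTorch defaults.

```python
import struct

def fp16(x: float) -> float:
    """Simulate storing a value in IEEE half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

SCALE = 1024.0        # illustrative loss-scale factor (chosen for this example)
true_grad = 2e-8      # a tiny gradient, below fp16's smallest subnormal

# Without scaling, the fp16 gradient underflows to zero:
print(fp16(true_grad))               # 0.0

# With scaling: scaling the loss scales the gradient too, so it survives
# fp16 storage; then upcast to fp32 and divide the scale back out.
scaled = fp16(true_grad * SCALE)     # representable in fp16's subnormal range
recovered = scaled / SCALE           # ~2e-8, close to the true gradient
print(recovered)
```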
[00:10:16] In the next iteration you multiply by 1,000, and you have to keep adjusting to the network dynamics. So we'd like to not do gradient scaling; can we do something better? Okay, so the reason why we have to do the scaling: just recall the role of the bits in green, which tell you the dynamic range of the data type. We needed scaling because fp16 has a much smaller range compared to fp32, and because of that, fp16 cannot represent very small numbers. So how do we solve this? Any ideas? [00:11:03] Yeah, so here's the problem: in fp16, because you have fewer bits for the exponent, you can't represent very small numbers, so if you have something that's smaller than about 6e-5, it gets rounded down to zero, and that's because of the dynamic range of fp16. So how do you solve that?
[00:11:27] "Sacrifice precision, so have more green?" Absolutely, yeah, that's the right answer. So what we're going to do is sacrifice precision, and that's the idea behind bfloat16, which stands for brain float 16. You're going to have exactly the same number of bits for representing the range, eight bits, so it has the same dynamic range as fp32 but a lot less precision, and it turns out that this is okay for neural network training. And now, if you use bfloat16, you don't need to use grad scalers anymore; it's as simple as wrapping your model's forward pass and backward pass within the right context. The one caveat about bfloat16 is that it's not available on all GPUs: you need the more recent NVIDIA architectures, Ampere and newer, which the H100s, the A100, and the A6000 have.
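The trade-off falls straight out of the bit layouts. A small sketch that derives range and precision from exponent and mantissa widths alone (the `layout` helper is our own illustration, not a library API):

```python
# Derive a float format's range and precision from its bit layout alone.
def layout(exp_bits: int, mant_bits: int) -> dict:
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "eps": 2.0 ** -mant_bits,                        # precision ("blue" bits)
        "smallest_normal": 2.0 ** (1 - bias),            # low end of range ("green" bits)
        "max": (2.0 - 2.0 ** -mant_bits) * 2.0 ** bias,  # high end of range
    }

half   = layout(exp_bits=5, mant_bits=10)   # fp16
brain  = layout(exp_bits=8, mant_bits=7)    # bfloat16
single = layout(exp_bits=8, mant_bits=23)   # fp32

print(half["smallest_normal"])    # 6.103515625e-05: why tiny fp16 gradients vanish
print(brain["smallest_normal"])   # ~1.18e-38: identical to fp32's, same dynamic range
print(brain["eps"])               # 0.0078125: much coarser than fp16's 2**-10
```

Same eight exponent bits as fp32, hence the same dynamic range; the precision sacrifice all lands in the mantissa.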
[00:12:28] But if you have an older GPU, then you might not be able to utilize bfloat16. "Sorry, can you explain why there's less precision but the same number of bits? ...oh, never mind, sorry." Okay, so here are some results: someone fine-tuned DistilBERT for sentiment classification on a single A100. At the very top is float64, which is a really rich 64-bit representation of floating-point numbers: it takes about 25 minutes and you get a pretty high accuracy, but it also takes a lot more memory. And all the way down, we're using mixed-precision training with bfloat16, and now we have reduced training time by roughly a third, with more or less the same accuracy, actually a little bit better, because there's some regularizing effect from the half-precision representation, and a lot less memory.
[00:13:37] Okay, and the reason we see speedups for training is because matrix multiplies tend to be faster when you are multiplying in half precision. Okay, so before we move on, are there any questions about this? Okay, cool. So let's keep going, and let's change the setting: now we have more than one GPU, we have multiple GPUs, and we want to train a network over all of the multiple GPUs that we have. So let's start with some basics. Here's a cartoon showing, basically, a model and an optimizer receiving some data from a dataset, and let's work through what's stored in GPU VRAM. This is going to be somewhat of a lie, and I will point out what my lie is soon, but just to keep things simple: we have the neural net parameters, and let's say we're doing mixed-precision training, so they're stored in fp16.
[00:14:46] And then we have an optimizer. You know, when I first saw this a few years back, I was very surprised to see that optimizers also need memory, but if you're using something like Adam, then you need to store the Adam momentum term and the Adam variance, and every time you get a gradient you have to update the Adam momentum and variance; that's what you use for updating your parameters. And because you're using mixed-precision training, these have to be represented in fp32. Okay, so that's what the picture looks like if you have a single GPU. Now let's say we have multiple GPUs, and what we'd like to do is first divide our dataset: let's say we have four GPUs, so we'll divide our dataset into four parts, and we'll maintain a synchronized copy of the model, and every model receives its own slice of the dataset.
[00:15:43] Okay, so in the beginning we have a synchronized model and everyone has their own copy. We run a forward pass; this forward pass receives different data points, so every model is going to have different activations, and correspondingly every model is going to have different gradients. So you run a backward pass, and every model has a different gradient because it saw different data points, and then we're going to run a synchronization step. What synchronization is going to do is communicate gradients between different workers. So I'm going to introduce the first MPI primitive of this lecture, and that primitive is called the all-reduce operation. What all-reduce does is take four pieces of information, in this example on four different GPUs, merge everything together, and then distribute it to all of the GPUs.
[00:16:43] And the communication overhead of doing that is two bytes per parameter, because remember, we have fp16 gradients: two bytes per gradient, and that needs to be communicated, so the overhead is two bytes per parameter. Okay, so that's the all-reduce operation. And then once gradients have been communicated (they have to be communicated by gathering on one worker and then distributing the cumulative gradient), at that point every optimizer has the full gradient, and the optimizer can update the model so that you maintain synchronization. Okay, so that's the basic setup; that's known as distributed data parallel. That's good, but it turns out that it has really poor memory scaling, so let's go through the math for how much memory is needed.
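A single-process toy of what all-reduce computes (with summation as the reduction), using Python lists to stand in for four GPUs; real backends such as NCCL implement this with ring or tree communication, but the result is the same.

```python
# Toy all-reduce over 4 simulated "GPUs": merge (sum) every GPU's local
# gradient vector, then hand the merged result back to every GPU.
def all_reduce(per_gpu_grads):
    n = len(per_gpu_grads[0])
    total = [sum(g[i] for g in per_gpu_grads) for i in range(n)]
    return [list(total) for _ in per_gpu_grads]   # everyone gets the full sum

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # one row per GPU
print(all_reduce(grads))   # every GPU ends up with [16.0, 20.0]
```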
[00:17:45] So we have the model parameters in fp16, because we're doing mixed-precision training, and then for the gradient, we also have the gradient in fp16, so two bytes for the gradient. And then we have the stuff in green: let's say we're doing Adam, so we need to store the master weights (well, we need to store those regardless of whether we're doing Adam or not), and then we need to store the momentum and the variance. So that's 12 extra bytes per parameter, and this needs to be stored on every single GPU. And so the question is, can we do better than this? Now things are going to get a little bit more tricky, so if you have questions, just stop me, and we can go from there. The way we're going to improve our memory scaling is with a set of techniques that are together known as ZeRO.
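The bookkeeping above works out like this, a sketch of the per-parameter byte count under the mixed-precision-plus-Adam setup just described; the 7B-parameter model at the end is a hypothetical size for scale.

```python
# Per-parameter training state under mixed-precision DDP with Adam,
# as described above; every GPU stores all of it.
params_fp16   = 2   # model weights (fp16)
grads_fp16    = 2   # gradients (fp16)
master_fp32   = 4   # fp32 master weights
momentum_fp32 = 4   # Adam first moment (fp32)
variance_fp32 = 4   # Adam second moment (fp32)

bytes_per_param = (params_fp16 + grads_fp16
                   + master_fp32 + momentum_fp32 + variance_fp32)
print(bytes_per_param)                 # 16 bytes per parameter

# For a hypothetical 7B-parameter model, that's 112 GB of state per GPU
# (before activations), which is why this scaling is called poor:
print(bytes_per_param * 7e9 / 1e9)     # 112.0
```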
[00:18:43] ZeRO stands for Zero Redundancy Optimizer; this was a set of techniques released by Microsoft as part of their DeepSpeed project. And the idea is going to be that, instead of having every GPU contain all of this state (by the state I mean the stuff in blue, the stuff in orange, and the stuff in green), you're going to shard it, so that not every GPU has all of the parameters or all of the gradient, but by communication they can synchronize. So that's pretty much what the sketch for this is going to look like. So let's look at stage one; ZeRO has multiple stages, stage one, two, and three. In stage one we're going to shard the stuff in green, which was the optimizer state.
[00:19:39] And the way we're going to shard and still maintain synchronization is something like this: every GPU has the full set of parameters in fp16, and every GPU has its gradient for its data, but it only has a sharded copy of the full optimizer state. And the other requirement is that every GPU is responsible for updating the parameters corresponding to its own shard. So if you go step by step, this is what it looks like: every GPU has its own data, and every GPU gets a gradient on its subset of the data. Then we perform a reduce-scatter; this is the second MPI operation of the lecture (we've done all-reduce, and this second one is called reduce-scatter). What a reduce-scatter does is: every GPU has the full gradient on its data, and you want to communicate each chunk of that gradient to the GPU that owns that chunk.
[00:20:42] So let's say you're GPU 0 and you've computed the full gradient for all the parameters: you want to communicate the chunk for GPU 1 to GPU 1, and the same for GPUs 2 and 3. So from the full gradient, you communicate just the bits that a different worker wants to that worker, and every GPU has to do that; that's called a reduce-scatter. And then, once every worker gets the gradient corresponding to its shard, it's going to update its parameters, and once they have updated their shard, they're going to perform an all-gather. What that means is: let's say you have a neural network with just eight parameters, two parameters on each GPU. At the end of this, each GPU has updated its subset of parameters, and then they do an all-gather to maintain synchronization, so every GPU gets the full set of updated parameters.
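The eight-parameter example can be walked through in a toy single-process sketch of one ZeRO stage-1 style step, with four simulated GPUs and plain SGD at lr = 1 standing in for Adam to keep it short; all the numbers are invented.

```python
# Toy ZeRO stage-1 step: 4 simulated GPUs, 8 parameters, 2-parameter shards.
NUM_GPUS, SHARD = 4, 2
params = [float(i) for i in range(NUM_GPUS * SHARD)]   # synchronized on every GPU
per_gpu_grads = [[float(g)] * len(params) for g in range(NUM_GPUS)]  # per-GPU grads

# Reduce-scatter: GPU k receives the summed gradient for its own shard only.
shard_grads = [
    [sum(g[i] for g in per_gpu_grads) for i in range(k * SHARD, (k + 1) * SHARD)]
    for k in range(NUM_GPUS)
]

# Local update: GPU k updates only the parameters in its shard (SGD, lr = 1).
shards = [
    [params[k * SHARD + j] - shard_grads[k][j] for j in range(SHARD)]
    for k in range(NUM_GPUS)
]

# All-gather: every GPU reassembles the full updated parameter vector.
updated = [p for shard in shards for p in shard]
print(updated)   # [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0]
```

Each GPU only ever held optimizer-side state for its own two parameters, yet everyone ends the step with the same fully updated model.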
[00:21:44] Yeah? "If every GPU is maintaining all of this, and you're not merging them together, what makes this more efficient?" Um, sorry, could you repeat your question? "Can you go over why this is better than the previous approach?" Right, so what we're going to do is shard the optimizer state. So let's say, in a running example, we have a neural network with eight parameters. Earlier, we needed the optimizer state for all of the eight parameters on every GPU; now every GPU has to maintain optimizer state for only two parameters. So after the reduce-scatters are done, you have the full gradient corresponding to just two parameters, so the optimizer state is just for those two parameters, and the model is going to update only two parameters using the partial optimizer state.
[00:22:48] But you have to have the entire set of parameters to run, so you'll eventually get the rest of the parameters back. So you have the entire set of parameters, you have all the stuff in blue, and you have the full gradient for your subset, but you don't have the full optimizer state, so what you can do is update only the parameters for the bits of optimizer state you have. So in the running example that I just made up, GPU 0 updates two parameters, GPU 1 updates two parameters, and so on, and then they communicate the updated parameters to maintain synchronization. More questions about this? [00:23:32] Okay, so let's keep going. So far we have looked at three MPI operations: we looked at all-gather, we looked at reduce-scatter, and we looked at all-reduce. And it turns out that all-reduce is actually equivalent to running a reduce-scatter followed by an all-gather operation.
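The equivalence is easy to check numerically in a toy setting (two simulated GPUs, four parameters, sum as the reduction):

```python
# Numerically check: all-reduce == reduce-scatter followed by all-gather.
def all_reduce(grads):
    total = [sum(g[i] for g in grads) for i in range(len(grads[0]))]
    return [list(total) for _ in grads]           # everyone gets the full sum

def reduce_scatter(grads):
    n_gpus, chunk = len(grads), len(grads[0]) // len(grads)
    return [[sum(g[i] for g in grads) for i in range(k * chunk, (k + 1) * chunk)]
            for k in range(n_gpus)]               # GPU k gets summed chunk k

def all_gather(shards):
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]           # everyone gets the concatenation

grads = [[1, 2, 3, 4], [5, 6, 7, 8]]              # 2 simulated GPUs, 4 parameters
print(reduce_scatter(grads))                      # [[6, 8], [10, 12]]
assert all_reduce(grads) == all_gather(reduce_scatter(grads))
```

This is why sharding the optimizer state costs no extra communication: the one all-reduce DDP already performs is simply split into its two halves, with the local shard update happening in between.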
Recall that for DDP, all we had to do was this one all-reduce operation, and we computed its communication overhead. It turns out that when you do this optimizer-state sharding, you pay exactly the same amount of communication overhead, precisely because an all-reduce is equivalent to a reduce-scatter followed by an all-gather. So we basically saved memory for free, and you should just always use this: you get memory savings and no additional communication overhead.

[00:24:40] Okay, so we're happy, we saved memory, and now we want to shard even more things. Let's start doing ZeRO stage two: along with sharding the stuff in green, which was my optimizer state, I'm also going to shard the gradients. And now this is going to be a
little bit more complex, because we still need the full gradient for each worker's data slice, but each GPU only has enough memory to instantiate the gradient for a small subset of the parameters. So how are we going to deal with that?

We're actually never going to instantiate the full gradient vector. Whenever a GPU computes a gradient in the backward pass, it instantiates a buffer temporarily for the parameters it just got a gradient for, sends it to the right worker, and then destroys the memory it just created. That's the sketch; let's go through it step by step.

[00:25:42] So we have four workers. Each worker performs a backward pass, and the backward pass happens layer by layer. Recall the lecture on autodiff: you have the loss, and then
you have this backward pass where, layer by layer, you compute gradients. Now let's say you're at layer j: you take the upstream gradient and compute the gradient for the parameters at layer j. Immediately, the moment you compute those gradients, you send them to the right worker. There exists some worker that is responsible for layer j, and every GPU that has just computed the gradient of layer j for its data slice sends it to that worker. The moment you've done that, you deallocate the memory you just created. This is technically a fourth MPI operation, but it's really not very different from a reduce-scatter; it's just a reduce: four GPUs have a gradient, and they communicate it to whoever is responsible for maintaining the gradient for that
layer.

So there exists some worker responsible for a given layer, and it updates its parameter shard using the full gradient it received via this communication, along with its optimizer-state shard. Then, at the end, to synchronize everything, you perform an all-gather as before. Any questions about this high-level sketch?

[00:27:34] Okay, let's keep moving. Recall that ZeRO stage one was basically free, because an all-reduce is equivalent to a reduce-scatter plus an all-gather. We're doing essentially the same thing here, a reduce followed by an all-gather, so this is practically also free. We've gotten away with saving memory without any communication overhead compared to DDP so far.
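The per-layer flow above can be sketched in a few lines of pure Python (a hypothetical setup: four workers, one layer owned by each worker, made-up gradient values; the "reduce" is modeled as the owner accumulating a sum):

```python
# ZeRO stage-2 backward pass, sketched: the full gradient is never
# materialized; each layer's gradient lives in a temporary buffer, is
# reduced to the worker that owns that layer, and is then freed.

NUM_WORKERS = 4
LAYERS = [0, 1, 2, 3]
OWNER = {layer: layer % NUM_WORKERS for layer in LAYERS}  # who keeps each grad
LR = 0.1

params = {layer: 1.0 for layer in LAYERS}   # stage 2 still replicates params

def local_layer_grad(worker, layer):
    """Stand-in for backprop on this worker's data slice (made-up values)."""
    return 0.5 * (worker + 1)

owned_grad = {}
for layer in reversed(LAYERS):              # backward pass, layer by layer
    for worker in range(NUM_WORKERS):
        g = local_layer_grad(worker, layer)                  # temporary buffer
        owned_grad[layer] = owned_grad.get(layer, 0.0) + g   # "reduce" to owner
        del g                                # sender frees the buffer right away

# the owner of each layer applies the update to its shard (averaging over
# the data-parallel workers), and an all-gather would then resync everyone
for layer, g in owned_grad.items():
    params[layer] -= LR * g / NUM_WORKERS
```

The point of the `del` is the memory profile: at any moment a worker holds gradient buffers for only one layer, not the whole model.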
So let's keep going and try to shard even more things. I think someone in the audience alluded to this early on: what happens if you shard even your model parameters? Say you run into a situation where, forget about the optimizer state, even your model wouldn't fit on a single GPU. In that case you split your model up across all the different GPUs, so you shard the model parameters, the stuff in blue. The caveat is that now we're not going to get the memory savings for free; there's going to be some communication overhead.

[00:28:47] This is ZeRO stage three, the final stage of ZeRO, also known as FSDP, fully sharded data parallel, for anyone who's heard that term before. Here's the high-level sketch, and I feel like this is
kind of the easiest to understand compared to ZeRO stages one and two, just because there needs to be communication at every step of the way; you can't get away without communicating. The first thing we do is take our model and convert the entire model into FSDP units. Here's a sketch: a simple deep neural network, converted into multiple FSDP units, three of them here. An FSDP unit is just a data structure; we've not done anything so far. Then I take an FSDP unit and convert it into another data structure called a flat parameter, and assign a subset of those parameters to every single GPU. So here we have 16 GPUs and a flat parameter consisting of 14 parameters plus some extra padding so that things divide properly, and I'm
going to assign each parameter to a distinct GPU. That's basically just a complex way of saying that we created some data structures and divided the model parameters up across the GPUs, so every GPU gets a subset of the model parameters.

[00:30:26] Now let's think about what the forward pass looks like. There's no GPU that has the full set of parameters. So you're running a forward pass and, let's say, you're at layer 4: no GPU has all of layer 4, so you have to communicate. We need an all-gather, the operation we used to accumulate pieces that live on multiple GPUs so that every GPU ends up with the full thing. So you perform an all-gather to get all the pieces of layer 4, and you run the forward pass. And now you don't need layer 4 anymore, so you now
discard those gathered parameter shards.

[00:31:07] Then you have to run your backward pass: you've computed your loss, and now you do the backward pass. Again, say you're back at layer 4 with your upstream gradient. You don't have layer 4, so you need another all-gather to get all of its parameters. Then you run the backward pass for layer 4 and compute the gradient for your subset of parameters. Recall that every GPU has different data points, so there's going to be a different gradient on every GPU. So for layer 4 you do an all-gather to get all the parameters, compute the gradient, and, since every GPU has different gradients, you then do a reduce-scatter so that the full, summed gradient for each chunk of layer 4 lands on the GPU responsible for that chunk.
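Here is an illustrative pure-Python walk-through of that gather-use-discard pattern (all names, sizes, and gradient values are made up; a real FSDP run does this with NCCL collectives over flat parameters):

```python
# FSDP forward/backward sketch: each GPU permanently owns one shard of every
# layer; the full layer exists only transiently, between an all-gather and
# the discard that follows it.

NUM_GPUS = 4
LAYERS = ["layer1", "layer2", "layer3", "layer4"]

# each GPU owns one element of every layer's flat parameter
shards = {L: [float(i) for i in range(NUM_GPUS)] for L in LAYERS}

def all_gather(layer):
    """Materialize the full layer from all GPUs' shards."""
    return list(shards[layer])

def reduce_scatter(per_gpu_grads):
    """Sum the per-GPU gradients elementwise; GPU i keeps only chunk i."""
    summed = [sum(col) for col in zip(*per_gpu_grads)]
    return [summed[i:i + 1] for i in range(NUM_GPUS)]

# forward pass: gather each layer, run it, immediately free the gather
for layer in LAYERS:
    full = all_gather(layer)   # every GPU briefly holds the whole layer
    del full                   # ...run the layer, then discard the copies

# backward pass: gather again, compute per-GPU grads, reduce-scatter them
for layer in reversed(LAYERS):
    full = all_gather(layer)
    per_gpu_grads = [[0.1 * (g + 1)] * NUM_GPUS for g in range(NUM_GPUS)]
    grad_shards = reduce_scatter(per_gpu_grads)  # owner gets its chunk's sum
    del full
```

Each GPU then updates only its own shard with the gradient chunk it received, which is the step the lecture describes next.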
Okay, so that's basically full FSDP. Once you've run the forward and backward pass, each GPU updates its own parameter shard using the full gradient it just received, and then you synchronize.

[00:32:23] Right, so let's do a quick review of everything we've looked at so far. There was DDP, where you don't shard anything: you have the full model, the full gradients, and the full optimizer state on every single GPU, and all you divide up is the data set. So you have a big data set of 1,000 examples, and every GPU gets 250 examples. Then you compute a forward and a backward pass; every GPU has a different gradient, you need to communicate that gradient, and then you synchronize. That was called an all-reduce operation in MPI terms. And then we looked at
ZeRO, where now we want to save memory: we don't want the full memory requirements of the model, the gradients, and the optimizer state on every single GPU. In ZeRO stage one we sharded the optimizer state, so that you don't have to maintain the full optimizer state on every GPU; you break it up across all the GPUs you have. And we saw that the communication overhead of maintaining synchronization in ZeRO stage one boiled down to just doing an all-reduce, via the identity that says an all-reduce is a reduce-scatter plus an all-gather. So we save memory for free with ZeRO stages one and two, and you should just do it. Then with ZeRO stage three things got a little more complex, because you have to divide up your model parameters, the optimizer state, and the
gradients. So while you're running your forward pass you have to do some communication to get the full parameters for any given layer, layer 4 in our example; you also have to do an all-gather in the backward pass to get the full parameters again, and then a reduce-scatter so that the full gradient for each chunk of the parameters goes to the right GPU. Overall that's two all-gathers plus a reduce-scatter, so that's more overhead than stages one and two. But if you don't have enough GPU VRAM to even load your model onto a GPU, this is what you have to do. Any questions about the MPI primitives, the stages of ZeRO, or FSDP?
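The overhead comparison in this recap can be made concrete. Under the standard ring-collective cost model (not stated explicitly in the lecture, so take it as an assumption), an all-gather and a reduce-scatter each move about (N-1)/N times the parameter count per GPU, and an all-reduce is the two combined:

```python
# Communication volume per GPU per step, counting collective operations as
# multiples of one ring all-gather/reduce-scatter pass. Illustrative only.

def comm_per_gpu(n_gpus, n_params, strategy):
    unit = (n_gpus - 1) / n_gpus * n_params  # one all-gather or reduce-scatter
    ops = {
        "ddp":   2,  # one all-reduce = reduce-scatter + all-gather
        "zero1": 2,  # reduce-scatter + all-gather: same as DDP
        "zero2": 2,  # per-layer reduces + all-gather: same total
        "zero3": 3,  # two all-gathers + one reduce-scatter
    }
    return ops[strategy] * unit

P, N = 7_000_000_000, 8
assert comm_per_gpu(N, P, "zero1") == comm_per_gpu(N, P, "ddp")
assert comm_per_gpu(N, P, "zero3") == 1.5 * comm_per_gpu(N, P, "ddp")
```

So stages one and two match DDP exactly, while stage three costs roughly 1.5x the communication, which is the "lot more overhead" the recap refers to.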
[00:34:59] Okay, cool. So I'm going to fix the lie I told earlier about the GPU VRAM calculation. I said that VRAM holds just the model parameters, the gradients, and the optimizer state, but there's one final thing: the model activations. We've all seen that as you keep increasing the batch size, there's a point where the GPU says it can't fit any more, and that's because you also need to store the model activations for the backward pass. That scales linearly with the batch size: the larger the batch size, the more activations need to be stored. By the way, if you're doing mixed precision these are in fp16 or bf16, but they still scale with the batch size. So that's the other thing you have to think about, and none of the techniques we've looked at so far help with sharding the model activations.
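A back-of-the-envelope version of the corrected VRAM accounting might look like this; the per-parameter byte counts (fp16 weights and gradients, fp32 master weights plus two optimizer moments) and the activation size per example are assumptions for illustration, not numbers given in the lecture:

```python
# Rough VRAM estimate for mixed-precision training with an Adam-style
# optimizer: parameters + gradients + optimizer state + activations.
# Only the activations term grows with the batch size.

def vram_gib(n_params, batch_size, act_bytes_per_example):
    weights = 2 * n_params      # fp16/bf16 working copy of the model
    grads = 2 * n_params        # fp16/bf16 gradients
    optimizer = 12 * n_params   # fp32 master weights + two Adam moments
    activations = batch_size * act_bytes_per_example  # linear in batch size
    return (weights + grads + optimizer + activations) / 2**30

# a hypothetical 7B-parameter model with ~2 GiB of activations per example:
base = vram_gib(7e9, 0, 0)                 # fixed cost before any activations
with_batch = vram_gib(7e9, 4, 2 * 2**30)   # adds 8 GiB for a batch of 4
```

The fixed cost already lands around 104 GiB here, which is why the sharding techniques above matter before you ever touch the batch size.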
[00:36:05] Okay, so we've looked at a bunch of the basics of multi-GPU training and floating point, and it all boils down to a very simple flowchart which you can use for your final projects when you're fine-tuning models. The first thing: always use mixed-precision training. You barely ever see a hit in performance, and by performance I mean generalization, F1, accuracy. And if you're using the newer architectures, the H100s, the A100s, or the A6000s, always use bfloat16; it's just better, and you can check support with the torch command on the slide. So: always use mixed-precision training. Now ask yourself this question: does batch size one fit on a single GPU? If it fits, try a larger batch size; batch size one is too small. So try a larger batch size and/or use ZeRO stage two. ZeRO stage two is free, so just use
ZeRO stage two and increase your batch size if you can. If you can't fit even batch size one, then you have to see whether ZeRO stage three fixes your out-of-memory issues, because now you're going to shard the model parameters as well. And all of this is in the context of full fine-tuning, where I'm fine-tuning all of my model parameters.

[00:37:35] Sometimes the answer to that question is also no: you can't full fine-tune your model on your four A100s or A6000s or whatever; you've tried ZeRO stage three, you've tried mixed-precision training, you have a batch size of one, maybe you did gradient checkpointing (activation checkpointing), and nothing works. So now you basically can't do full fine-tuning, and the thing to do is to try parameter-efficient fine-tuning, which is going to give you a lot more memory savings.
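The flowchart can be written down as a small helper; the function and its inputs are hypothetical (you'd determine the two booleans empirically on your own hardware), but the branching mirrors the decision procedure just described:

```python
# Fine-tuning strategy flowchart: mixed precision always, then escalate
# through ZeRO stages, falling back to parameter-efficient fine-tuning.

def pick_strategy(fits_batch_one, fits_with_zero3):
    plan = ["mixed-precision training (bf16 on newer GPUs)"]  # always step one
    if fits_batch_one:
        plan.append("increase batch size and/or ZeRO stage 2 (free)")
    elif fits_with_zero3:
        plan.append("ZeRO stage 3 / FSDP (extra communication cost)")
    else:
        plan.append("parameter-efficient fine-tuning")
    return plan
```

Note that ZeRO stage 2 appears on the happy path precisely because, as shown earlier, it saves memory at no extra communication cost.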
[00:38:12] Okay, so let's talk about parameter-efficient fine-tuning. Why is it called parameter-efficient? In full fine-tuning you run a forward pass and a backward pass and update every single model parameter; in parameter-efficient fine-tuning you only update a small subset of the full set of parameters. And why would you want to do that? Maybe you're in a setting where you cannot full fine-tune even with batch size one; you've tried all the tricks possible and it just won't fit, so you have to do parameter-efficient fine-tuning. The other possible reason is slightly more scientific: these models are heavily overparameterized these days, and you have a small data set, and you believe that if you do parameter-efficient fine-tuning
then you can get a better like [00:39:16] uh then you can get a better like generalization [00:39:17] generalization okay um or you believe that you know [00:39:21] okay um or you believe that you know it's going to match for fine tuning okay [00:39:24] it's going to match for fine tuning okay uh sort of a second reason for wanting [00:39:26] uh sort of a second reason for wanting to do f [00:39:28] to do f adaptation um so uh the plot on the [00:39:32] adaptation um so uh the plot on the right here shows uh in in red it's sort [00:39:35] right here shows uh in in red it's sort of the estimated growth uh in training [00:39:38] of the estimated growth uh in training compute for training the largest AI [00:39:40] compute for training the largest AI models and uh the line in blue is the [00:39:44] models and uh the line in blue is the global compute capacity so very soon we [00:39:47] global compute capacity so very soon we are going to overshoot the global [00:39:49] are going to overshoot the global compute capacity and going to need a lot [00:39:51] compute capacity and going to need a lot more compute than you know the global [00:39:54] more compute than you know the global capacity and so this is kind of not [00:39:56] capacity and so this is kind of not sustainable [00:39:58] sustainable um and you know there's there there are [00:40:00] um and you know there's there there are arguments to be made about how uh if we [00:40:03] arguments to be made about how uh if we keep going down this route then you know [00:40:06] keep going down this route then you know AI development becomes concentrated in [00:40:09] AI development becomes concentrated in only the hands of a few well-funded [00:40:11] only the hands of a few well-funded organizations and you know as students [00:40:13] organizations and you know as students we can't do it um and so that's a [00:40:18] we can't do it um and so that's a problem and then also like if there's [00:40:20] problem and then also 
only a small number of players training and fine-tuning models, they may bias the models in specific ways that reflect their value systems and not the broader public's. So that's another reason to think about efficient adaptation. There's also this paradigm in machine learning in general, and in NLP specifically, of focusing a lot on accuracy instead of efficiency. The plot on the right shows the percentage of papers whose main contribution is a method that produces more accurate models, versus methods that achieve the same accuracy more efficiently. We can see that for most conferences the vast majority of papers are about accuracy; there are very few papers about efficiency. So maybe this is leading to a kind of monoculture, and maybe that's why we
[00:41:25] So maybe this is leading to a kind of monoculture, and maybe that's why we want to focus on efficiency. The second, maybe bigger, concern is that there's a huge hidden environmental cost to training and fine-tuning large language models. I was just reading some report which said that the cost of training GPT-3 was equivalent to 1.1 million tons of carbon emissions, or some such number, and they estimated that's the cost of running a coal power plant for 10 hours straight. And for an example closer to home: in the reinforcement learning class there was a homework assignment, and a lot of students implemented one or two common algorithms that outperformed everything else but used a lot more power.
[00:42:27] And someone did the calculation that if everyone had used the more efficient algorithm, it would have reduced the power consumption of the class by about 880 kilowatt-hours, which is what an American household uses in a month. So these are all reasons to think about efficiency and how you can fine-tune your models with fewer resources. Okay, so let's jump back into parameter-efficient fine-tuning, and let's start by recapping what full fine-tuning is. Any questions so far about any of this? Okay, so let's recap full fine-tuning. Let's say we have some large pre-trained autoregressive language model, say a GPT, and maybe we want to use it for summarization,
[00:43:31] maybe we want it for semantic parsing, so converting natural language to SQL commands, or maybe we want it to answer questions about paragraphs. What do we do? We collect a dataset of (x, y) pairs and then we do full fine-tuning: we update all of the model parameters based on the gradient of some loss function. And maybe that's not feasible. GPT-3 has 175 billion parameters, so there are just a lot more parameters to learn, and even once you've done full fine-tuning you have to store all of the parameters; if you're doing several tasks, you have to store parameters for every task. So can we do better? The main idea is that instead of updating all of the parameters, I'm going to update a much smaller number of parameters.
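As a toy illustration of the full fine-tuning recap above (my own sketch, not the lecture's code), here every parameter of a tiny linear model is updated from the gradient of a mean squared-error loss over a batch of (x, y) pairs:

```python
# Toy full fine-tuning step: every parameter of y = w*x + b is updated
# from the gradient of a mean squared-error loss on (x, y) pairs.

def full_finetune_step(params, batch, lr=0.1):
    """One gradient step that updates *all* parameters, as in full fine-tuning."""
    grads = {"w": 0.0, "b": 0.0}
    for x, y in batch:
        err = params["w"] * x + params["b"] - y
        grads["w"] += 2 * err * x / len(batch)
        grads["b"] += 2 * err / len(batch)
    return {k: params[k] - lr * grads[k] for k in params}  # every parameter moves

params = {"w": 0.0, "b": 0.0}
for _ in range(300):
    params = full_finetune_step(params, [(1.0, 3.0), (2.0, 5.0)])
# params approaches w = 2, b = 1 (the line through both points)
```

Full fine-tuning of a real LM is this same loop with billions of parameters, which is exactly what makes the compute and per-task storage costs bite.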
[00:44:39] Then, instead of finding a delta theta that is the same size as the entire set of parameters, I have to search over a much smaller space. And the added benefit is that I can store this much smaller delta pretty easily on disk, hopefully it's going to require less compute, and hopefully it's going to generalize almost as well as full fine-tuning. There are many different ways of operationalizing this high-level idea of parameter-efficient fine-tuning. The one I'm going to talk about today is LoRA, which stands for low-rank adaptation. It basically comes from the observation that when you fine-tune big language models, if you look at the geometric structure of the gradients, they tend to have a low intrinsic rank. Do people remember rank and SVD? All right.
[00:45:51] So these gradients tend to have a low intrinsic rank, and what the authors realized is that instead of fine-tuning the entire set of parameters, you could instead fine-tune a much smaller, say rank-r, matrix for every full-rank matrix that exists in the model. So let's say we have some pre-trained weight matrix W0, a d x k matrix. Instead of applying some kind of arbitrary update, I'm going to make sure the update has the following form: it's the product of two low-rank matrices B and A, where A is an r x k matrix and B is a d x r matrix, and the rank r is much, much smaller than both the incoming dimension and the outgoing dimension. And the term alpha you can think of as
[00:47:05] a trade-off between the knowledge that's already stored in the pre-trained model and the additional knowledge that you want to add into the model. If alpha is zero, you're not doing anything; if alpha is something really small, you don't want to change your model parameters all that much and you only want to add some small amount of task-specific knowledge. And additionally, the only trainable parameters here are going to be A and B. The other thing to note is that since I'm representing updates as this product B times A, as I increase r this converges towards full fine-tuning, so you essentially have a slider you can use to control how much fine-tuning you want to do. And then the other important thing is inference latency.
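To make the shapes and the savings concrete, here's a small pure-Python sketch (toy dimensions of my choosing) of the update delta_W = alpha * (B @ A). B is initialized to zero, which is how LoRA is typically initialized, so the update starts at exactly zero:

```python
# Shapes of the LoRA update delta_W = alpha * (B @ A):
# A is r x k, B is d x r, and r << min(d, k).

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, k, r, alpha = 8, 8, 2, 1.0
A = [[1.0] * k for _ in range(r)]   # r x k, trainable
B = [[0.0] * r for _ in range(d)]   # d x r, trainable, zero-initialized
delta_W = [[alpha * v for v in row] for row in matmul(B, A)]  # d x k, all zeros at init

full_params = d * k        # what full fine-tuning would train here: 64
lora_params = r * (d + k)  # what LoRA trains here: 32
```

With these toy dimensions the savings is only 2x, but at realistic sizes (say d = k = 4096, r = 8) r(d + k) is roughly 65 thousand parameters against nearly 17 million for the full matrix.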
[00:48:07] What you can do is just store these learned matrices for every task, and whenever you switch to a different task, you remove the extra term you added to every matrix for the old task and add in the task-specific terms for the new task you want to run inference on. And the cost of storing these much smaller matrices is way lower than storing the full delta. We'll see where you should apply LoRA, but generally you want to apply it to the weight matrices in self-attention. In code it actually looks fairly simple: when you're running the regular forward pass, you compute the hidden state as, let's say, the product of the matrix and the incoming feature vector.
[00:49:09] Now with LoRA, what you're going to do is freeze your model parameters, compute h as before, and then add this additional offset term, and that's the only thing that's going to be trainable. And that's pretty much all you have to do; you do it for every single weight matrix in every single layer. A question: there's an alpha term in the second-to-last line; where do you define alpha, or do you just put it somewhere? So yes, you define alpha somewhere. If you set it to one, that's like saying I want an equal trade-off between pre-trained knowledge and the new task-specific knowledge. Typically people set it to one.
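The forward pass just described can be sketched in a few lines of toy pure-Python (illustrative, not the lecture's actual code): h comes from the frozen weight W0 as usual, and a trainable low-rank offset alpha * B @ (A @ x) is added on top.

```python
# LoRA forward pass sketch: h = W0 @ x (frozen) + alpha * B @ (A @ x) (trainable).

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha=1.0):
    h = matvec(W0, x)                  # frozen pre-trained path
    offset = matvec(B, matvec(A, x))   # trainable low-rank path (only A, B learn)
    return [hi + alpha * oi for hi, oi in zip(h, offset)]

W0 = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 frozen weight (toy identity)
A = [[1.0, 1.0]]               # r = 1: a 1x2 matrix
B = [[0.0], [0.0]]             # 2x1, zero-initialized
```

With B at its zero init, `lora_forward(W0, A, B, x)` returns exactly `matvec(W0, x)`, i.e. the base model's output; training then only ever moves A and B.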
[00:49:59] You could set it to something larger than one if you believe your task is something the pre-trained model has no idea about, or something smaller than one if you don't want to change the model too much. Okay, so that's basically LoRA. In practice, as I said, there are a bunch of different parameter-efficient fine-tuning methods, and I'm not even going to name all of these: there are adapters, which some of you might have heard about; there's BitFit, which is not shown here; and there are lots of others, like p-tuning. But it turns out that compared to a lot of these different methods, LoRA is pretty high-performing on a bunch of different tasks for these relatively smaller models. And then if we try to fine-tune some of the bigger models like GPT-3 and compare with other parameter-efficient fine-tuning methods: full fine-tuning
[00:51:06] is at the very top; then we have BitFit, where you only fine-tune the bias terms, and adapters. Compared to those, LoRA firstly requires a lot fewer additional parameters that you need to store, and it gives you a good trade-off in accuracy compared to full fine-tuning; sometimes there's even a regularizing effect from fine-tuning only a small subset of your model parameters. Okay, so the question is: you can apply LoRA to every matrix, and I said you want to apply it to the various learned weight matrices inside self-attention, so which parameters do you want to apply LoRA to? Generally the rule of thumb is: apply it to the matrix that takes your hidden state and converts it into queries, and to the matrix that converts your hidden state into values. Apply LoRA to
[00:52:11] those, and that's pretty much going to give you the best performance overall. The other hyperparameter for LoRA is the optimal rank. Recall that the two matrices B and A are both low-rank; it turns out that already with a really small rank you can get pretty high performance, and this rank is much, much smaller than the hidden-state dimensions of most of the matrices in most models these days. All right, so we covered a bunch of things: we talked about floating points and mixed-precision training, multi-GPU training, DDP, FSDP, and LoRA. It all boils down to a very simple flowchart that you can just use for your project, so if you were sleeping through the entire lecture, maybe now is the time to wake up and look at this flowchart. Always use mixed-precision training. If you
[00:53:16] have the newer Ampere architectures, use bfloat16. Try with batch size one; if batch size one fits, try a larger batch size, and then always just use ZeRO stage 2. If batch size one doesn't fit, try ZeRO stage 3, and maybe try gradient checkpointing, that is, activation checkpointing. Sorry, there's a question: this is assuming we have more than one GPU, because it doesn't help us otherwise? Oh yes, all of this applies only if you have more than one GPU. If you have a single GPU, you have to do other things: maybe heavily quantize the model, and even then I don't think you can fine-tune some of the bigger models. So, assuming you have multiple GPUs, you can try ZeRO stage 3 if you have out-of-memory errors with a batch size of one, and if that doesn't work you can try LoRA.
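The flowchart being described can be sketched as a small decision function. This is my own paraphrase of the spoken advice; the strings are just labels, not a real API:

```python
# Decision sketch of the lecture's fine-tuning flowchart (labels, not an API).

def finetuning_recipe(ampere_gpu, batch_size_one_fits, larger_batch_fits):
    # Always use mixed precision; bfloat16 on the newer Ampere GPUs.
    steps = ["mixed precision (bf16)" if ampere_gpu else "mixed precision (fp16)"]
    if batch_size_one_fits:
        steps.append("increase batch size" if larger_batch_fits else "keep batch size 1")
        steps.append("ZeRO stage 2")
    else:
        steps.append("ZeRO stage 3")
        steps.append("gradient/activation checkpointing")
        steps.append("LoRA on q and v matrices, r=8, alpha=1")
    return steps
```

For example, `finetuning_recipe(True, False, False)` walks the out-of-memory branch all the way down to LoRA.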
alpha the rank and what uh weight [00:54:20] the alpha the rank and what uh weight matrices to apply Lura to apply that to [00:54:23] matrices to apply Lura to apply that to the query Matrix apply that to the value [00:54:25] the query Matrix apply that to the value Matrix set rank to eight okay that's a [00:54:28] Matrix set rank to eight okay that's a good starting point set Alpha to one [00:54:30] good starting point set Alpha to one okay just do that and you should be good [00:54:32] okay just do that and you should be good to go okay so you can find tun your [00:54:34] to go okay so you can find tun your models and things should be reasonably [00:54:37] models and things should be reasonably good okay so uh I'm going to end now [00:54:41] good okay so uh I'm going to end now unless there's [00:54:43] unless there's questions um oh there's one question in [00:54:46] questions um oh there's one question in the [00:54:47] the back [00:54:49] back diamides I was wondering if you just [00:54:51] diamides I was wondering if you just like go back to it and walk through it a [00:54:53] like go back to it and walk through it a little bit on step uh sorry on slide 48 [00:55:03] yeah this diagram from the last right uh [00:55:07] yeah this diagram from the last right uh okay um so let's go through this diagram [00:55:10] okay um so let's go through this diagram so basically uh what this diagram shows [00:55:13] so basically uh what this diagram shows is how the communication overhead is [00:55:16] is how the communication overhead is really not that bad if you have a fairly [00:55:19] really not that bad if you have a fairly big model such that the the time it [00:55:22] big model such that the the time it takes to do a forward pass you can [00:55:23] takes to do a forward pass you can already sort of prefetch uh all of the [00:55:26] already sort of prefetch uh all of the parameters for the next layer okay so [00:55:28] parameters for the next layer okay so that's 
[00:55:29] pretty much the idea. That's kind of a standard idea that I guess everyone should already be using, and PyTorch does this by default, by the way: you want to make sure that you fully saturate your GPU, and that you overlay communication with any additional compute you're doing. That's pretty much what's going on here, but let's go through it step by step. The starting point here is FSDP units: 0, 1, and 2 are different FSDP units. You start by wanting to run a forward pass on the first layer, but you don't have the first layer. Let's say you are GPU k; you don't have the first layer, so you have to do an all-gather to get all of the parameters for the first layer, and that's AG0. At the end of AG0, every GPU has
[00:56:31] the full set of parameters for the layers corresponding to FSDP unit 0. Let's just say that's layer zero. So you have the full parameters for layer zero, and you run a forward pass; that's the stuff in blue. And while you're running the forward pass through the first layer, you're going to be smart about communication overheads: while that runs, you prefetch the parameters for the next FSDP unit. Let's say layer two is a different FSDP unit; that's AG1. And so you can see that there's a little bit of overlap between F0 and AG1. At the end of getting all of the parameters for layer one, you're going to do a forward pass, and so on, and then you're going to do AG2. And at the same time, now let's say you just have way too
[00:57:29] many parameters on your GPU, so you're going to free up some memory; that's the stuff in yellow. And so that's how it goes: you basically overlay all-gather operations with the forward pass, and that's how you run the forward pass. So the communication overhead is really not that bad if you have a really big, deep neural network, assuming that you have sharded everything properly. And then you start the backward pass. The backward pass, I guess, is a little bit tricky, because you want to do these all-gather operations to get the full gradient. So let's say it's a 10-layer neural network: to compute the full gradient at layer 10, you need to do an all-gather operation to get
[00:58:25] all of the parameters at layer 10, and then you have to do a reduce-scatter. So you have four GPUs; every one of them has the full set of parameters at layer 10, but they have different gradients, so they have to merge their gradients and then split them up to the right GPUs, and that's the reduce-scatter. But that's not too bad, because you can still overlay reduce-scatter operations with the backward pass, and that's what you see happening in the backward pass there. And then, along with these forward and backward passes, at regular intervals you have to make sure that you free up GPU memory. For example, once you've run a forward pass through layer one and you're on to layer two, you don't need anything in layer one, so you just free up the memory in layer one.
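The forward-pass interleaving just walked through can be sketched as a toy event order (illustrative only, not FSDP's real scheduler): the all-gather for the next unit is issued before the current unit's forward compute so that communication overlaps compute, and earlier units' shards are freed as you go, with unit 0 kept resident as the lecture notes.

```python
# Toy event-order sketch of FSDP forward-pass overlap (not the real scheduler).

def fsdp_forward_schedule(num_units):
    events = ["AG0"]                       # must gather unit 0's shards before any compute
    for i in range(num_units):
        if i + 1 < num_units:
            events.append(f"AG{i + 1}")    # issue prefetch for the next unit...
        events.append(f"FWD{i}")           # ...so it overlaps this forward pass
        if i > 1:
            events.append(f"FREE{i - 1}")  # done with an earlier unit's shards (unit 0 stays)
    return events
```

For three units this yields AG0, AG1, FWD0, AG2, FWD1, FWD2, FREE1, mirroring the diagram's interleaving of all-gathers, forwards, and frees.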
That's pretty much the idea behind this diagram. [00:59:16] There are a few details here. One detail is that in FSDP, unit zero is treated differently: you'll see that unit zero is never freed up. That's just an implementation detail in FSDP. I'll quickly say one more thing about FSDP and then take a question.
[00:59:34] The presentation here makes it seem so simple, as if it can be applied to any neural network, but it turns out that's not the full picture. You need to divide up your neural network into FSDP units, and depending on what policy you use for dividing up your parameters into FSDP units, there are different communication overheads. For example, it makes sense to have multiple consecutive layers in the same FSDP unit, and so on. [01:00:17] This is very architecture-specific, so when you start to use this in PyTorch, you'll see that the FSDP wrapper requires a sharding policy, and that policy is very architecture-specific. Because everyone uses Transformers now, there are very handcrafted, fine-tuned policies for creating FSDP units and sharding strategies for Transformers. But let's say that for your final project you came up with a new architecture, sub-quadratic attention, whatever: maybe it's not going to be as efficient, just because you don't have the right sharding policy. So that's one detail about FSDP that you may want to keep in mind. [01:01:02] Okay, you have a question? "Just a clarification: you mentioned that you can throw away the weights you don't need after each layer's forward pass, but then
when you compute the backward pass, do you stream them back in each time, or do you cache some, cache recent ones? Is there any caching going on, or do you throw them all away and stream them all back?" [01:01:23] So there might be some caching in the system, but the idea is that you just throw them away, or at least to the user it seems like you've thrown it all away in terms of GPU RAM utilization, and then we stream them in again for each layer. [01:01:43] And that's why it's important to shard it properly. For example, if every consecutive layer is sharded such that it's on multiple GPUs, then you are always communicating, as opposed to doing one all-gather and then having all of the next three layers already loaded in. So that's why how you shard, and this sharding policy, becomes important.
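The trade-off behind the unit policy can be made concrete with a toy calculation (purely illustrative, not the PyTorch wrapper API): grouping more consecutive layers into one FSDP unit means fewer all-gather calls per pass, but a larger peak of materialized parameters while that unit is resident.

```python
def unit_costs(layer_sizes, layers_per_unit):
    """Group consecutive layers into FSDP units and report
    (all-gathers per forward pass, peak params materialized at once)."""
    units = [layer_sizes[i:i + layers_per_unit]
             for i in range(0, len(layer_sizes), layers_per_unit)]
    num_all_gathers = len(units)       # one all-gather per unit
    peak = max(sum(u) for u in units)  # a whole unit is resident at once
    return num_all_gathers, peak

layers = [4, 4, 4, 4, 4, 4]  # toy per-layer parameter counts (say, millions)

print(unit_costs(layers, 1))  # one layer per unit: 6 all-gathers, peak 4
print(unit_costs(layers, 3))  # three layers per unit: 2 all-gathers, peak 12
```

With one layer per unit you communicate constantly; with three layers per unit you all-gather once and the next layers are already loaded in, at the cost of triple the peak memory. Picking that balance for a new architecture is exactly what the handcrafted Transformer policies do for you.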
[01:02:16] Okay, so if there are no more questions, let's end early. Thank you so much. [Applause]

================================================================================ LECTURE 014 ================================================================================
Stanford CS224N: NLP w/ DL| Spring 2024 | Lecture 13 - Brain-Computer Interfaces, Chaofei Fan
Source: https://www.youtube.com/watch?v=tfVgHsKpRC8
---
Transcript

[00:00:05] So thanks, Chaofei, for coming; I know it's a really busy time in the quarter, and everyone's busy with the homework, the project, and midterms. Yeah, today I'm going to tell you about something I'm really passionate about, which is the speech brain-computer interface, my research. But before that, some self-introduction: I'm Chaofei, from the Stanford NPTL lab. Our lab is trying to build brain-computer interfaces to help people restore communication or restore movement. So today I'm really just going to tell you guys how cool this brain-computer interface is, given that we
have so many recent developments in AI and machine learning, and I hope you guys will enjoy this talk.
[00:01:04] All right, so let me first start with a video to give you some motivation for why we want to build a brain-computer interface. Yeah, I think what the story tells is that we saw this teenager, Howard, who was 21 at the time this video was shot, and he lost all his dreams because of a severe stroke that also left him in this kind of locked-in state where he can't move. He talked about how he used to go out and play football, make friends, and just let his emotions out; I think all of this is lost to him, and the most important thing is that he couldn't really speak, to express himself, to let all the emotions out. [00:02:00] Howard is just one of those individuals who suffer from a neurological disease or disorder, such as brainstem stroke or ALS, that can cause severe speech and motor impairment and even complete loss of speech. Life is really challenging for these individuals. Just think about it: you cannot speak, you cannot move; you still have a fully functioning brain, but everything is lost, and all your dreams could be shattered.
[00:02:32] So for people like Howard, as you just saw in the video, the way they can still communicate with the outside world, with their loved ones, is through assistive communication devices such as the one we just saw in the video: a kind of letter board that has the letters organized physically. People like Howard may still have some residual eye movement, so they can use their gaze to tell a friend where they're looking, and the friend can use the gaze to tell what letter they're trying to say. Just imagine how slow this process is: if you want to say a sentence, it might take you a few minutes to express simple things like "how are you" or "I'm not feeling comfortable today." [00:03:20] An alternative here is an eye-tracking device, so that people can use eye tracking to type on a virtual keyboard on the computer. But just think about it: if you have to look at the computer screen all the time, all day, it's really tiring. And these people are not like us; even if they still have some residual eye movement, it's very hard for them to move their eyes, so it's very tiring as well.
[00:03:54] Maybe something different here, which some of you may have already seen recently, is some videos published by a company called Neuralink. For example, here's one video;
let me see if I can play it; hopefully I can. [00:04:13] All right, so this company, Neuralink, is developing a kind of tiny implantable device that can actually be placed inside your skull and then read the brain signals. [00:04:31] The hope here is that, because for people like Howard the brain is still fully functioning, maybe by using this kind of direct interface with the brain they can still use their intact brain to control a computer, or even robots, to help them live a normal life. And here is a quote from their participant, Noland, who is pretty excited about being able to use this very state-of-the-art BCI to connect with his family and to be able to support himself. What I'm trying to say here is that for people like Howard, for a lot of people who have lost control of their body and language, I think BCI can bring hope. [00:05:26] That's what I'm going to motivate today: we're trying to use BCI to really help these people. But before going into the details of how this works, I first want to go through a brief history of the brain-computer interface, just to help you understand how this thing works: why we can put such tiny devices into the brain and then suddenly interpret what the brain is doing. There are a lot of interesting stories here, so let me start with a brief history of BCI.
[00:05:57] First, back in the 19th century, a British scientist called Richard Caton started to do some experiments on animals, and one of the things he found is that you can actually measure brain
activity; you can measure electrical signals from the brain. Moreover, if you let the animal do some task, say moving its head, then you can see that the electrical signals change somehow. I think these are the very first early experiments showing that you can actually decode some signals from the brain, even though we still didn't know exactly what those electric signals mean.
[00:06:46] Fast forward to 1924: a German scientist called Hans Berger invented this device called the... yeah, I always forget how to read that word, but anyway, it's short for EEG, the electroencephalograph. It's basically, on the right you can see it, a kind of electrode that you can place on the outside, basically on your scalp, and then measure these wave-like signals. [00:07:17] What Berger found is, first, that you can actually measure these wave-like signals just from electrodes placed on the head, and then he found that these signals have very different wave frequencies depending on the state of the patient. For example, if the patient is in a very calm state, the brain generates these slow alpha waves, around 10 Hz or so; I forget the exact range. But if the patient opens their eyes and is doing some cognitive task, then you'll see really sharp beta waves. So he was the first scientist to discover that you can actually use this kind of electrode to measure the electrical signals of the brain.
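The alpha-versus-beta distinction is, at heart, a frequency-band measurement. A toy sketch of that idea (synthetic signals, approximate band edges of roughly 8-13 Hz for alpha and 13-30 Hz for beta; not a clinical method) could look like:

```python
import numpy as np

FS = 250  # sampling rate in Hz (a typical EEG rate)

def band_power(signal, lo, hi, fs=FS):
    """Total spectral power of `signal` between lo and hi Hz."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    return spectrum[(freqs >= lo) & (freqs < hi)].sum()

def dominant_state(signal):
    # Approximate band edges: alpha ~8-13 Hz, beta ~13-30 Hz.
    alpha = band_power(signal, 8, 13)
    beta = band_power(signal, 13, 30)
    return "calm (alpha)" if alpha > beta else "task (beta)"

t = np.arange(0, 2, 1.0 / FS)      # two seconds of samples
calm = np.sin(2 * np.pi * 10 * t)  # fake 10 Hz "alpha" rhythm
task = np.sin(2 * np.pi * 20 * t)  # fake 20 Hz "beta" rhythm

print(dominant_state(calm))  # calm (alpha)
print(dominant_state(task))  # task (beta)
```

Real scalp EEG is far noisier than these pure sine waves, which is exactly the resolution problem discussed a little later in the lecture.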
[00:08:12] There's also a funny story here. Berger used to be a soldier, and one day he was training on a horse, fell from the horse, and suffered a concussion. He also had a twin sister, and the story is that on that same day his sister felt that something was wrong and started to worry about her brother, so she had their father send a telegram asking whether her brother was okay. This really intrigued Berger: maybe there is something called telepathy that can connect two people through some kind of brain wave. That was his motivation to start studying psychology and neuroscience, and to invent the EEG, which we are still using today to diagnose things like epilepsy.
[00:09:13] Then people started using these EEG devices to perform: since we can somehow detect these wave-like signals from the brain, and we can also control the frequency of the waves, a musician here started using an EEG device to perform music. Anyway, I guess you already got the idea: someone is trying to perform music with his brain waves. I think this is a really cool experiment; it was done, I think, in the 1950s, and you can already see that people were starting to get the idea that you can actually bypass your body, directly connect your brain to some external device, and control that device. [00:10:03] So the idea here is: what if we can also
leverage the same idea, but to help people like Howard? You could maybe help them control a robotic arm. [00:10:15] But the problem with this kind of EEG, or any external measuring device, is that the signal you get is very weak. Just think about it: you probably know that the brain has a lot of neurons, and the neurons are generating a lot of signals. If you just put some electrodes on the scalp, what you are actually measuring is the average firing of maybe millions of neurons. As an analogy, it's like trying to hear what people are saying in the room next to us: what we hear is the mumbling of a lot of voices, and we can probably tell that maybe they're in a happy mood, or maybe they have reached a conclusion, but not exactly what they are trying to say. [00:11:10] So the limitation here is that this kind of EEG device can only give us a very low-precision, low-resolution signal. We want to get a better signal, and I think the answer is to go inside the brain, put electrodes next to a neuron, and try to directly measure the neural activity of those neurons.
[00:11:34] For the purposes of this talk, we are mostly going to focus on the neurons in a region of the brain called the motor cortex. As some of you may already know, the brain has different regions doing different tasks, and in the center of the brain there is the motor cortex, which basically controls all your muscles, your body muscles.
The hope here is that if we can understand the neural coding, the information that is encoded by the neurons here, then perhaps we can decode this information and use it to help people like Howard to be able to control an external arm, or to be able to speak again.
[00:12:21] So here is some very basic neuroscience. We know that there is a kind of cell called the neuron; each one of these things is a neuron. This is the body of the neuron, called the soma, and this is the axon; this is another neuron. Neurons connect through a tiny structure called a synapse. If a neuron wants to transfer some information to another neuron, just as in an artificial neural network, where you have some neurons and want to send information to the next layer, the neuron will generate an action potential, which is just some electricity, to signal to another neuron that there is some information there. [00:13:11] If you put a tiny electrode on, say, the axon of this neuron and measure the membrane potential, what you will get is something like this: on the x-axis is time, on the y-axis is the measured electric potential, and you will see these very sharp spikes. If you zoom in on the spikes, you will see the typical firing signature of the neuron, where the voltage suddenly goes up and then comes back down. So basically, what you can measure at the neuron is these very sharp spikes; that's what you get by putting an electrode next to a neuron. [00:13:59] Okay, so how do we figure out what kind of information is encoded in what we call a spike
[00:14:10] We can perform some behavioral tasks. For example, suppose we're listening to a single neuron, and let's say we're using a monkey for this experiment. We train the monkey to do two things: we instruct it to move its hand either to the left or to the right, and then we measure the spiking of that single neuron and try to work out what kind of information it encodes. What you see here is that each row is a spike train of that neuron; as you just saw, each vertical line is a spike. Each row is a trial, and a trial means the monkey is trying to move its hand in one direction.
[00:15:10] You can see that the neuron seems to fire slightly differently across trials. That's one fundamental property of neurons: they are very noisy. It's not like an artificial neural network, where if you put something in you always get the same thing out; in a real neural network things are really noisy, so sometimes the neuron fires a little faster and sometimes a little slower under the same experimental conditions.
[00:15:40] What we're trying to measure here is what kind of information this neuron encodes when the monkey moves its limb to the left or to the right. We can also split the encoding into two phases: preparation and execution. Execution means the monkey is actually moving its arm, whereas preparation means the monkey is preparing to move but holding its arm fixed; it actually moves at this "go" time here.
[00:16:25] What you can see is that this neuron seems to fire a lot during execution when the monkey's hand is moving to the right, and it also fires a little more when the monkey is preparing to move to the left. So maybe the neuron is encoding movement direction.
[00:16:52] If you repeat this experiment for many different neurons and many different directions, what scientists eventually found is that for a single neuron, if you fit its firing rate (basically how many spikes it generates every second) against the different movement directions, you can fit a cosine tuning curve to it.
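Before getting to the tuning curve itself, the trial-to-trial noisiness mentioned above is easy to see in simulation: repeating identical "trials" of a noisy model neuron gives a different spike count every time, but trial averaging still reveals a direction preference. The 40 Hz and 10 Hz rates here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def per_trial_rates(true_rate_hz, n_trials, duration_s=1.0, dt=0.001):
    """Firing rate measured on repeated trials of one simulated neuron;
    identical conditions still yield different counts (noise)."""
    n_bins = int(duration_s / dt)
    spikes = rng.random((n_trials, n_bins)) < true_rate_hz * dt
    return spikes.sum(axis=1) / duration_s

# invented rates: this model neuron fires more for rightward movement
move_right = per_trial_rates(40.0, n_trials=20)
move_left = per_trial_rates(10.0, n_trials=20)
print("rightward trials range from", move_right.min(), "to", move_right.max())
print("trial-averaged rates:", move_right.mean(), "vs", move_left.mean())
```

Each individual rightward trial comes out different, yet the trial averages separate cleanly, which is exactly why the experiments described here repeat each condition many times.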
[00:17:17] What this tuning curve means is: on the y-axis is the firing rate, and on the horizontal axis is the movement direction. This neuron prefers to fire the most when the movement is, say, 180 degrees relative to some reference, and the firing gradually goes down from there. That's the first thing scientists found about how a single neuron encodes movement information.
[00:17:44] If you measure multiple neurons, you find that each neuron can encode quite different information. For example, this green neuron's tuning curve is slightly shifted to the right, and its magnitude is shifted down, so its preferred direction is around maybe 250 degrees.
[00:18:00] Now, with two neurons you can actually decode the intended movement direction. With a single neuron, suppose I measure a firing rate of around 30 spikes per second; there could then be two movement directions, 120 and 240 degrees. But with a second neuron we can eliminate one. Suppose we measure the second neuron at around five spikes per second; then we can pinpoint that the movement direction is actually 120 rather than the other one.
[00:18:42] However, we know that neurons are noisy, so we can't really tell the movement direction exactly using two neurons. For example, in the third panel here, suppose the ground truth, the actual firing rates, are those gray lines, but due to noise the measured rates are slightly shifted to the dashed lines. You can see that originally we could decode the movement direction as 120.
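Before dealing with noise, here is a numerical sketch of the noiseless two-neuron disambiguation described a moment ago. The tuning parameters are invented so that 30 Hz is ambiguous between exactly 120 and 240 degrees, matching the lecture's example; neuron 2's parameters are likewise made up:

```python
import numpy as np

def cosine_tuning(direction_deg, baseline, amplitude, preferred_deg):
    """Firing rate peaks at the preferred direction and falls off
    as the cosine of the angular difference."""
    return baseline + amplitude * np.cos(np.deg2rad(direction_deg - preferred_deg))

directions = np.arange(0, 360, 5)
# parameters chosen so a 30 Hz reading is ambiguous between 120 and 240
rate1 = cosine_tuning(directions, baseline=20, amplitude=20, preferred_deg=180)
rate2 = cosine_tuning(directions, baseline=15, amplitude=10, preferred_deg=250)

true_dir = 120.0
measured = np.array([cosine_tuning(true_dir, 20, 20, 180),
                     cosine_tuning(true_dir, 15, 10, 250)])

# neuron 1 alone: two directions produce the same rate
candidates = directions[np.isclose(rate1, measured[0], atol=0.5)]
# both neurons: a least-squares match over the grid pins it down
err = (rate1 - measured[0])**2 + (rate2 - measured[1])**2
print("candidates from neuron 1 alone:", candidates)      # [120 240]
print("decoded with both neurons:", directions[np.argmin(err)])  # 120
```

The second neuron's different preferred direction breaks the symmetry of the first neuron's cosine curve, which is the whole point of recording more than one cell.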
[00:19:18] In that noisy case, though, there are now four possibilities; we cannot uniquely determine the direction. Still, you can see that it's probably more likely that the direction the monkey is trying to move is around 120, rather than the one around 50 or the one greater than 240. So how do we deal with this kind of noise in neurons? How can we still accurately decode the intended movement from these multi-neuron recordings?
[00:19:59] I think we can use machine learning here. We can treat this as a classification problem. In this plot, each dot is a firing-rate combination of the two neurons, and the color represents the intended movement direction. If you train a machine learning classifier, you can draw decision boundaries: if a new measurement's firing rates fall into this region on the right, then we know the monkey is probably trying to move in the left direction.
[00:20:49] Okay, so to recap: we can do this kind of single-neuron measurement, we can measure the firing rates of multiple neurons, and then, by training a machine learning model on the neural data, we can infer the likely movement direction. This is how we're going to build up to actually building a brain-computer interface. Any questions?
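The classification idea above can be sketched with about the simplest possible decoder, nearest centroid. The cluster centers, noise level, and class labels are all invented for illustration; the lecture does not specify which classifier is used:

```python
import numpy as np

rng = np.random.default_rng(2)

# invented training data: firing rates of two neurons over many trials,
# labeled with the intended movement direction (0=right, 1=left, 2=up)
centers = {0: (40.0, 10.0), 1: (10.0, 40.0), 2: (25.0, 25.0)}
X, y = [], []
for label, center in centers.items():
    X.append(rng.normal(loc=center, scale=4.0, size=(50, 2)))
    y += [label] * 50
X, y = np.vstack(X), np.array(y)

# nearest-centroid classifier: its decision boundaries are the
# perpendicular bisectors between class centroids
centroids = np.array([X[y == k].mean(axis=0) for k in range(3)])

def decode(rates):
    """Return the class whose centroid is closest to the measured rates."""
    return int(np.argmin(np.linalg.norm(centroids - np.asarray(rates), axis=1)))

print(decode([38.0, 12.0]))  # lands in the "right" cluster -> 0
```

A new measurement is assigned to whichever colored cloud it falls into, which is exactly the decision-boundary picture described above.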
[00:21:14] [Student] For all this data, you mentioned neuron one as a very specific number; how do you pinpoint which neuron to start measuring?
[00:21:21] [Instructor] Yeah, so here "neuron one" rests on an assumption: we assume each tiny electrode is measuring exactly one neuron, and that the electrode stays fixed, always measuring the firing of that same neuron. In the real case it's not always like that, because the brain is a soft structure, so implanted electrodes can move a little and end up measuring different neurons. That's one of the challenging problems of BCI: how to deal with that kind of recording change.
[00:22:04] All right, let's go back. So now we know that we can put electrodes into the brain, into the motor cortex, measure some signals, understand how the neurons encode those signals, and build a machine learning decoder to decode them. So we basically have the methods to build a brain-computer interface that can interpret what a still fully functioning brain is trying to do.
[00:22:36] One more thing: how can we record these signals? This is a very complicated figure, but don't worry about all the details. What I'm trying to show is that there are a lot of different technologies you can use to record brain signals, and you can think about them in a two-dimensional way: the y-axis is the spatial resolution.
[00:23:17] The higher up you go on the y-axis, the larger the region of the brain you're measuring: really high up means you can only measure, say, the average activity of a very large brain area, whereas going down the y-axis means you can measure at very fine grain, such as single neurons. The horizontal axis is the temporal resolution. With a technology like single-neuron recording, you can measure the electric potential of a single neuron at essentially every time point, for example every millisecond. With a recording technology such as fMRI, which measures the blood flow in a small brain region, you can only measure the blood-flow changes in that area about every 0.5 or 1 second.
[00:24:30] That is really an average of a lot of information, because neurons fire really fast: the electric potential change of a neuron happens on the order of one millisecond. If you can only measure at around one-second resolution, you are averaging and smoothing out a lot of information. So ideally we want something with both high spatial resolution and high temporal resolution.
[00:25:03] What we use right now, in a lot of the clinical trials in our lab, is this kind of multi-electrode array.
[00:25:20] Each electrode here is like a tiny needle that can measure maybe the signal of a few neurons, and you put these needles into a tiny square about the size of a fingernail. You can then measure maybe on the order of hundreds of neurons. All right, so now we have devices to measure neurons; let's go through a more concrete example of how we do this.
[00:25:55] Suppose someone has, say, a spinal cord injury and has lost the connection to his body, but his mind is still fully functioning. The question is: what kind of information can we still decode from his motor cortex, so that we can use it to control either his own arm or an artificial arm?
[00:26:29] What we're going to do is put these tiny micro-electrode arrays into his motor cortex, really penetrating into the surface of the motor cortex. Each electrode, as you see here, is a tiny needle, and those gray triangles are the size of a neuron, so each electrode is measuring the local field potential of multiple neurons around it.
[00:27:05] Then we can pass all this information in real time to a computer through this kind of wire. What we get on the computer is, for example, each block being the measurement from one electrode. And if we do some behavioral experiments, as we just showed, we can probably figure out the tuning curve for each electrode; for example, this one's preferred direction is probably to the left.
[00:27:51] So we can repeat the behavioral experiments for the other channels.
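One classic way to combine per-channel preferred directions like these into a single decoded direction, not necessarily the decoder used in the lab's work, is the population vector: weight each channel's preferred direction by how far its rate sits above baseline. A sketch with invented, noiseless tuning parameters:

```python
import numpy as np

# invented channels whose preferred directions evenly cover the circle
preferred = np.deg2rad(np.arange(0, 360, 45))
baseline, amplitude = 20.0, 15.0

def channel_rates(direction_rad):
    """Noiseless cosine-tuned rates every channel would report."""
    return baseline + amplitude * np.cos(direction_rad - preferred)

def population_vector(rates):
    """Sum each channel's preferred-direction unit vector, weighted by
    its rate above baseline; the angle of the sum is the decoded direction."""
    w = rates - baseline
    x = (w * np.cos(preferred)).sum()
    y = (w * np.sin(preferred)).sum()
    return np.rad2deg(np.arctan2(y, x)) % 360

decoded = population_vector(channel_rates(np.deg2rad(160.0)))
print(round(decoded, 1))  # 160.0: exact recovery with uniform, noiseless tuning
```

With uniformly spaced preferred directions and no noise the estimate is exact; with real, noisy, unevenly tuned channels it becomes biased, which is one reason trained ML decoders like the one described next are used instead.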
[00:28:01] We can then train an ML decoder to figure out what each channel is encoding, the preferred direction for each channel. Once we have the decoder trained, at test time we can ask our participant, who has the array implanted in his brain, to imagine moving his hand in some direction, and the decoder tries to figure out which direction he intends. That's the basic idea.
[00:28:32] Let me go to a demo. This is one of the research results coming out of our lab in 2017. Here you see a participant typing on a virtual keyboard with her mind, and the bottom shows the typing speed, measured as correct characters per minute. It peaks around 40, and on average it's maybe around 20.
[00:29:13] I think this is really amazing.
Think about people who used to have to communicate using this kind of letter board; now, with this brain-computer interface, she can fully communicate by herself through a computer. That's a huge improvement over the board.
[00:29:39] [Student] Does the person open her eyes or close them?
[Instructor] She keeps her eyes open.
[Student] So is there any eye tracking involved?
[Instructor] Not for this experiment.
[Student] So even if she closed her eyes, would it still work?
[Instructor] Yeah, it would still work, but she wouldn't have the visual feedback; she wouldn't know where she's typing.
[00:30:02] [Student] How about if she came up with a character in her mind, E or R, without looking at the keyboard?
[Instructor] Okay, that's something I'm going to show next.
[00:30:13] [Student] How do you know whether it's the person who mistyped or whether it's the machine that's not capturing the correct character?
[00:30:29] [Instructor] What do you mean by "correct"? Oh, I see; that's a good question, let me clarify. The task here, maybe it's not readable, but the task is basically that she is copying a sentence, so we know the ground truth and we can measure the error rate.
[00:30:51] [Student] How does the clicking motion, or the selection motion, work? Is it easy to distinguish? Is there a certain way of knowing the user is pressing down, or does she visualize something like a mouse?
[00:31:09] [Instructor] That's a really good question. As I just mentioned, we can decode movements and we can also decode different gestures: say, a hand gesture, or moving her elbow. So she can imagine different motor movements, and we can decode those movements and map them to, say, a click signal or other signals.
[00:31:34] [Student] What if the person looked at the keyboard, remembered it, and then closed her eyes; would it still work?
[00:31:49] [Instructor] I think that's hard; that's even hard for me to do, right? Can you remember the keyboard and then just control, say, a mouse?
[Student] I use a keyboard every day, so I definitely remember the layout in my mind, and I could just close my eyes.
[Instructor] But this is a virtual keyboard, not a physical keyboard, so you can't use your muscle memory. And maybe one thing I should clarify: the mental image for her is controlling something like a mouse. She's not doing touch typing; she is moving, say, a mouse cursor.
[00:32:33] Let's move on.
is basically just a showcase that, building on all the knowledge we have learned about the brain, we can decode attempted movements from people, like, I forget her name, but I think her code name is T6, and really help these people regain communication through this kind of BCI. [00:33:07] And as I mentioned earlier, you can also use a BCI to control robotic arms. For example, this is a participant at, I think, Caltech. [00:33:25] He's using his mind to control this robotic arm, which grabs him a drink. [Applause] All right. [00:34:12] So you can also do things like restoring writing abilities. I think someone just mentioned this just now: maybe we can try to restore different
modalities of communication. For example, just now we were using movements: by restoring movements, we can control a computer. But how about directly restoring the ability to do handwriting? Handwriting is a very natural way to communicate. [00:34:48] Frank Willett, a research scientist from our lab, published a paper in 2021 showing that you can actually build this kind of handwriting BCI, and he showed that it's really fast compared to the previous approach. [00:35:09] Okay, so now we have seen that there are different ways to restore communication. Here is a measurement of different ways of communicating. On the very left is the sip-and-puff interface, which is very slow. That's for someone who cannot really move but can still do some breathing; they can do a sip and a puff to say yes and no to communicate. That's really slow, maybe around five words per minute. For a normal person, I'm really surprised that on average a person can handwrite maybe 13 or 14 words per minute; that's really slow, but maybe that's just the average speed. On the very far right is natural communication, which can reach up to 150 to 160 words per minute. [00:35:58] To put everything into context: the 2D cursor BCI I just showed you can do about eight words per minute, and the handwriting BCI can do around 18 words per minute. So compared to, say, letter boards or this kind of eye tracking, we have really made a lot of advancement here, but it's still far from natural conversation speed.
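To put those rates side by side, here is the comparison as a quick numeric sketch. The figures are the approximate numbers quoted above, with midpoints taken where a range was given; the dictionary labels are mine, not terms from the talk.

```python
# Approximate communication rates quoted above, in words per minute (wpm).
rates_wpm = {
    "sip-and-puff": 5,
    "2D cursor BCI": 8,
    "able-bodied handwriting (average)": 13.5,
    "handwriting BCI": 18,
    "natural speech": 155,  # midpoint of the 150-160 wpm range
}

# Express each interface as a fraction of natural conversation speed.
for name, wpm in rates_wpm.items():
    frac = wpm / rates_wpm["natural speech"]
    print(f"{name:35s} {wpm:5.1f} wpm  ({frac:.0%} of natural speech)")
```

Even the fastest interface here, the handwriting BCI, sits at roughly 12% of natural conversation speed, which is the gap the rest of the talk is about closing.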
The next question, basically, is: how can we get there? Can we actually restore speech with a brain-computer interface? [00:36:47] To get there, I think there's a huge barrier. First, language processing in the brain is a really complicated process. For example, here are shown all the brain areas involved in language, and we still don't know exactly how this happens; this is just our best guess at how language is processed in the brain. On the very right, you see a lot of brain regions involved with knowledge and reasoning; in the center are areas involved with semantics and syntax; and on the very left is the perception of speech, and then the production of speech. Language is really complex. [00:37:33] So maybe the hope here is that we can start with the motor cortex, which handles the motor planning of language, because from the things I've just shown you, we already know how the motor cortex encodes movements. We also know that in order to produce language, we need to speak. So maybe we can put some electrodes into the part of the motor cortex that controls our orofacial muscles, try to decode some information there, and then see if we can actually restore speech. [00:38:04] But actually being able to restore speech is, I think, more complicated than restoring movements. What I'm trying to say is that the production of speech is really a lot of complicated movements, and it's really rapid; it's more than just moving your hand in a certain direction. So restoring speech is much harder than just decoding the movements of each articulator. So
instead of trying to decode the movements of each articulator, which is very hard (and for people who have lost speech, it's basically very hard to actually measure their speech articulator movements), maybe we can try to decode discrete phonemes instead of continuous speech articulation. We know that all languages can be decomposed into basic phonetic units. For English, for example, we know there are different vowels and different consonants, and they are correlated with how you place your tongue in your mouth and how you place your different speech articulators. So here, instead of decoding the actual articulator movements, we are trying to decode these discrete phonemic tokens. [00:39:28] And there is previous work showing that if you put some electrodes on the motor cortex, you can actually tell the differences between different phonemes by measuring the electrical activity in the motor cortex. So there is hope of being able to restore speech just by putting electrodes in the motor cortex. [00:39:55] And indeed, in 2021, researchers from UCSF actually demonstrated that it's feasible to build this kind of small-vocabulary speech BCI with ECoG recording technology. The difference between ECoG and the microelectrode arrays I just showed you is that whereas the microelectrode arrays actually penetrate into the cortex, ECoG stays on the cortex, so it doesn't record single-neuron firing but rather some averaged neural activity over a small region. So compared to microelectrode arrays, they have
a slightly lower resolution. That's why their prototype is this kind of small-vocabulary BCI, which can only decode 50 words at around maybe 75% accuracy. But this is still very exciting work that showcases that you can actually achieve this kind of speech decoding by putting some electrodes into the motor cortex. [00:40:58] All right, so now I'll go into the research done in our lab, which is to build a high-performance speech neuroprosthesis. [00:41:12] In 2022, we recruited a participant, code-named T12, who has ALS. T12 used to be a very active person; she likes to ride horses and likes to jog. But because of ALS, a couple of years ago she basically couldn't do all those things she used to enjoy. And unlike most ALS patients, her symptoms started with the orofacial
movements first, so she can still move her hands a little bit, but she cannot really speak intelligibly. So we decided to put four microelectrode arrays into her brain: two arrays into her motor cortex, and two arrays into the part of Broca's area that is supposed to be involved with language planning. The hope here is that we want to decode both the execution of speech, which is how you control your speech articulators, and also maybe some high-level planning of speech. That's why we wanted to put arrays into two different brain regions. [00:42:30] So the first thing we did after we put the arrays in her brain was some behavioral tests, to see what kind of information we can decode from those arrays. [00:42:44] Here is the first result we got: we are
trying to classify different tasks here. The first plot shows us using these four arrays to classify orofacial movements. The dashed line is the cue at which she actually executes those orofacial movements, and before the dashed line she is preparing to do those movements. You can see from these two lines that you can predict those movements well above chance using the two arrays in the motor cortex, whereas with the two arrays in Broca's area you basically can't predict much above chance, especially during the execution of those motor movements. [00:43:39] And for single phonemes, where we instruct our participant to speak single English phonemes, you can also predict those much higher above chance using the two arrays in the motor cortex, and likewise for single words. [00:43:57] So what these results tell us is that of the arrays we put into T12's brain, the two arrays in the motor cortex contain a lot of information about the phonemes being articulated and also the words being articulated, but the two arrays in Broca's area, which were supposed to help us figure out the planning of speech production, don't contain much information. That's really intriguing to us, and we're still trying to figure out why that's true. [00:44:26] So for the rest of this talk, we'll mostly be using only the two arrays in the motor cortex. [00:44:34] Now we know that there is phonetic information encoded in those two arrays; what we're going to do next is actually try to build a real-time speech-to-text BCI. So what we're going to do is,
let me just show you a video demo first, to get a sense of the BCI we're trying to build. [00:45:05] So here is our participant. She's connected to our decoding machines through this cable, which transmits her neural signals in real time to the decoding machine. On the screen you can see a sentence that we instructed her to copy, basically to read out. Once the square turns green, she will try to speak, and what you see below is what the machine decoded. [00:45:42] "I don't want to call for a babysitter." "That would be good." [00:46:02] "I did well in school." "I don't see much pollution." [00:46:20] All right, so that's almost perfect decoding from her, and you can tell from the video that although she can vocalize, it's not really intelligible, because of her limited orofacial muscle movements; but we can still decode from her brain signals what she is trying to say. [00:46:37] The task I just showed was copying a sentence; in this next video she is trying to answer a question. [00:46:59] "I have a very good friend and sister." [00:47:02] We also tried different modalities. When she attempts to articulate, it's actually very tiring for her to produce those sounds, so what we tried here is instructing her only to move her mouth, to move her articulators, but not to vocalize. We call this silent speech, and we can still decode pretty well using this silent-speech modality. [00:47:40] "I do not have much to compare it to." [00:47:55] "I, as much as I would like to, either." [00:48:03] Okay, so let's move on to more technical details about how we built this speech BCI. As I just mentioned, the first
thing we need to do is build a decoder, and before building the decoder we need to do some data collection. Here is our research scientist Frank sitting next to T12, asking her to read the sentence on the screen, and we record her neural activity as she speaks that sentence. So we collect paired data, where the input is the neural activity and the output is the target sentence we want to decode. We basically have to go to where T12 lives, run data-collection sessions there, and then test the decoder. [00:48:53] The way we collect data is this: because we have very limited time and cannot ask T12 to speak a huge number of sentences, we divide data collection into a block structure, where we instruct her to speak 40 sentences every block, then she takes a break, then we collect another block. Data collection lasts about 100 minutes for every research session, and then we train a decoder; that takes maybe 10 to 20 minutes, it's really quick. After training a decoder, we start actually evaluating its performance, by asking our participant to speak some new sentences and seeing how well we can decode on that new set of sentences. [00:49:43] In total, we ran experiment sessions over maybe three months and collected about 10,000 sentences, drawn from the Switchboard telephone-conversation corpus; I really want to emphasize that we want to decode this kind of conversational English.
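Evaluation on those held-out sentences is typically scored as an error rate between the decoded text and the ground-truth sentence. A common choice is word error rate: word-level edit distance divided by reference length. The talk doesn't spell out the metric's implementation, so this is a minimal sketch with invented example sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# A perfect decode, then a decode that dropped the last two words.
print(word_error_rate("i did well in school", "i did well in school"))  # 0.0
print(word_error_rate("i do not have much to compare it to",
                      "i do not have much to compare"))
```

Note that WER can exceed 1.0 when the decoder inserts many extra words, which is why it is an error rate rather than an accuracy.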
Once we have the data, we can try to see how to design a decoder that best solves this task. [00:50:15] So let's first define the problem. We have some neural features as input: a time series, which you can think of as maybe similar to audio, where at each time point we get some feature vector. [00:50:32] The output of the decoder is a sequence of words; we know that she's trying to speak some sentence, so we are trying to decode the words from these input neural features. [00:50:49] As I mentioned earlier, instead of directly decoding words from the input, maybe we want an intermediate decoding target of phonemes. The reason is, first, we know there are only about 40 phonemes in English, which is a much smaller set than the number of words. If you want to train a decoder that directly decodes words, you need much more data to cover all the possible words, whereas for phonemes you probably need far less data to cover all 40 phonemes. So instead of directly decoding the words, we decided to decode an intermediate representation of phonemes from the input neural features. [00:51:29] Okay, so basically there are two decoders we want to design: the first is a neural-to-phoneme decoder, and the second is a phoneme-to-word decoder. Those are the two decoders we have to design for this task. Let's focus on the neural-to-phoneme decoder first. [00:51:47] I think at this point in the class, we probably know that we can treat this as a sequence-to-sequence problem: the input is a feature sequence and the output is a token sequence. And for sequence-to-sequence problems, we probably know that we can use encoder-decoder models to solve them. However, an encoder-decoder model is actually more powerful than we actually need here, because an encoder-decoder model
An encoder-decoder model allows arbitrary alignments between inputs and outputs. That's really helpful for tasks such as machine translation, where some languages have different word orders than other languages. But here we know that the alignment is monotonic, unlike machine translation, where the alignment can be arbitrary. Monotonic means that, for example, the first few neural features probably correspond to the first phonemes in the output sentence rather than the last ones.

To handle this monotonic alignment, we can borrow an idea that people developed for machine learning tasks such as handwriting recognition and speech recognition, where the task is also to decode a letter or phoneme sequence from, say, speech features. The technique we're going to use is called connectionist temporal classification, or CTC. For people who have taken CS224S, you probably already know what this means, but I'll give a brief introduction. I don't have too much time, so I'll go over it quickly. Given some input sequence, the goal of CTC is to decode an output sequence when we don't know the exact alignment between them, and usually the input and output have a length mismatch. For example, in speech recognition the input could be several thousand frames long.
Each frame corresponds to very fine temporal resolution, features recorded at, say, every 20 milliseconds, whereas the output has only a few tokens. That's a huge length mismatch. What we can do is still use an RNN or Transformer model to predict an output token at each time step, and then figure out a way to fill in spacers between the output tokens so that the output sequence has the same length as the input sequence. What the CTC loss does is introduce an additional blank token as output. With this blank token, here is an example output of the CTC classifier: first you merge repeated tokens, and then you take out the blank tokens, and what you get is a much shorter sequence that corresponds to the output. So the CTC loss lets you solve a sequence-to-sequence problem that has different input and output lengths and also has this monotonic alignment property. Let me skip the details of how you actually train with a CTC loss.

Now suppose we have this CTC loss to train our model with. The next question is what kind of neural network decoder to use for this task. At this point in the class I think most of you are convinced that Transformers are really powerful, so there's no reason for me to say more about that.
But in this case we don't want to use a Transformer. The reason is that we don't have a large dataset: as I mentioned, we only have about 10,000 sentences. Also, Transformers are really good at dealing with long-range dependencies, but for speech production there's no real need for long-range dependencies. So let's go back to the very simple RNN. We know that RNNs work on small datasets and handle short-range dependencies pretty well, and another nice thing about RNNs is that they're very efficient to run in real time: you can run even a fairly complicated RNN very efficiently on your mobile phone.

One of the most popular RNNs we've learned about is the LSTM. It uses a memory state to store long-range information, and then uses input, forget, and output gates to control how you read from and write to that memory state. But the LSTM is also quite complicated. There's a variant of the LSTM called the GRU, the gated recurrent unit. The idea is to combine the memory state and the hidden state into just one hidden state, and by doing that you can also remove some gates. So the GRU is a simpler version of the LSTM that works really well when you have a small dataset, and that's why we use a GRU instead of an LSTM for our task.

So now we have a neural network model to decode phonemes, and we know how to train it. At inference time, by which I mean testing time, you pass new neural activity into the decoder and it outputs phoneme probabilities.
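The GRU update just described, one shared state and fewer gates than an LSTM, can be sketched with plain numpy. This is a minimal sketch: biases are omitted, and the exact gate parameterization (including whether the final blend is written as `(1-z)*h + z*h_tilde` or the reverse) varies across references, so this is not necessarily the lecture's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step. A single hidden state h plays the role of
    both the LSTM's memory cell and its hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde         # blend old and candidate
```

Because there is only one state vector and two gates instead of three, a GRU has fewer parameters per hidden unit than an LSTM, which is one reason it tends to behave better on small datasets like 10,000 sentences.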
[00:58:24] decode out some like phum probabilities right so there's a maybe at the first [00:58:27] right so there's a maybe at the first time stamp the highest probability is I [00:58:30] time stamp the highest probability is I uh the problem here is that how do I [00:58:32] uh the problem here is that how do I figure out the most likely output [00:58:35] figure out the most likely output sequences giving this fun probabilties [00:58:37] sequences giving this fun probabilties right so basically the task is to find [00:58:39] right so basically the task is to find the most likely output sequences here [00:58:43] the most likely output sequences here um I think for this problem I think [00:58:47] um I think for this problem I think since we have already did something [00:58:49] since we have already did something similar in the assignment three which is [00:58:51] similar in the assignment three which is that we can use beam search to figure [00:58:53] that we can use beam search to figure out the most likely sequence here [00:58:55] out the most likely sequence here however this one caveat with the beam [00:58:57] however this one caveat with the beam search when you're applying it to the [00:59:00] search when you're applying it to the CTC LW but I'm not going to expand it [00:59:03] CTC LW but I'm not going to expand it too much here um yeah so let's just skip [00:59:07] too much here um yeah so let's just skip over [00:59:09] that now suppose that we can use the [00:59:12] that now suppose that we can use the beam search to find the most likely fum [00:59:15] beam search to find the most likely fum sequences how do we convert that fume [00:59:17] sequences how do we convert that fume sequences into words right so that's CU [00:59:19] sequences into words right so that's CU like we eventually want to decode uh a [00:59:21] like we eventually want to decode uh a sentences but not just like a a sequence [00:59:24] sentences but not just like a a sequence of vums so 
One thing you can do is modify the beam search: if you have an English pronunciation dictionary that maps each word to its pronunciation, then during beam search, whenever you decode a phoneme sequence that corresponds to a word, you can replace that phoneme sequence with the word.

However, you can actually do better by using a language model. Here's the decoding equation. The x is the input and Y is the decoded word sequence. Not all word sequences have the same likelihood: suppose I decode a sentence like "I can spoke". That doesn't seem syntactically correct. So we can use a language model to evaluate the probability of each decoded hypothesis, and use that as a sort of weight on the final decoding probabilities. We're adding this extra term, the probability of the sentence, which you can decompose into the probability of each token given its previous tokens, and you can measure that with any language model.

Now, another term we want to add is a word insertion bonus. One problem with this language model probability of a sentence is that longer sentences will have smaller probabilities than shorter ones; that's just a property of how the probability decomposes. So we want to balance the length of the decoded sequence by adding a word insertion bonus. What we eventually optimize is this equation.
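The equation on the slide isn't legible in the transcript, but the combination just described (acoustic score, language model weight, word insertion bonus) is conventionally written as follows; the weight symbols α and β are my labels, not necessarily the lecture's:

```latex
\hat{Y} = \arg\max_{Y}\;\Big[\,\log P(Y \mid X)
        \;+\; \alpha\,\log P_{\mathrm{LM}}(Y)
        \;+\; \beta\,|Y|\,\Big],
\qquad
P_{\mathrm{LM}}(Y) = \prod_{i} P\big(y_i \mid y_1,\dots,y_{i-1}\big)
```

Here X is the neural feature input, Y a candidate word sequence, α scales the language model term, and the β|Y| word insertion bonus counteracts the language model's built-in preference for shorter sentences.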
It combines the probabilities generated by the RNN decoder, a language model term with some weight, and the word insertion bonus, with weights you can tune.

Okay, let's try to put everything together. Suppose you have neural feature inputs: you get these neural features every 20 milliseconds, you pass them through the GRU, and now you have phoneme probabilities. This all happens in real time, so all computation needs to be done within 20 milliseconds. You do a really quick beam search and find that, say, this phoneme sequence corresponds to the word "I" or the word "eye". Here we want to use an n-gram language model instead of a more powerful Transformer language model, because we need to do a lot of evaluations really quickly, within 20 milliseconds. Suppose you have, say, 100 hypotheses and you want to evaluate the probability of all of them. With a Transformer language model such as GPT-3, which is really powerful, you can't do inference that fast. Whereas with an n-gram language model, you can just load everything into memory, and every evaluation is just a memory lookup, so it's really quick. After that you get probabilities out, and you keep, say, the top K hypotheses for the next step of the beam search. That's how we use the n-gram language model in real-time decoding. After that, we use a Transformer language model to rerank all the hypotheses generated with the n-gram language model. This happens once you've actually decoded an entire sentence.
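To make the "every evaluation is just a memory lookup" point concrete, here is a toy bigram scorer. The table, the probabilities, and the back-off constant are all made up for illustration; a real system would use a large smoothed n-gram model, but the scoring loop really is just dictionary lookups, which is what makes it feasible inside a 20 ms budget:

```python
import math

# Toy bigram table with precomputed log-probabilities (made-up values).
BIGRAM_LOGP = {
    ("<s>", "i"): math.log(0.20),
    ("i", "can"): math.log(0.10),
    ("can", "speak"): math.log(0.05),
    ("can", "spoke"): math.log(0.0001),
}
UNSEEN = math.log(1e-8)  # crude stand-in for proper back-off smoothing

def bigram_score(words):
    """Log-probability of a word sequence under the toy bigram model.
    Each step is a single dict lookup, so scoring is O(len(words))."""
    score = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        score += BIGRAM_LOGP.get((prev, cur), UNSEEN)
    return score
```

Under this model the grammatical "i can speak" outscores the lecture's example of an ungrammatical hypothesis, "i can spoke", which is exactly the signal the reranking weight exploits.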
Say I keep the most likely 100 sentences; at that point I can use the Transformer language model, which can evaluate the probabilities of just 100 hypotheses in maybe half a second, and get a better probability estimate for those sentences.

So, putting everything together, this is how the entire system from the video I showed you earlier works: we can now use this multi-stage machine learning model to accurately decode what the person is trying to say, and build a high-performance neural speech prosthesis. We're almost out of time, so I'll skip the evaluation part. Evaluation, meaning how we measure performance, is basically measured in word error rate. We also have all the data open as a competition.
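Word error rate, the metric just mentioned, is conventionally computed from word-level edit distance: substitutions, insertions, and deletions against a reference transcript, divided by the reference length. A minimal sketch (not the project's actual evaluation code):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / len(ref),
    via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # delete everything
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert everything
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

A 25% word error rate, the figure given later for this system, means roughly 25 of every 100 spoken words come out wrong; note WER can exceed 100% if the hypothesis contains many insertions.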
So if you're really curious about this, you can try playing around with it. I think the most exciting thing about doing this research is that you actually get to see how your research can impact people. This is a quote from our participant T12, and this is how she reacted when this thing first worked for her. It's really exciting that she can speak after so many years of silence.

Okay, so in the last five minutes let me go a little bit into what I think is the future of BCIs. What I've just shown you is that using BCIs we can help people either restore movement or restore communication. One exciting direction, I think, is this kind of multimodal BCI. Here's work published by a group at UCSF: they're trying to decode not only the phonemes but also the actual speech.
They also decode articulatory gestures, so you can actually drive 3D avatars. And as I just mentioned, the final goal of this speech BCI is to deploy it so people can use it every day, just as we use our phones. So here's a more recent development in speech BCI by our collaborators at UC Davis. What they do is put four arrays into the motor cortex, meaning they can get better signals than we do. For reference, because I forgot to mention it earlier, the final performance of our system is around 25% word error rate, meaning that for every 100 words the participant says, maybe 25 of them are wrong. In this latest work at UC Davis,
they show that you can actually get close to zero word error rate within a few sessions by training the system more and more continuously. So it's very close to being an actually usable system right now. Here's a video of their participant using the system to speak. It's very accurate; he's actually using the system every day now to communicate with his family. It really cannot be overstated how important that is.

All right, so the most exciting direction, at least the one happening in our lab, is that we're trying to restore more effortless and natural communication by decoding inner speech. With all the speech BCIs I've just shown you, I think the maximum speed we can reach is maybe 60 to 70 words per minute, but that's still
far slower than natural conversation, which happens at about 150 words per minute. One of the reasons is that these participants have lost speech for so many years that if we ask them to attempt to speak, it's very hard for them to speak at a normal rate. However, we know that a lot of people have this kind of inner speech; we're kind of talking to ourselves in our minds. I think the research question here is whether we can decode that sort of inner speech. Some preliminary work from a collaborator in our lab shows that you can actually do so. For example, her results show that if you decode attempted speech, which is what I just showed you, you can reach, say, 90% accuracy on a small set of words. But if you ask
the participant to imagine moving her mouth, or to imagine a voice in her head, you can still do pretty well. It's not as good as attempted speech, but still much better than chance. So I think this shows it's possible in the future to decode this sort of inner speech and fully restore natural communication to participants like T12.

But I think there's a more controversial issue with this kind of inner speech: what if you can decode something like private thoughts or private memories that someone doesn't want to express? That's a very difficult question. And also, as I just mentioned, not everyone has inner speech. When you think about it, speech is just one external representation of
[01:09:27] it's just a linear representation that you put out through this medium of speech, whereas your thoughts could be more complex, more multi-dimensional. So it's very hard to decide where you want to draw the line on how much of that inner content you want to decode. But I think these are also very exciting opportunities for us to learn more about speech processing in the brain. [01:09:51] And as I just mentioned, if you want to decode this kind of inner speech, then you also face a lot of new ethical questions that are really thought-provoking. For example, should we allow BCIs to read out memories? What if we decode something you don't want to say? How can we deal with that? On the other hand, what if we can actually use these systems to help
people who have lost their memories due to Alzheimer's disease? [01:10:26] Or what if we can read out some subconscious fear that can help people with their psychotherapy? How should we decide whether to allow this kind of inner-speech decoding, or memory decoding, or not? [01:10:41] And I think a deeper question is: what if one day we could do this kind of cognitive enhancement with BCI? For example, what if you can move a robotic arm much faster than your real arm? Is that allowed? Or can you actually purchase a memory so that you can skip this CS224n class? I think that's really a hard question to answer, but I just want to throw it out there, because it's not only a BCI problem; we're facing this problem right now. There are a lot of ways you can enhance
yourself. [01:11:19] So I guess what I'm trying to say here is that BCIs will raise a lot of new ethical questions. [01:11:25] I'm taking this quote from this textbook here, and what it's trying to say is that we're not really looking for an answer here; the point is that we want to keep this in discussion with scientists, with engineers, and with policy makers, to make sure that we can use BCIs to help the people who really need them, while being aware that there could be a lot of potential issues. [01:12:03] So just to give a summary: I hope I've convinced you that BCI is a really cool new research direction at the intersection of AI, machine learning, neuroscience, and neuroengineering. We'll soon have
[01:12:18] this kind of system that can really help people to be able to communicate again, and it's also a really cool opportunity for us to understand how the brain processes language. I think the most important thing is that we are bringing hope to people like Hardward and T12. [01:12:35] All right, thank you everyone, and special thanks to the people in my last

================================================================================ LECTURE 015 ================================================================================
Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty
Source: https://www.youtube.com/watch?v=I0tj4Y7xaOQ
---
Transcript

[00:00:06] Okay, let's just get started. Welcome to lecture 14, everyone. I hope you've been doing well and managing all of the various deadlines. Today we'll be looking at two interesting applications of language models. In the first half, I'll be talking about using language models to reason in domains like math and geometry, doing things like spatial reasoning, and
then, in the second half of the lecture, I'll be talking about how you can use language models to take actions in grounded environments. [00:00:47] A little bit of a disclaimer: a lot of today's content is research that was done in the last 3-4 years, so there are plenty of unanswered questions and not a lot of answers, and maybe we can have more of a discussion around these topics. [00:01:07] Okay, so let's get started with reasoning. Experts like to start a lecture on reasoning by talking about the various kinds of reasoning, so I'm going to do that here. At a high level, it's really about using facts and logic to arrive at an answer. More concretely, there are three distinct categories of reasoning that we can talk about. The first one, which is probably the one that most
of you are familiar with, is deductive reasoning, where we go from rules of logic along with a premise to a firm conclusion. [00:01:45] An example could be that we have the sentences "all mammals have kidneys" and "all whales are mammals", and then we can come up with the conclusion "all whales have kidneys"; and we could do multiple such steps of reasoning. [00:02:00] A second form of reasoning is inductive, where given observations we derive conclusions. Maybe we've learned from experience that every time we see a creature with wings, it is usually a bird; then say we observe a creature with wings, and using our experience we can come up with the conclusion that the creature is likely to be a bird. That form of reasoning is inductive. [00:02:32] And finally, we have abductive reasoning, where we're given an
observation and then start drawing possible explanations. [00:02:43] Maybe you see a car that cannot start, and there's a puddle of liquid under the engine, and you start drawing inferences about the situation; one of them could be that the car has a leak in the radiator. [00:03:00] Apart from that taxonomy, we can also think of reasoning in formal and informal terms, where formal reasoning involves using axioms and rules of formal logic to derive truth conditions. There's also informal reasoning, which is what you and I probably do every day: we just reason about everyday situations and use common sense to derive conclusions. For most of the lecture, when I say reasoning I will mean informal deductive reasoning, and it's often going to involve multiple steps. [00:03:34] Okay, so let's come back to language models.
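A quick aside: the multi-step deductive pattern described above (chaining "all whales are mammals" with "all mammals have kidneys" to get "all whales have kidneys") can be sketched as a tiny forward-chaining loop. This is my own toy illustration, not something from the lecture.

```python
# Toy forward chaining over "all X are Y" rules (illustrative only).
# We repeatedly compose rules until no new facts appear, mirroring
# "multiple such steps of reasoning".
rules = {("whale", "mammal"), ("mammal", "kidney-haver")}
facts = set(rules)
changed = True
while changed:
    changed = False
    for (a, b) in list(facts):
        for (c, d) in list(facts):
            if b == c and (a, d) not in facts:
                # all a are b, all b are d  =>  all a are d
                facts.add((a, d))
                changed = True
print(("whale", "kidney-haver") in facts)  # -> True
```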
[00:03:44] We've learned in lectures 9, 10, and 11 that large language models are really good at coming up with plausible continuations of text that reflect human preferences and constraints. Today we'll try to answer whether they can also reason. [00:04:04] One of the most basic ways we can try to answer this question is via prompting, and we've probably already seen this: there's this popular method called Chain of Thought prompting, where you get a language model to produce a reasoning step before producing an answer, and we can do this by providing some in-context examples with explicit reasoning steps that the language model can then mimic at test time. So that's Chain of Thought prompting. [00:04:34] Another rather surprising property of language models is that sometimes you don't even have to show them these in-context
examples; you can just prompt them with the sentence "let's think step by step" and get these reasoning rationales before they produce an answer. Okay, so that's pretty simple, but let's keep going. [00:05:01] Another popular way to prompt language models to do reasoning is via self-consistency. Here, instead of greedily sampling a rationale followed by an answer, we're going to sample multiple reasoning paths and correspondingly multiple answers. In the figure on the right, we have a question; what you would normally do with Chain of Thought prompting is greedily decode a rationale and then, conditioned on the rationale, generate an answer. With self-consistency, we sample multiple times: we sample multiple rationales, they all lead to multiple answers, and then we pick
the one that is the most common, the idea being that if an answer keeps appearing across multiple rationales, i.e. the majority of the rationales agree on it, then it's more likely to be correct. [00:05:57] The authors of self-consistency find that on a variety of mathematical reasoning tasks, adding this simple idea of sampling multiple times and doing majority voting improves performance pretty drastically over standard Chain of Thought. [00:06:17] Interestingly, when I saw this result the first time, I thought: okay, this is just like ensembling, which we learned in CS229. The idea there is that if you want to boost the performance of your system, you produce, say, 10 classifiers with different random seeds, each produces a classification decision, and you do majority voting.
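The self-consistency recipe just described is easy to sketch: collect several sampled (rationale, answer) pairs and majority-vote over the answers. The canned samples below stand in for real temperature-sampled chain-of-thought decodes from a model; they're my own toy data, not from the lecture.

```python
from collections import Counter

# Canned (rationale, answer) pairs standing in for several temperature-sampled
# chain-of-thought decodes from a language model (toy data, for illustration).
SAMPLES = [
    ("3 pairs of shoes, 2 shoes per pair: 3 * 2 = 6", "6"),
    ("She has 3 pairs, so 3 * 2 = 6 shoes", "6"),
    ("3 + 2 = 5 shoes", "5"),  # one faulty reasoning path
    ("Two shoes per pair and 3 pairs gives 6", "6"),
]

def self_consistency(samples):
    # The rationales themselves are discarded; we majority-vote
    # over the final answers they led to.
    answers = [answer for _rationale, answer in samples]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(SAMPLES))  # -> 6
```

Unlike classic ensembling, all the votes here come from one model; the diversity comes purely from sampling different reasoning paths.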
[00:06:38] But it turns out that it's doing maybe a little bit more than just simple ensembling: the authors also compared an ensembling approach where the same language model is used with multiple different prompts and you do majority voting there, and it turns out that self-consistency is better than that simple ensembling. [00:07:02] Okay, so earlier today I said that I'd be talking about multi-step reasoning. So far we've looked at math problems and prompting, but not necessarily multi-step reasoning. One of the main aspects of multi-step reasoning is that it involves breaking down a large problem into several subparts, answering each of the subparts, and then combining everything into a solution. This decomposition strategy was integrated into another prompting method called least-to-most
[00:07:36] prompting. The idea behind least-to-most prompting is that, given a question, we're going to first break it down into subquestions, as shown here; then, given these subquestions, the language model answers each of them, and conditioned on its answers to the subquestions it generates the final answer. [00:08:05] This is how it looks for a math reasoning problem. In standard Chain of Thought prompting, you would have a question followed by a rationale and the answer. With least-to-most prompting, which is this decomposition strategy, you take the question and then, instead of directly producing a rationale, you ask the language model to break it down, giving you two different subproblems.
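The two-stage pipeline just described (decompose into subquestions, answer them in order, condition the final answer on the earlier answers) can be sketched like this. `ask_lm` and the canned lookup table are my own stand-ins for real model calls, not anything from the lecture or the paper.

```python
# Sketch of least-to-most prompting; ask_lm is a stub answered from a
# lookup table, standing in for a real language-model call.
CANNED = {
    "Decompose: Amy climbs for 4 min and slides for 1 min. The slide closes in 15 min. How many slides?":
        ["How long does one trip take?", "How many trips fit in 15 minutes?"],
    "How long does one trip take?": "4 + 1 = 5 minutes",
    "How many trips fit in 15 minutes? (given: 4 + 1 = 5 minutes)": "15 / 5 = 3 trips",
}

def ask_lm(prompt):
    return CANNED[prompt]

def least_to_most(question):
    subqs = ask_lm("Decompose: " + question)   # stage 1: break the question down
    context = []
    for sq in subqs:                            # stage 2: answer subquestions in order,
        prompt = sq if not context else f"{sq} (given: {context[-1]})"
        context.append(ask_lm(prompt))          # feeding earlier answers into later prompts
    return context[-1]                          # final answer conditions on the sub-answers

q = "Amy climbs for 4 min and slides for 1 min. The slide closes in 15 min. How many slides?"
print(least_to_most(q))  # -> 15 / 5 = 3 trips
```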
[00:08:36] You then answer both of those subproblems and condition your final answer on the answers to them. Okay, so that's just a prompting method, but one interesting experiment from least-to-most prompting showed that you can sometimes generalize from a small number of reasoning steps to a much larger number of reasoning steps. Here, in this math word problem, there are two reasoning steps, and if we show this prompt to the language model as an in-context example, we see that it continues to generalize even on examples that required more than five steps of reasoning, in a way that's much better than standard Chain of Thought. [00:09:21] But it's not entirely clear if structuring inference in this manner is really fundamental: one of the other results they reported was that, with enough prompt engineering, the rows
corresponding to the best normal Chain of Thought are on par with least-to-most prompting. Still, it's an interesting idea: break problems down into subproblems, solve the subproblems, and then build up a solution based on your answers to them. [00:09:57] Okay, so all of this was different prompting methods to get reasoning behavior out of language models. Can we do something more? One thing we might be interested in is, instead of trying to get really large language models to do reasoning, somehow getting this kind of reasoning behavior into a smaller language model, and one popular approach for doing that is distillation, where maybe you want to fine-tune a smaller LLaMA model by teaching it to imitate a larger LLaMA model. And so that's what we're going
to look at now. [00:10:36] This model is called Orca, and at a high level, Orca fine-tunes a smaller 13-billion-parameter LLaMA language model on explanations produced by GPT-4. [00:10:52] Constructing this data is pretty simple; it has three steps. The first step is that we get a wide variety of instructions from the FLAN v2 collection. FLAN v2 is basically a dataset that accumulates multiple datasets into one collection, and it consists of instructions paired with questions and answers; I'll show an example of this in a moment. Then we're going to prompt GPT-4 or ChatGPT with these instructions along with a system message, and the objective of the system message is to get ChatGPT or GPT-4 to produce an informative explanation along with the answer. So here we have a question about simple data processing, about calculating the
median, and there's a system instruction that says: please justify your steps and answer step by step. In producing its output, the model provides a fairly detailed explanation of how it got to the answer, and what Orca is going to do is use precisely this explanation to fine-tune a much smaller model. [00:12:15] Once we have these explanations, we fine-tune a much smaller 13-billion-parameter LLaMA model on them. [00:12:24] Okay, so far we've looked at math reasoning and grade-school math problems. Let's turn to a different benchmark for reasoning: we're going to look at Big-Bench Hard, which is another dataset for multi-step reasoning. Let's look at some examples from Big-Bench Hard. It consists of multiple different
[00:12:54] subtasks, 23 in total, and I'm going to show a few examples. One of them is evaluating Boolean expressions, where the question is "True and False and not True and True is", so basically: evaluate this Boolean expression. With Chain of Thought, the model can evaluate each of the subexpressions and get to the final answer. [00:13:25] Another example of a task from Big-Bench Hard is date understanding (date, not data, understanding), where the question is: tomorrow is a given date; what is the date one year ago from today, in a given format? It's paired with some options, and again the model can think step by step, following basic Chain of Thought, and come up with an answer.
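Both of those tasks are easy to check mechanically, which is part of what makes them convenient benchmark items; here is a quick sketch of the two computations a correct rationale has to carry out (the concrete dates are my own example, not from the slide):

```python
from datetime import date, timedelta

# Boolean-expression task from above, broken down the way a step-by-step
# rationale would (eval is fine for this fixed toy string).
expr = "True and False and not True and True"
# step 1: not True -> False
# step 2: True and False -> False, so the whole conjunction is False
print(eval(expr))  # -> False

# Date-understanding task: "tomorrow is <date>; what was the date one year
# ago from today, in MM/DD/YYYY?"  The concrete date is made up.
tomorrow = date(2015, 1, 2)
today = tomorrow - timedelta(days=1)
one_year_ago = today.replace(year=today.year - 1)
print(one_year_ago.strftime("%m/%d/%Y"))  # -> 01/01/2014
```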
[00:13:56] So this is kind of the flavor of tasks in BIG-Bench: most of these involve multi-step reasoning, and they're fairly synthetic, but also reasonably hard for language models. Okay, another example is geometric shapes, and this one is pretty surprising, that language models can do anything here. You're given the SVG path element, and, I have no idea what this renders as, but the question is: just given the SVG, what shape are you going to get? And there are a bunch of options, and then again the model, prompted with "let's think step by step", will produce some answer. We don't know if it's correct, but it's going to produce some answer. And so it's basically this dataset covering different kinds of reasoning: spatial reasoning, date understanding, evaluating Booleans.
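For intuition about what the geometric-shapes task requires, here is a toy heuristic, not the benchmark's solver: it only handles straight-line `M`/`L` commands and counts the vertices of the polygon.

```python
import re

def count_sides(svg_path: str) -> int:
    # Collect the x,y vertices from an SVG path made of M/L commands.
    points = [tuple(map(float, pair.split(',')))
              for pair in re.findall(r'-?\d+(?:\.\d+)?,-?\d+(?:\.\d+)?', svg_path)]
    # A closed polygon repeats its first vertex at the end.
    if len(points) > 1 and points[0] == points[-1]:
        points.pop()
    return len(points)

# "M 31,29 L 34,76 L 82,16 L 31,29" closes on itself with
# three distinct vertices, so the shape is a triangle.
```

A real solver would also need curve commands (`C`, `A`) and checks for self-intersection; the point is just that the answer is recoverable from the path by step-by-step reasoning.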
[00:14:56] And it's multiple-choice, so it's easier to get an accuracy number. So it covers a wide variety of different tasks. On the left we have performance from really large language models; this is zero-shot Chain of Thought, with just the prompt "let's think step by step". GPT-4 has some potential contamination issues with BIG-Bench Hard, so maybe we can ignore that column. Vicuna, I think a few months ago, was state-of-the-art as an instruction-tuned LLaMA 13B model, and Orca is again a LLaMA 13B that's fine-tuned specifically on this explanation data, where you have instructions and then you have explanations from ChatGPT or GPT-4, and you fine-tune on that. And we see that overall it outperforms ChatGPT, maybe because it's specialized to just these [00:16:09] reasoning problems, and it outperforms Vicuna, which was not trained on these really extensive explanations. So that's one way you can get a smaller language model to display some kind of reasoning behavior.

Okay, so this was all great, and we're very happy that you can just generate rationales from a big LM and then fine-tune a smaller language model on that. But then someone could ask: why not just fine-tune the big language model on its own rationales? That's also been explored, and there's a bunch of different methods that do this. I'm going to talk about one of them, called Reinforced Self-Training, or ReST, and it's going to alternate between two stages. In the first stage, given a reasoning problem, and perhaps the prompt "let's think step by step", I'm going to have the language model generate multiple rationales, [00:17:05] and then I'm going to filter these rationales based on whether they give me the correct answer or not. So think about word algebra problems: someone has three apples, someone else has four apples. You generate a rationale and the answer comes out to be seven, you keep that rationale; the answer comes out to be 12, you leave that rationale out. And then I'm going to do an update step, where I take the rationales that I filtered in my first stage and fine-tune the language model on them. And then I can do this iteratively: now I have an updated language model, I can hopefully get better rationales, and then I can update the language model on those better rationales to get an even better language model, and I can keep doing that. Okay, and the results are promising, but here's what we find on GSM8K, which is this grade-school math dataset of algebraic word problems.
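The two-stage alternation just described can be sketched in a few lines. This is a schematic of the idea, not any particular implementation; `generate` and `fine_tune` are hypothetical stand-ins for sampling from and updating the model.

```python
def filter_rationales(samples, gold_answer):
    # Stage 1: keep only rationales whose final answer is correct.
    return [rationale for rationale, answer in samples if answer == gold_answer]

def rest_iteration(problems, generate, fine_tune):
    # One round of the alternation: sample, filter, then update the model.
    # `generate(problem)` returns (rationale, answer) pairs from the current
    # model; `fine_tune(data)` returns an updated model. Both are stand-ins.
    kept = []
    for problem, gold in problems:
        for rationale in filter_rationales(generate(problem), gold):
            kept.append((problem, rationale))
    return fine_tune(kept)
```

Iterating means calling `rest_iteration` again with the updated model behind `generate`, which is where the "hopefully better rationales each round" behavior comes from.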
[00:18:05] As you increase the number of iterations of self-training, we see a slight improvement in performance, and then it starts degrading. MATH is another dataset that again focuses on multi-step reasoning, covering math problems, and on this dataset we see that as we do more iterations of this Reinforced Self-Training paradigm, we see an improvement in accuracy. The numbers in orange here are a much larger PaLM model, the numbers in blue are a smaller model, and the dashed lines represent what you get if you did supervised fine-tuning on human-provided rationales. So one of the promising things about this approach is that when you do multiple iterations of training on your own rationales, you can outperform human-generated rationales. And that is exemplified again in [00:19:17] this graph. So, sorry: blue is if you fine-tune the PaLM model on all human-provided rationales; orange is if you fine-tune on one rationale per training example, and these are written by humans; in green is what you get if you fine-tune on one rationale chosen at random per question, which is generated by the model, so it's controlling for the number of rationales. And we see that it outperforms human-provided rationales. And then if you do the full multi-step iterative procedure, where you keep improving the model, we see again a boost in performance. So that's super promising.

[00:20:21] But let's start revisiting the question that we asked in the beginning, about reasoning in language models. One way of answering that question is: we can apply all these methods and look at benchmarks. But maybe the way to answer the question correctly is to be more systematic: come up with counterfactual tasks, and be very careful about possible data contamination. And I'm going to show some results around that.

So, we started the lecture with Chain of Thought, and maybe the first question to ask is: are the rationales that the model produces with Chain of Thought faithful? What I mean by faithful is: maybe the model produces some rationale and then it produces an answer, and maybe the answer does not even depend on the rationale that it produced. [00:21:21] So maybe the question was, you know, Tom has three apples and Jerry has four apples, and the rationale it produced was: okay, Tom has three apples, Jerry has four, 3 + 4 is seven, so the answer is 25. In a case like that, you'd say the model was not faithful to its rationale.

And so what we see in this plot is a very careful experiment, where on the x-axis we have the number of reasoning samples. So the setup is something like this: for every question, the model produces a rationale, and a rationale here is multiple sentences. And what we're going to do is force the model to early-exit from its rationalization and just force it to produce an answer. So if it produced four rationale sentences, I can early-exit right after the first one and ask it to produce an answer, I can exit after the second one and ask it to produce an answer, and so on. And what I'm going to plot on the y-axis is the model's accuracy after early-exiting [00:22:23] in this procedure. So let's say that I early-exited after just one rationale sentence, and the model produced exactly the same answer that it would if it had seen all four sentences in its rationale; then maybe we can conclude that the reasoning is not faithful, like it doesn't matter whether the model sees the full rationale or just the first sentence. And if you take that to the extreme, maybe you terminate it even without any rationale, and it produces the same answer. So the results here are somewhat mixed, but we see that there are enough datasets where it doesn't matter whether the model sees the full rationale before answering or you early-exit, you kind of get the same answer, which means that sometimes these rationales may be post-hoc explanations of the model's answer.
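The early-exit probe can be written down as a small measurement harness. This is a sketch of the experimental logic only; `answer_fn` stands in for prompting the model with a (possibly truncated) rationale and reading off its answer.

```python
def early_exit_answers(sentences, answer_fn):
    # Ask for an answer after each truncation of the rationale:
    # k = 0 means no rationale at all, k = len(sentences) means the full one.
    return [answer_fn(sentences[:k]) for k in range(len(sentences) + 1)]

def is_post_hoc(sentences, answer_fn):
    # If the answer with no rationale already equals the full-rationale
    # answer, the rationale did not influence the answer.
    answers = early_exit_answers(sentences, answer_fn)
    return answers[0] == answers[-1]
```

With a model whose answer genuinely depends on the later sentences, `is_post_hoc` comes out false; with one that has already committed to its answer, it comes out true regardless of the rationale.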
answer this exact same question [00:23:24] tries to answer this exact same question is uh you can take these rationals and [00:23:27] is uh you can take these rationals and then you can start corrupting them [00:23:29] then you can start corrupting them so maybe your rational was length four [00:23:32] so maybe your rational was length four and then I generate the first rational [00:23:34] and then I generate the first rational the second rational and for the third [00:23:35] the second rational and for the third rational I just corrupt it okay and then [00:23:38] rational I just corrupt it okay and then uh the fourth rational and then I asked [00:23:40] uh the fourth rational and then I asked the model to generate my answer if it [00:23:41] the model to generate my answer if it turns out that no matter how much I [00:23:43] turns out that no matter how much I corrupt my rational the model produces [00:23:46] corrupt my rational the model produces the same answer then I can conclude that [00:23:50] the same answer then I can conclude that again the answer kind of did not depend [00:23:51] again the answer kind of did not depend on my [00:23:52] on my rational so on the x-axis uh we are [00:23:56] rational so on the x-axis uh we are looking at the number number of re the [00:23:59] looking at the number number of re the percentage of reasoning steps before uh [00:24:02] percentage of reasoning steps before uh I add sort of a mistake in the rational [00:24:05] I add sort of a mistake in the rational okay so what you should see is kind of a [00:24:08] okay so what you should see is kind of a strictly increasing uh increasing sort [00:24:11] strictly increasing uh increasing sort of trend where if I add a mistake after [00:24:14] of trend where if I add a mistake after the very first step then that's probably [00:24:17] the very first step then that's probably going to change the answer a lot and [00:24:19] going to change the answer a lot and then if I add a mistake 
after the last [00:24:21] then if I add a mistake after the last step that maybe doesn't change the [00:24:22] step that maybe doesn't change the answer all that much but again we find [00:24:24] answer all that much but again we find that for some data sets uh it so happens [00:24:28] that for some data sets uh it so happens that you know you can add a mistake in [00:24:31] that you know you can add a mistake in the first sentence in your rationale and [00:24:32] the first sentence in your rationale and the answer is not going to change all [00:24:34] the answer is not going to change all that much and so that's also kind of an [00:24:37] that much and so that's also kind of an indicator that maybe these rationals are [00:24:39] indicator that maybe these rationals are sort of post talk explanations of the [00:24:41] sort of post talk explanations of the model's [00:24:42] model's behavior um so yeah so there's a lot of [00:24:46] behavior um so yeah so there's a lot of lines here so if anyone has questions uh [00:24:49] lines here so if anyone has questions uh see a few blank faces in the audience [00:24:59] okay so let's let's uh let's keep moving [00:25:01] okay so let's let's uh let's keep moving um okay so that's that was about like [00:25:04] um okay so that's that was about like whether uh the models where sort of [00:25:07] whether uh the models where sort of Chain of Thought expresses kind of a [00:25:08] Chain of Thought expresses kind of a reasoning that the model is faithful to [00:25:11] reasoning that the model is faithful to uh another question you could ask is [00:25:14] uh another question you could ask is what if I changed my setting a little [00:25:16] what if I changed my setting a little bit right so my model let's say I [00:25:18] bit right so my model let's say I observe that it's able to do arithmetic [00:25:21] observe that it's able to do arithmetic in base 10 so it's able to answer [00:25:23] in base 10 so it's able to answer something 
like 12 + 14 uh does that mean [00:25:27] something like 12 + 14 uh does that mean that my model knows how to do it [00:25:28] that my model knows how to do it arithmetic or maybe there was just this [00:25:30] arithmetic or maybe there was just this exact same um you know example was [00:25:34] exact same um you know example was present in the training data so one way [00:25:36] present in the training data so one way you could test for this is by creating [00:25:39] you could test for this is by creating counterfactuals which uh based on our [00:25:41] counterfactuals which uh based on our understanding of the data you expect uh [00:25:43] understanding of the data you expect uh to not be present that frequently in the [00:25:45] to not be present that frequently in the training [00:25:46] training data so instead of doing base 10 [00:25:49] data so instead of doing base 10 addition you could do addition in base 9 [00:25:52] addition you could do addition in base 9 and then if the model has the same [00:25:54] and then if the model has the same accuracy in base 9 then you can conclude [00:25:56] accuracy in base 9 then you can conclude that maybe this model has under OD how [00:25:58] that maybe this model has under OD how to do [00:25:59] to do addition similarly for logic uh maybe uh [00:26:04] addition similarly for logic uh maybe uh the reason why the model is so good at [00:26:06] the reason why the model is so good at solving logic problems is because it's [00:26:08] solving logic problems is because it's seen something very similar in its [00:26:10] seen something very similar in its training data so what if I construct a [00:26:12] training data so what if I construct a world where I don't know corgis are [00:26:15] world where I don't know corgis are reptiles can it still do this logic [00:26:20] reptiles can it still do this logic problem okay and so what we find is uh [00:26:25] problem okay and so what we find is uh there is a you know sometimes a 
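The base-9 counterfactual is easy to make precise: the same addition procedure, just carried out in a different base. A small reference implementation for generating gold answers might look like this (the helper names are my own):

```python
def to_base(n: int, base: int) -> str:
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return ''.join(reversed(digits)) or '0'

def add_in_base(a: str, b: str, base: int) -> str:
    # Interpret both operands in `base`, add, and render in the same base.
    return to_base(int(a, base) + int(b, base), base)

# In base 10, 12 + 14 = 26. In base 9, "15" + "14" is 14 + 13 = 27
# in decimal, which is written "30" in base 9.
```

A model that has actually learned the carrying algorithm should transfer to base 9; a model that has memorized base-10 sums should not.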
[00:26:27] And what we find is, there is sometimes a pretty significant drop when you move from... oh, there's a question: "Sorry, why is base 9 counterfactual and base 10 isn't?" So it's a counterfactual, excuse me, in the sense that the authors comment that base-10 addition is frequently observed in training data, but very few people do base-9 addition, so there are going to be much fewer examples of it in the training data. "So it's more out of distribution, right?" Yeah, you can also call it out of distribution, for sure.

And so from results like this, what we see is that there's this drop in performance even for very simple logic problems that don't involve multiple steps of reasoning, a pretty significant drop in performance, which maybe suggests that there's not that much reasoning, there's more memorization.

[00:27:35] So we could keep going with this paradigm of changing the problem setting so that it starts looking out of distribution to the training corpus, and this is exactly what was done in this paper that looked at analogical reasoning. Basically, the setup is something like this: I'm going to show certain examples of string transformations, and I'm going to ask the model to generalize to new examples. So in this extend-sequence problem, I have "abcd" and the output is "abcde", and then, given "ijkl", the model has to produce "ijklm", and so on. Now, the way you can make this into a counterfactual, or something that is out of distribution, is: maybe you can change what the extend-sequence task is. So instead of outputting "abcde", maybe the model has to output "abcdf". So instead of outputting the next character, it [00:28:42] has to output one more, the second character after the next, and so on. The other kind of counterfactual you could add is, instead of operating on the standard alphabet, you could modify the alphabet completely: instead of the alphabet being a, b, c, ..., maybe you start at x, y, and so on. So we find two things. The first thing we find is that there's a significant drop in performance as we go from the standard analogical reasoning problem to one of these counterfactuals, where we either change the alphabet or change the description of the task so that it becomes slightly unnatural. On the other hand, the authors also did this exact same experiment on human subjects, where they find very little decrease in performance. So overall, what this result suggests is that maybe there's some reasoning.
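To make the extend-sequence setup and its two counterfactual knobs concrete, here is a small generator for the target outputs. This reconstructs the task as described in the lecture; the parameter names and the particular shifted alphabet are illustrative choices.

```python
import string

def extend_sequence(seq: str, alphabet: str = string.ascii_lowercase,
                    step: int = 1) -> str:
    # Append the letter `step` positions after the last one,
    # looked up in whatever alphabet is in force.
    return seq + alphabet[alphabet.index(seq[-1]) + step]

# A counterfactual alphabet that starts at x: "xyzabc...w".
SHIFTED = string.ascii_lowercase[-3:] + string.ascii_lowercase[:-3]

# standard task:           extend_sequence("abcd")         -> "abcde"
# counterfactual rule:     extend_sequence("abcd", step=2) -> "abcdf"
# counterfactual alphabet: extend_sequence("xyz", SHIFTED) -> "xyza"
```

The rule is trivial to state, which is what makes the human-versus-model comparison sharp: changing `step` or `alphabet` barely affects people, but moves the model off its training distribution.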
[00:29:47] Maybe there's some memorization too, but there's nothing systematic. So, you know, again, this is all evolving, so maybe someone will find that if you change your prompt a little bit, now models can do reasoning, but this is kind of the current lay of the land.

[00:30:47] Okay, so that was the reasoning module of the lecture. I'm going to now switch gears and talk about language model agents. And this is related to reasoning in the sense that reasoning involves multi-step inferences, where, given some facts, you have to arrive at completely new conclusions. With agents, what we'll see is that there's some high-level objective the model has to accomplish, and it has to reason about postconditions, object affordances, and uncertainty in the world to carry out a sequence of steps.

So let's start with some terminology. We have our agent on the right; that's going to be some neural network. And then we have an environment, and I'll give some examples of what these environments could be. The agent receives an observation from its environment, and based on the observation it issues an action. And along with that, it receives this second variable g, and g represents a language instruction. There are many names for this setting and these models: digital agent, language-conditioned policy, or instruction-following agent.

Some examples of environments: maybe it's a web browser, and in a browsing environment the objective is to book a flight from San Francisco to New York, and the observation could [00:32:00] either be the raw pixels that the model sees, or it could be the HTML DOM representation. And the action space, if you're looking at these web environments, could be typing on specific web elements, clicking on web elements, moving your mouse to a certain web element to interact with it, and so on. And there's a vast number of applications; I don't think I can cover all of them, but we can look at some. There are obviously digital assistants; I'm not going to say the names, because I know people's phones might start popping up, but you can give them natural language commands, like set an alarm, set reminders, and so on. You could also do natural language programming, where, given natural language descriptions, you get a model to write Python code.
another example of [00:33:03] write python code another example of this could be UI [00:33:05] this could be UI automation where maybe you want to do [00:33:08] automation where maybe you want to do automated testing of of UI elements and [00:33:11] automated testing of of UI elements and so instead of having a human sort of [00:33:13] so instead of having a human sort of verify whether uh a UI UI element Works [00:33:17] verify whether uh a UI UI element Works maybe you can get a model to execute [00:33:19] maybe you can get a model to execute actions corresponding to a given [00:33:21] actions corresponding to a given instruction or it could be something [00:33:23] instruction or it could be something more sort of user facing where uh you [00:33:26] more sort of user facing where uh you know given some kind of complex [00:33:29] know given some kind of complex environment like Spotify you could ask [00:33:31] environment like Spotify you could ask an agent to play some [00:33:34] an agent to play some songs and then finally uh there is this [00:33:37] songs and then finally uh there is this sort of emerging [00:33:38] sort of emerging application where we want to add [00:33:41] application where we want to add additional tools um or plugins to [00:33:45] additional tools um or plugins to language models so that they can control [00:33:48] language models so that they can control various different [00:33:49] various different applications [00:33:52] applications um okay so uh before we look at how we [00:33:55] um okay so uh before we look at how we can use language models to do [00:33:57] can use language models to do instruction following I think it's very [00:33:59] instruction following I think it's very helpful to look at how this was done [00:34:01] helpful to look at how this was done before language [00:34:02] before language models um so uh there were basically [00:34:06] models um so uh there were basically three main [00:34:07] three main ideas uh 
sometimes uh the the the right [00:34:11] ideas uh sometimes uh the the the right thing to do was uh collect examples of [00:34:16] thing to do was uh collect examples of utterances paired with uh logical forms [00:34:20] utterances paired with uh logical forms so logical forms uh could be some kind [00:34:23] so logical forms uh could be some kind of an executable representation that you [00:34:25] of an executable representation that you could just execute against either a [00:34:28] could just execute against either a knowledge graph or a database to get an [00:34:31] knowledge graph or a database to get an answer so maybe you have a query like [00:34:34] answer so maybe you have a query like what state botherers [00:34:36] what state botherers Texas and then there exists some sort of [00:34:39] Texas and then there exists some sort of program description that you could [00:34:41] program description that you could execute against uh a Knowledge Graph to [00:34:45] execute against uh a Knowledge Graph to get sort of an answer or a list [00:34:48] get sort of an answer or a list here um and and idea number one that [00:34:51] here um and and idea number one that people looked at was to treat this as [00:34:54] people looked at was to treat this as almost like machine translation right so [00:34:55] almost like machine translation right so you have uh [00:34:58] you have uh a source language which is sort of [00:35:01] a source language which is sort of English commands and then you have a [00:35:03] English commands and then you have a target language which is sort of these [00:35:06] target language which is sort of these uh these like meaning representations or [00:35:08] uh these like meaning representations or logical forms and then you could apply [00:35:10] logical forms and then you could apply the same Machinery from assignment 3 uh [00:35:13] the same Machinery from assignment 3 uh to build kind of a natural language [00:35:15] to build kind of a natural 
language interface here okay so you directly [00:35:17] interface here okay so you directly maximize the probability of a sequence [00:35:20] maximize the probability of a sequence of actions given a goal or a [00:35:24] of actions given a goal or a command idea number two was [00:35:28] command idea number two was um something a little bit more complex [00:35:30] um something a little bit more complex so here you have um instructions paired [00:35:36] so here you have um instructions paired with actions instead of directly mapping [00:35:38] with actions instead of directly mapping instructions to [00:35:40] instructions to actions uh I'm going to infer an [00:35:43] actions uh I'm going to infer an executable plan okay from these [00:35:47] executable plan okay from these instructions uh and action sequences and [00:35:50] instructions uh and action sequences and I'm going to train a model to go from [00:35:53] I'm going to train a model to go from instructions to these plans and then [00:35:56] instructions to these plans and then Define a very rich execution model [00:35:59] Define a very rich execution model that's going to directly execute these [00:36:01] that's going to directly execute these plans the advantage of this is uh maybe [00:36:04] plans the advantage of this is uh maybe there is more sort of highlevel uh [00:36:07] there is more sort of highlevel uh decisions you could encode in your plan [00:36:09] decisions you could encode in your plan which would be harder to like get into [00:36:12] which would be harder to like get into the model if you were to just train it [00:36:14] the model if you were to just train it uh to produce the action trajectories [00:36:16] uh to produce the action trajectories directly and I have an example of a [00:36:19] directly and I have an example of a system like that from [00:36:21] system like that from 2011 which uh was basically an agent [00:36:24] 2011 which uh was basically an agent that could navigate in um in 
a grounded environment. [00:36:28] And the idea was something like this: you took an instruction and obtained a plan, and then you would train a semantic parser, which is basically this kind of machine translation system that would convert commands into plans. And then once that's trained, at test time, given a completely new instruction, you would run the semantic parser, get the plan, and then execute it in this execution model. And I have an example of an instruction and a plan from this 2011 system.

[00:37:07] The third idea, which is probably the first one that comes to mind if you see a setting like this, is to use reinforcement learning directly. And what people did there was to use RL to directly map instructions into actions: I'm going to learn a policy that outputs actions that maximize some reward, which is conditioned on my natural language instruction and the observation. And this reward could be sparse, which is: I carry out the entire task and then my environment tells me if I achieved the task or not. Or it could be something that I obtain after each step: I take an action, and then the environment tells me if this action completed some percentage of my task or not. And on the top I've included an example of a system from 2009 that did this for automated Windows debugging: you have some natural language instruction to click some UI elements, and that gets mapped into an API command that the model executes, one after the other.

[00:38:24] Okay, so these were basically the three main ideas that people had before language models: you would either train semantic parsers, or you would
infer these plans from instruction-trajectory pairs and then learn to directly model plans, with an execution model that can execute those plans, or you would do reinforcement learning if you had a reward signal.

[00:38:49] So how do we do things in 2024? There are a few ways to think about this; I think maybe the most instructive is to think about what we're trying to achieve. We are trying to model trajectories, sequences of actions, conditioned on some goal: I want my model to book a flight from San Francisco to New York, and I want it to produce a trajectory of, say, typing and clicking actions. So let's look at how that factorizes. The probability of a trajectory conditioned on a goal or an instruction is just the probability of the state, action, next state, and so on, conditioned on the goal, and you could factorize that into two terms. The first term is the transition dynamics of the environment, and that's just: if I take a certain action in a given state, how is my state going to change? And the second object is the agent policy, which is: given my goal and the trajectory so far, what is the next action I should be taking?

[00:40:03] And then people quickly realized that you could just treat this as a generative problem: you could treat the problem of decision-making in environments as a generative trajectory modeling problem. And what I have in the top right is an example of a transformer that just takes the history of actions it's taken so far, the current state, and some indication of what task it should achieve, here based on reward, but it could be a natural language string.
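The factorization described here can be written out as follows (a sketch of the standard decomposition, with states $s_t$, actions $a_t$, and goal $g$; not a formula quoted from the slide):

```latex
p(\tau \mid g)
  = p(s_1, a_1, s_2, a_2, \ldots \mid g)
  = p(s_1) \prod_{t} \underbrace{p(s_{t+1} \mid s_t, a_t)}_{\text{transition dynamics}}
    \;\cdot\; \underbrace{\pi(a_t \mid g,\, s_{1:t},\, a_{1:t-1})}_{\text{agent policy}}
```

Note that the transition dynamics do not depend on the goal; only the policy term is conditioned on $g$, which is the part the language model is asked to play.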
And it's just trained to predict what the next action is, and you could just train an autoregressive language model to do this. And it turned out that this worked very well in an offline RL setting.

[00:40:51] Question: "Sorry, in the figure, why are we predicting one action?" So, no, no: you predict an action, execute that, append that to your trajectory, and then you predict the next action, and so on. "So we resolve three input tokens into one output token?" Yeah, okay, sounds good.

[00:41:19] Um, and it turned out that this worked really well, and so instead of getting these latent plans and training semantic parsers, or trying to do reinforcement learning, we started using language models as policies. And a simple way to do all of that is to prompt a language model in a loop.

[00:41:44] Okay, so we're going to specify the action space in text. This is a simple language model agent; this is not going to work at all, but it's probably illustrative of how agents can be built now. So you provide an action space in text: maybe it's a digital environment, and maybe it can click, maybe it can type characters, maybe it can move the mouse somewhere. You provide it an instruction, and you provide it the sequence of actions and observations it's received so far. And then, conditioned on all that, you ask it to predict the next action. And there's nothing deep going on here; this is just chain-of-thought prompting in a loop. But the hope is that, because we reduced the problem of decision making to just autoregressive modeling, this could work.
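A minimal sketch of that prompt-in-a-loop agent, assuming a hypothetical `llm(prompt) -> str` callable and a toy environment; the action names, prompt format, and `ToyEnv` are illustrative assumptions, not the lecture's actual system:

```python
# Sketch of "prompt a language model in a loop": observe, ask for the next
# action, execute, append to the history, repeat. All names here are
# hypothetical stand-ins for illustration.

ACTION_SPACE = "You can act with: type(text), click(element), move_mouse(element)."

def build_prompt(instruction, history, observation):
    """Assemble the prompt: action space, instruction, interaction history, current observation."""
    lines = [ACTION_SPACE, f"Instruction: {instruction}"]
    for obs, act in history:
        lines.append(f"Observation: {obs}")
        lines.append(f"Action: {act}")
    lines.append(f"Observation: {observation}")
    lines.append("Next action:")
    return "\n".join(lines)

def run_agent(llm, env, instruction, max_steps=10):
    """Chain-of-thought-style agent loop: predict an action, execute it, repeat."""
    history = []
    observation = env.reset()
    for _ in range(max_steps):
        action = llm(build_prompt(instruction, history, observation))
        history.append((observation, action))
        observation, done = env.step(action)
        if done:
            break
    return history

# Toy stand-ins so the loop can run end to end.
class ToyEnv:
    """A two-field 'booking form' that is done once both fields are filled."""
    def reset(self):
        self.filled = 0
        return "page with fields: origin, destination"
    def step(self, action):
        if action.startswith("type("):
            self.filled += 1
        return f"{self.filled} field(s) filled", self.filled >= 2

def fake_llm(prompt):
    # A scripted "model": fill the origin first, then the destination.
    last_obs = prompt.rsplit("Observation: ", 1)[1]
    return "type(San Francisco)" if "fields:" in last_obs else "type(New York)"
```

With the scripted `fake_llm` standing in for a real model, `run_agent` fills both fields and stops; swapping in a real language model call is the only change the loop itself would need.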
And indeed, a slightly more complex version of this can work in some environments.

[00:42:56] Okay, so now I'm going to give a little flavor of what different environments look like now for evaluating language models as agents. The simplest environment that people consider is MiniWoB. This is a sandbox environment that evaluates basic browser interactions: you know, maybe on a mini Twitter environment, can you get a language model to retweet a given tweet; given a simulated email client, can the model forward someone's email, can it compose an email, can it click on certain buttons or not. It's not at all real-world, so it's not real websites, and it's relatively short-horizon: given any instruction, most tasks can be accomplished in under three actions. But zero-shot performance of even the best language models is still far from perfect, even on this very simple benchmark.

[00:43:58] A second, slightly more real-world benchmark is WebArena. This is also a sandbox environment, but it's a pretty close approximation of real websites, spanning e-commerce (there is a website in WebArena that resembles Amazon), social media (something that resembles Twitter), and additionally utility tools like maps: an instruction could require a model to open up a map application, find the shortest path from point A to point B, and use that in its later sequence of actions. And there's multi-tab browsing, like we commonly do: with MiniWoB there's only one single tab, and with WebArena, I think this was the first environment that introduced this idea where you have multiple tabs and the agent can switch between tabs. And again, we're going to evaluate functional correctness, which is whether the model gave the correct answer at the end, whether the sequence of steps it took gave the intended behavior, as opposed to whether it took a sequence of steps that maybe a user had pre-programmed.

[00:45:19] Another popular environment, or rather dataset, is WebLINX. WebLINX also has multi-tab browsing, and it has web interactions on real websites: these are not sandboxed approximations of real websites, these are actual real websites. And it also introduced a new action where the agent can communicate with the user. So maybe there's some instruction, say to reserve, I don't know, a movie, or to buy a movie ticket or something, and then at some point the model has to request credit card information, and so there is this additional action where a human could be involved in communicating with the agent. And this is not an environment but just a collection of interactions, so you can't, for example, do any kind of exploration or online learning here, but you could definitely use it for evaluation.

[00:46:36] Okay, so this was just a taste of what some benchmarks look like for language model agents. So how are we going to train these models? Given that we're going to treat decision making as causal language modeling, we're not going to use any of the ideas from the pre-LM era. The standard practice is to do in-context learning with few-shot examples.
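The few-shot setup described here, where human demonstrations are pasted into the prompt before the new task, could be sketched like this; the text format and helper names are illustrative assumptions, not any specific system's API:

```python
# Sketch of few-shot in-context prompting for an agent: K human demonstrations
# are rendered as text, then the new task is appended up to "Action:".

def format_demo(instruction, trajectory):
    """Render one human demonstration (instruction + observation/action pairs) as text."""
    lines = [f"Instruction: {instruction}"]
    for obs, act in trajectory:
        lines.append(f"Observation: {obs}")
        lines.append(f"Action: {act}")
    return "\n".join(lines)

def few_shot_prompt(demos, new_instruction, new_observation):
    """Concatenate the demonstrations, then the new task, ending at 'Action:'."""
    parts = [format_demo(ins, traj) for ins, traj in demos]
    parts.append(f"Instruction: {new_instruction}\nObservation: {new_observation}\nAction:")
    return "\n\n".join(parts)

# One hypothetical human demonstration on a simulated email client.
demo = ("forward the email from Bob",
        [("inbox open", "click(email_from_bob)"), ("email open", "click(forward)")])
prompt = few_shot_prompt([demo], "compose an email to Alice", "inbox open")
```

The language model's completion of the final `Action:` line is then parsed as the next action, exactly as in the prompting loop shown earlier.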
uh and in the few short [00:46:58] examples uh and in the few short examples uh for typically for any new [00:47:01] examples uh for typically for any new kind of uh website or any new use case [00:47:05] kind of uh website or any new use case you're going to get humans to perform [00:47:06] you're going to get humans to perform those tasks and sort of feed that into [00:47:09] those tasks and sort of feed that into the language models prompt as in context [00:47:11] the language models prompt as in context demonstrations which it could then use [00:47:14] demonstrations which it could then use to solve um similar similar looking [00:47:17] to solve um similar similar looking tasks on very similar [00:47:19] tasks on very similar websites so obviously this is not [00:47:22] websites so obviously this is not scalable uh there's thousands of [00:47:24] scalable uh there's thousands of environments on some environments that [00:47:27] environments on some environments that like lots of different interactions that [00:47:28] like lots of different interactions that are possible and so maybe there's [00:47:31] are possible and so maybe there's something better that we can do than [00:47:33] something better that we can do than just U sort of getting humans to provide [00:47:36] just U sort of getting humans to provide demonstrations for every new use [00:47:39] demonstrations for every new use case um and so we going to use something [00:47:42] case um and so we going to use something we saw early on in the lecture okay [00:47:45] we saw early on in the lecture okay which was to kind of use the language [00:47:47] which was to kind of use the language model to generate rationals and then [00:47:50] model to generate rationals and then fine tune on that and here we don't have [00:47:52] fine tune on that and here we don't have rationals but we could produce action [00:47:54] rationals but we could produce action trajectories and then we're going to use [00:47:56] 
trajectories, and then we're going to use that as supervision. [00:48:00] Okay, so the way that looks is something like this. Let's say I have some environment, say a MiniWoB environment, and I get an agent to randomly explore it: it just executes a random sequence of clicks, types, and scrolling operations, and produces some trajectories.

[00:48:24] Now I'm going to take these trajectories and somehow filter them; that was the idea from earlier: you generate a bunch of different outputs and then filter them somehow. Here we're going to use a second language model, because we don't know what a good trajectory looks like. It's not like a math problem where you know the correct answer; we just had a language model interact with the website and generate trajectories, and we want to somehow filter out the good ones. So we're going to use a second model that produces a description of these trajectories, and the idea is that if you can get a model to produce a description of what the sequence of actions corresponds to, then maybe that's a good enough signal for a good trajectory. [00:49:13] So maybe given the first trajectory it guesses that the instruction was to book a flight from San Francisco to New York; for the second trajectory it says to set the date to some given date; and maybe it wasn't able to come up with any good instruction for the third trajectory.

[00:49:35] Then we're going to do something we saw earlier, which is to do this iteratively. Now we have a goal that we inferred for a trajectory, and I'm going to get the language model to condition its behavior on this goal. The goal is to set the date to some given date, and now, instead of doing random exploration, the model produces a sequence of actions that corresponds better to some natural-language instruction. So it produced a trajectory based on that instruction. [00:50:14] Then I'm going to use a coarse filter that just looks at correspondences between the instruction and the sequence of actions and states the language model visited, and uses that to decide whether the trajectory was a good trajectory for the instruction. In this case, given the instruction, this seems like a pretty good trajectory for completing the task, so we add it to a set of examples. But maybe sometimes things are not so good. For that second instruction, the generated label was to
book a flight from San Francisco to New York. [00:50:58] Let's say we run that again through the language model and it produces a second trajectory, and clearly this does not look like a successful trajectory for booking a flight. So what do we do here? We could throw away this interaction, but interactions are pretty costly; if you're looking at real websites, each interaction could take a few milliseconds, so maybe we don't want to throw it away. [00:51:27] What we do instead is again invoke the relabeler: take the trajectory and assign it a new label. The model was not successful at accomplishing the task it set out to do, but it accomplished something, and we come up with a best guess of what that was using the second language model. It might say that the instruction you accomplished instead was to set the origin to SFO and the destination to New York City. [00:51:54] That gets fed back into the language model, and we keep doing this iteratively until our filter says this is a good instruction-trajectory pair. So we have the same idea of using a language model to generate outputs, plus some iterative procedure that gives us a good set of training examples.

[00:52:17] Overall, the method looks like this: you have some environment; we use an unconditioned language model to randomly explore it and generate a set of trajectories; and then we convert those trajectories into synthetic training data by iteratively converting trajectories into natural-language descriptions, and then converting the descriptions into even better trajectories, and so on. Once we have this collection of synthetic examples there are two things we could do. We could fine-tune on this data, but the simplest thing is to repeat the earlier paradigm: replace the human-provided in-context demonstrations with these synthetic demonstrations. [00:53:11] We find a reasonable boost in performance, a 13-point improvement on the MiniWoB benchmark; again, even though MiniWoB is very simple, zero-shot performance for even the best language models is far from perfect. We also see an improvement on a second, multi-step tool-use environment.
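The explore, describe (relabel), and filter loop just described can be sketched in a few lines. Everything below is a hypothetical stand-in, since the lecture does not give implementations: `random_policy` and `conditioned_policy` represent the acting language model, `describe` is the second (relabeling) model, and `is_good_pair` is the coarse filter.

```python
def collect_synthetic_demos(env_reset, random_policy, conditioned_policy,
                            describe, is_good_pair, n_rollouts=3, max_relabels=3):
    """Explore, describe (relabel), and filter trajectories into
    (instruction, trajectory) pairs, mirroring the loop in the lecture."""
    demos = []
    for _ in range(n_rollouts):
        # 1. Unconditioned exploration: a random sequence of clicks/types/scrolls.
        traj = random_policy(env_reset())
        # 2. A second LM guesses which instruction the trajectory accomplishes.
        instruction = describe(traj)
        for _ in range(max_relabels):
            if instruction is None:        # no plausible description: give up
                break
            # 3. Coarse filter: does the trajectory match the instruction?
            if is_good_pair(instruction, traj):
                demos.append((instruction, traj))
                break
            # 4. Otherwise act conditioned on the goal, then relabel the result.
            traj = conditioned_policy(env_reset(), instruction)
            instruction = describe(traj)
    return demos

# Toy stand-ins: a "good" trajectory is one that contains a submit action.
demos = collect_synthetic_demos(
    env_reset=lambda: None,
    random_policy=lambda env: ("click", "type"),
    conditioned_policy=lambda env, goal: ("click", "type", "submit"),
    describe=lambda traj: "book a flight",
    is_good_pair=lambda goal, traj: "submit" in traj,
    n_rollouts=1,
)
```

Here the random rollout fails the filter, so the agent re-acts conditioned on the guessed goal, and the second, goal-conditioned trajectory is accepted.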
[00:53:36] So far we've only looked at text, but for real-world applications it can be intractable to obtain the HTML for every environment and feed it into the language model's context. Sometimes there are tens of thousands of DOM elements plus the corresponding JavaScript, and inputting all of that into the context could be intractable, and maybe it's also not the best way to show the state of the environment. Maybe the best way is to directly show the pixels corresponding to the environment. So now we're going to look at some examples of vision-language models that people have used for building these agents.

[00:54:22] The first one we'll look at is LLaVA. The idea here is similar to Orca, which we looked at in the reasoning half of the lecture: we're going to use GPT-4 to generate, this time, both instructions and responses for textual descriptions of images. So maybe there's an image, and we use the metadata corresponding to that image to come up with a textual description, feed that into GPT-4, and ask it to generate possible questions and responses. [00:55:06] Then we jointly fine-tune an image encoder (here CLIP) along with a text decoder (here Vicuna, a LLaMA model that is instruction-tuned). Through this joint fine-tuning we end up with a model that can output language responses about images, so we can ask questions about images, and maybe use that to directly input screenshots instead of HTML DOM elements.
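The LLaVA-style data-generation step (rendering an image as text so a text-only model can invent instruction/response pairs) can be sketched as follows. The prompt wording and box format are illustrative assumptions, not LLaVA's actual prompts:

```python
def build_instruction_prompt(caption, boxes):
    """Render an image as text (caption + object bounding boxes) so a
    text-only LLM can invent question/answer pairs about it."""
    lines = [caption] + [
        f"{name}: ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"
        for name, (x0, y0, x1, y1) in boxes.items()
    ]
    return ("You are looking at an image described below.\n"
            + "\n".join(lines) + "\n"
            + "Write one question a user might ask about this image, then "
              "answer it. Format: Q: ... A: ...")

def parse_qa(reply):
    """Split a 'Q: ... A: ...' reply into an (instruction, response) pair."""
    question, _, answer = reply.partition("A:")
    return question.replace("Q:", "", 1).strip(), answer.strip()

# Example round trip with a hand-written model reply.
prompt = build_instruction_prompt("a dog on a beach", {"dog": (0.1, 0.2, 0.5, 0.9)})
pair = parse_qa("Q: What animal is shown? A: A dog on a beach.")
```

The parsed pairs would then become (instruction, response) fine-tuning examples for the joint image-text model.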
A second approach that built joint image-language models, which people later adapted for agents, was Pix2Struct. The idea is again very similar: there's an image encoder and a text decoder. The image encoder takes the image, converts it into patches, assigns each patch a position ID, and runs that through a Transformer; then a decoder decodes out some text. [00:56:17] One of the new things Pix2Struct introduced was a new pre-training task. For LLaVA the pre-training was fairly simple: use GPT-4 to generate synthetic questions and responses based on textual descriptions of images. But there's only so far you can go with textual descriptions. What Pix2Struct did was look at screenshots from websites, mask out parts of the screenshot, and ask the Transformer decoder to produce the HTML corresponding to the masked-out elements.
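Constructing one such masked-screenshot pre-training pair can be sketched like this. The data layout (a list of `(bbox, html)` elements) is an assumption for illustration; actually rendering and blanking the screenshot pixels is out of scope:

```python
def make_masked_example(elements, mask_index):
    """One Pix2Struct-style pre-training pair: blank out one element's region
    in the screenshot and use its HTML as the decoder target. `elements` is
    a list of (bbox, html_snippet) pairs from the rendered page."""
    masked_bbox, target_html = elements[mask_index]
    return {
        "mask_region": masked_bbox,    # pixels to blank out in the input image
        "visible_html": [html for i, (_, html) in enumerate(elements)
                         if i != mask_index],
        "target": target_html,         # what the decoder must reconstruct
    }

example = make_masked_example(
    [((0, 0, 100, 20), "<li>Python</li>"), ((0, 20, 100, 40), "<li>Go</li>")],
    mask_index=0,
)
```

The model sees the screenshot with one region blanked out and must decode the HTML for exactly that region, which forces it to relate pixels to structure.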
[00:56:56] So here there is a list with corresponding HTML, and one of the data points in Pix2Struct looks something like this: you might mask out the first answer, the one corresponding to Python, and ask the model to produce the HTML corresponding to just the patch that was masked out. This seems like a more natural pre-training objective that can maybe produce better interactions between image and text, and it was also adapted for building these multimodal agents.

[00:57:35] Okay, at this point I just want to highlight that this is really an emerging application. There's a huge "prompting gap", as I like to call it: if you do not do extensive prompting, and if you do not use bespoke few-shot examples, where every different environment gets its own set of few-shot examples, even the best language models are very far from perfect, even on very simple tasks like MiniWoB, where the goal is just to click on certain elements, or to respond to someone's email, which in MiniWoB takes about five actions.

[00:58:17] And even for something as simple as MiniWoB, even after extensive prompting with few-shot examples, there is a drop in performance as you go from the simplest tasks, which map an instruction to a single action, to tasks that map an instruction to maybe five or ten actions. So long-horizon planning is still very hard, even on these very simple benchmarks. [00:58:46] And if you look at something more complex like WebArena, which tries to approximate real websites, with multi-tab browsing and external tools the model can use, there's just a huge difference between human-level task success rates and what the best models get, even after prompting, even with few-shot examples.

[00:59:12] The kinds of errors models make are also pretty weird. In one of the examples from WebLINX, the task was just to open Google Translate and sign in using given credentials, an email and a password. What GPT-4V did was, instead of typing in the password, it typed the email into the password field, and it just couldn't recover from this error: it tried to sign in, there was an error, it typed the email again, and so on. I'm sure with extensive prompting you can fix this, and maybe that's beside the point. [00:59:56] In a different example, the model had to issue a search, and instead of issuing the search with the correct term, it repeated the same term three times, which obviously is not going to return any results. So there's a lot of room for improvement, and lots to be done in this space.

[01:00:25] Okay, so I'm going to recap and take any questions. We looked at two different things today. First, reasoning in language models: we saw there are a few ways to get reasoning-like behavior. You can prompt the models in various ways; the simplest example is chain-of-thought prompting. You can also do chain-of-thought prompting but generate multiple rationales, try to reconcile them, and pick the answer that was most frequent.
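That "generate multiple rationales and pick the most frequent answer" recipe (self-consistency) is simple enough to sketch directly; `sample_rationale` stands in for a stochastic LLM call:

```python
from collections import Counter

def self_consistency(sample_rationale, question, n_samples=5):
    """Chain-of-thought with self-consistency: sample several rationales,
    keep each one's final answer, and return the most frequent answer.
    `sample_rationale(question)` is a stand-in for a stochastic LLM call
    returning a (rationale, answer) pair."""
    answers = [sample_rationale(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stub that replays three "sampled" rationales.
_samples = iter([("2+2=4", "4"), ("2+2=5", "5"), ("two plus two is four", "4")])
majority = self_consistency(lambda q: next(_samples), "What is 2+2?", n_samples=3)
```

Note that only the final answers are reconciled; the rationales themselves can disagree as long as they converge on the same answer.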
[01:01:00] You can also do problem decomposition in your prompt: ask the model to explicitly decompose a problem into multiple steps before answering. That was all prompting. You could also train specialized small language models for reasoning by generating rationales from a big language model and then fine-tuning a smaller language model on those rationales. Or, instead of fine-tuning a smaller model on rationales from a big one, you could fine-tune the big language model on its own rationales and keep doing this iteratively; we saw that with multiple iterations performance can keep improving, and can even outperform human-provided rationales.
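That iterative "fine-tune on your own correct rationales" loop can be sketched as below; `generate` and `finetune` are hypothetical stand-ins for real LM sampling and training:

```python
def self_improve(model, problems, answers, generate, finetune,
                 n_iters=2, k=4):
    """Iterative self-training on a model's own rationales: sample up to k
    rationales per problem, keep those whose final answer matches the gold
    answer, fine-tune on the kept (problem, rationale) pairs, and repeat.
    `generate(model, problem)` -> (rationale, answer); `finetune` returns
    an updated model; both are stand-ins for real LM calls."""
    for _ in range(n_iters):
        kept = []
        for problem, gold in zip(problems, answers):
            for _ in range(k):
                rationale, answer = generate(model, problem)
                if answer == gold:           # keep only correct rationales
                    kept.append((problem, rationale))
                    break
        model = finetune(model, kept)
    return model

# Toy stand-ins: "fine-tuning" just counts how many pairs were kept in total.
final = self_improve(
    model=0,
    problems=[1, 2],
    answers=[2, 4],
    generate=lambda m, p: ("doubling", p * 2),
    finetune=lambda m, data: m + len(data),
)
```

The filter is the gold answer here, which is why this works for math-style tasks; the agent setting earlier had to substitute a second language model for that filter.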
[01:01:48] But on the flip side, we saw that while there are some initial reasons to be optimistic, if we do counterfactual evaluation it's not clear whether the models are good because they're reasoning, or good because all of these problems were already in the training data in some shape or form.

[01:02:11] In the second part we looked at language-model agents. We talked about the historical perspective through which people built grounded agents, and then we saw that you can recast the problem of decision making as causal language modeling. We looked at various ways people have modeled decision making with language models, most of which involve prompting and in-context learning, and then, similar to what we saw in the first module, we looked at a method for generating synthetic demonstrations, here using exploration and the same kind of iterative relabeling. [01:02:57] Most of the language models we looked at today were text-only, but we saw some examples of language models that can take both text and visual input. And we saw that the benchmarks are very challenging: models make trivial mistakes, there's a huge gap between human performance and where models are, and a lot of room for driving further improvement. Maybe some of you are doing that for your projects. Thank you.

[01:03:34] [Applause]

================================================================================ LECTURE 016 ================================================================================ Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert Source: https://www.youtube.com/watch?v=dnF463_Ar9I --- Transcript

[00:00:05] Okay, well, welcome back to CS224N. It's welcome back for me to CS224N too, since I was traveling for a couple of weeks; I hope everything went smoothly in the meantime.
so today I'm [00:00:20] smoothly in the meantime um so today I'm delighted to introduce our first invited [00:00:22] delighted to introduce our first invited speaker Nathan Lambert um so Nathan um [00:00:26] speaker Nathan Lambert um so Nathan um did his PhD at UC Berkeley so you're [00:00:29] did his PhD at UC Berkeley so you're allowed Boo and hiss for that [00:00:32] allowed Boo and hiss for that but um but um since then um he worked [00:00:38] but um but um since then um he worked first for a couple of years at hugging [00:00:40] first for a couple of years at hugging face and now he's working at ai2 the [00:00:44] face and now he's working at ai2 the Allen instit the Allen Institute for [00:00:46] Allen instit the Allen Institute for artificial intelligence um in Seattle um [00:00:50] artificial intelligence um in Seattle um so Nathan um comes from a background in [00:00:54] so Nathan um comes from a background in reinforcement learning like quite a few [00:00:56] reinforcement learning like quite a few other people who are now applying [00:00:57] other people who are now applying reinforcement learning to language [00:00:59] reinforcement learning to language models he had an early background [00:01:01] models he had an early background applying reinforcement learning to [00:01:03] applying reinforcement learning to robots but it turns out it's more fun to [00:01:05] robots but it turns out it's more fun to do it with language models um um no it's [00:01:08] do it with language models um um no it's not um okay um but anyway I mean he's [00:01:12] not um okay um but anyway I mean he's been very influential in both developing [00:01:16] been very influential in both developing ideas as to how to do posttraining with [00:01:19] ideas as to how to do posttraining with rhf and other ideas that come since then [00:01:23] rhf and other ideas that come since then including DPO that he'll definitely [00:01:25] including DPO that he'll definitely mention in today's 
talk um and so he's [00:01:28] mention in today's talk um and so he's one of the so best experts on the [00:01:31] one of the so best experts on the posttraining um phase of language model [00:01:35] posttraining um phase of language model development which has just proven as [00:01:37] development which has just proven as time is passed by that more and more of [00:01:39] time is passed by that more and more of the action of the large language model [00:01:41] the action of the large language model companies is happening not in the the [00:01:44] companies is happening not in the the initial um pre-training language model [00:01:46] initial um pre-training language model training phase but this subsequent [00:01:48] training phase but this subsequent posttraining phase and Nathan will have [00:01:50] posttraining phase and Nathan will have a lot to say about that today thanks a [00:01:52] a lot to say about that today thanks a lot for coming to do this yeah thanks [00:01:54] lot for coming to do this yeah thanks for the wonderful intro um you can see [00:01:57] for the wonderful intro um you can see my talk is life after DPO which is a [00:01:59] my talk is life after DPO which is a little bit of a unclear title so I [00:02:01] little bit of a unclear title so I apologize about this but it's trying to [00:02:03] apologize about this but it's trying to capture like what is the moment that [00:02:05] capture like what is the moment that we're at in alignment and Alignment [00:02:07] we're at in alignment and Alignment research and really DPO is the paper the [00:02:10] research and really DPO is the paper the story of last year which is this paper [00:02:12] story of last year which is this paper that came out and I'll get to the math [00:02:14] that came out and I'll get to the math and now a lot more people are interested [00:02:16] and now a lot more people are interested in able to do alignment and it's [00:02:17] in able to do alignment and it's building on 
from there so it's like what [00:02:19] building on from there so it's like what what are we going to be interested in [00:02:21] what are we going to be interested in after DPO and a tidbit talking with [00:02:23] after DPO and a tidbit talking with Chris that isn't explicitly in my slides [00:02:26] Chris that isn't explicitly in my slides is like what we're trying to close and [00:02:29] is like what we're trying to close and the labs like meta and people with the [00:02:31] the labs like meta and people with the amount of data that they're using for [00:02:32] amount of data that they're using for this kind of post [00:02:34] this kind of post training um fine-tuning there's all [00:02:36] training um fine-tuning there's all these words all defined is so big that [00:02:39] these words all defined is so big that like the amount of data points that meta [00:02:40] like the amount of data points that meta bought in llama 2 from one of these [00:02:43] bought in llama 2 from one of these providers is much more data than all of [00:02:45] providers is much more data than all of the data that's been collected on [00:02:46] the data that's been collected on chatbot arena for mmis so chatbot Arena [00:02:49] chatbot arena for mmis so chatbot Arena has like 800,000 data points that have [00:02:51] has like 800,000 data points that have been collected and metat 2's paper says [00:02:53] been collected and metat 2's paper says they bought about 1.5 million [00:02:55] they bought about 1.5 million comparisons and these are years outdated [00:02:57] comparisons and these are years outdated and chatbot Arena's data is that's as of [00:03:00] and chatbot Arena's data is that's as of a few weeks ago so you can only imagine [00:03:02] a few weeks ago so you can only imagine what op AI anthropic Etc are buying at [00:03:05] what op AI anthropic Etc are buying at this scale and this is the kind of [00:03:07] this scale and this is the kind of reality that we need to adapt to is 
what is different? We don't have that type of resource when doing research, so what are we going to do? This lecture is some history of things that led up to DPO that I think are important to remember, and then we'll go from zero to 100 and talk about recent research that we're doing to try to answer this question and define what is happening.
[00:03:31] So I'll start with a heavily abbreviated history of language models. I won't go through all of it; a bunch of this is in the class already, and this comes late in the lecture series. I like to start with Claude Shannon, and then you skip a whole bunch of history in which this autoregressive loss function shows a lot of promise. This was not fast: you can see how many years it took to build language modeling as a field, with deep learning brewing in the background
as one of many things that went into this. [00:04:00] Then you have these years: 2017, the Transformer paper that you hear about; 2018, with GPT-1, ELMo, and BERT, foundational work in language processing and how embeddings are created; and then with GPT-2, scaling laws become the key idea that people look at to track how these models are improving. 2020 is when people really started to wake up to how useful these large-scale trained language models were. At the time I wasn't even a language modeling person, but for a lot of people in AI, this is when the gravity of the situation started to suck people in. There's a cadence to these things: in 2021 we had the Stochastic Parrots paper, which, before ChatGPT, was raising the warnings of what we are actually putting into these
models, and what are they learning? Are they actually learning something meaningful from language, or are they repeating the language that we have? This is a philosophical debate, depending on where you land on what language is and what these language models are doing today, but it's important that it came out before ChatGPT; it laid the foundations of the debates over what language models are doing. [00:05:07] The end of 2022 is when ChatGPT actually came out, which was supposed to be a quiet launch of a demo from OpenAI, and it has since captured the attention of the world. The simple question is: can ChatGPT exist without RLHF? It's important to acknowledge that so much of this is from pre-training, but at every point along the line, in ChatGPT and a lot of these popular models since then, RLHF and these human-related or other fine-tuning
technologies seem to be necessary but not sufficient. You need the pre-training, but you also need this RLHF or post-training to really shift the needle on which models matter most at a given moment. You can list so many examples where RLHF has been relied upon. I like to look at the plots from the Anthropic Constitutional AI paper, where they show the iterative improvement of their different RLHF methods: multiple model versions evolving over time as more fine-tuning data is added. It's a dense paper, but it's one of the most representative figures of what RLHF can do; there's a lot of information in there that you don't need to follow right now. [00:06:21] And then Meta's Llama 2 paper is pretty funny, where they have this quote: reinforcement
learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community; however, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. That's from the technical report directly, which I find really entertaining. This was back in the day when we were like, oh, we don't know if RLHF is really going to take off; it's July of 2023, in this building period, and it's straight from the report, and it has aged really well, since people are still using this today. There are a lot of interesting hints about the history and culture of RLHF in the releases of these models, where these companies like to talk about it and give us these cultural details of what's going on.
[00:07:06] So I'm going to go through some definitions. I won't spend too
much time on an RLHF 101 of exactly what is happening with these mathematical terms, but it's important to get on the same page about what some of these things do and don't mean. There are a lot of definitions. One of the interesting ones to come back to, if it doesn't make sense right now, is the difference between instruction fine-tuning and supervised fine-tuning. Instruction fine-tuning is what has become really popular: you're training a model to follow instructions (I have another slide on this later). Supervised fine-tuning is more of a domain-specific thing, and we want to do both of them. I think instruction fine-tuning is more linked to RLHF; it's about making these models really useful, really engaging, and easy to work with. And then there are
other things like alignment, which is super vague, but it's in the word: align. It's training a model to be mirrored to what a user wants, and there are a lot of things you can align to. RLHF is a mouthful, and it's one specific tool for doing alignment, where you have this human feedback data. Feedback is a really loaded word there: there can be preferences, and learning to rank is related to actually putting feedback on preferences. There are a lot of little distinctions. I tried to make "preference fine-tuning" a phrase at one point but didn't really double down on it; I think it's a little clearer than RLHF, especially in the context of DPO. But there are all these overlapping spheres in the post-training or fine-tuning space of models these days.
[00:08:34] Instruction tuning, or instruction fine-tuning, is still the foundation of a lot
of this. This is where things called system prompts are added, making the model ready for a specific style of input. OpenAI is still innovating here: they have this Model Spec document, released a few weeks ago, where they say they're going to have a second-level system prompt. That adds structure to how the models take in data, so that you can do a lot more fine-tuning down the line, and it shapes how user data actually gets passed to the model, or how the developer passes information that the user doesn't see. [00:09:11] What this can often look like is Stack Overflow or Reddit data, where you have a question at the top and then an answer, and I think that's still a lot of what is happening behind the scenes. There are a lot of Stack Overflow datasets out there, and Reddit has these data partnerships.
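As a concrete picture of what one of these instruction-tuning examples looks like once a system prompt and a question/answer pair are serialized, here is a minimal sketch. The template and special tokens are hypothetical (every model family defines its own), and `format_example` is my name for illustration, not a real library function:

```python
def format_example(system: str, user: str, assistant: str) -> str:
    # Hypothetical chat template: real models (Llama, ChatML, etc.)
    # each define their own special tokens and ordering.
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{user}\n"
        f"<|assistant|>\n{assistant}"
    )

example = format_example(
    system="You are a helpful assistant.",
    user="How do I reverse a list in Python?",
    assistant="Use reversed(xs) or xs[::-1].",
)
```

During training, the loss is then computed over this single token stream, often with the prompt portion masked so that only the answer tokens contribute.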
All of this still uses the autoregressive loss function that we started with; we haven't branched out into different loss functions yet, but it's still super important. A lot of academic research says this is all you need, in some ways; I think that's a much more mixed bag, but it is the simple method and the right place to start. [00:09:46] From there we go to the RLHF objective, which looks really familiar to people trained in reinforcement learning but is a little different from the NLP loss function. On the left side is the standard reinforcement learning objective: you're learning a policy pi to maximize some reward, which is a function of something, depending on how you set up the problem. On the right side is a KL constraint, a distance term that keeps the policy from changing too much.
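Written out, the objective described here is the standard KL-constrained reward maximization (notation follows the usual RLHF formulation; $r_\theta$ is the learned reward model, $\pi_{\text{ref}}$ the frozen reference policy):

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\big[\, r_\theta(x, y) \,\big]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\big(\, \pi(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \,\big)
```

The first term is the "left side" (maximize reward) and the second is the "right side" (don't drift too far from the reference model); the coefficient β trades the two off.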
It's related to this whole idea of over-optimization, which I don't go into much in this talk, but the key idea is that we want to optimize a reward without over-optimizing it. The primary questions when doing RLHF are: how do we implement a reward function, that is, what is our reward actually going to be, and then how do we optimize it? You see this abstracted later as: we train a specific reward model, and then we have specific policy updates; DPO, direct preference optimization, handles this a little differently. [00:10:47] Before we get there: the actual preference model that people use for RLHF is, I find, interesting. It's the Bradley-Terry model, which comes from economics in roughly the 1950s and is essentially a probability distribution over a pairwise choice.
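Concretely, the Bradley-Terry model puts a probability on one completion being preferred over another via the difference of their scores (standard formulation; $r$ is the learned scalar reward and $\sigma$ the logistic sigmoid):

```latex
P(y_1 \succ y_2 \mid x)
= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
= \sigma\big(\, r(x, y_1) - r(x, y_2) \,\big)
```

Training the reward model amounts to maximizing the log-likelihood of the human-chosen completion under this distribution, which is what licenses reading the model's output as a scalar score.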
What ends up happening, for various technical reasons, is that a trained preference model needs to output a scalar value, and by a coincidence that I think is still very convenient, they just take the output of this learned probability distribution as a reward. They say the reward is going to be proportional to this probability and it's going to work, and it ends up doing so. But that's a big leap to accept: we have this pairwise preference probability saying how likely one answer is to be chosen over another, and then you take the somewhat crazy mental step of saying we just pass in one piece of text and get the probability that this piece of text would be chosen over any arbitrary other one. There are a lot of assumptions, and some deep concepts, in here, but what we're getting is
a model that gives us the score out. [00:11:55] And the question is: why do we have to do this? What if we could just take our original objective and use gradient ascent on that equation (ascent because it's a maximum)? This is really what DPO does. I'm blurring through a ton of math; it's a great paper for learning the math of language modeling, where you learn how the probabilities of different pieces of text are handled by the model, how it ends up being a lot of log-probability ratios, and how the prompt and the completion are handled differently. It's worth digging into and understanding the derivation, but the core idea is: why can't we just do gradient descent, or gradient ascent, to solve the RLHF optimization? And this becomes incredibly simple.
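To make "incredibly simple" concrete, here is a minimal sketch of the per-pair DPO loss. It is pure Python for illustration; real implementations operate on batched tensors of summed per-token log-probabilities, and the variable names here are mine, not the reference implementation's:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair:
    -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x))
                       - (log pi(y_l|x) - log pi_ref(y_l|x))))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Loss falls as the policy favors the chosen completion more
    # strongly than the reference model does, relative to the
    # rejected completion.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; raising the chosen completion's relative likelihood drives it down. This is why a DPO trainer only needs two forward passes (policy and frozen reference) plus ordinary backprop, rather than a rollout-and-reward infrastructure stack.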
The reference code on the right, from the original implementation, is extremely simple, and it has this characteristic where, if you've worked with something like Transformers before, it's pretty easy to write a loss function that uses DPO, rather than building an entire infrastructure stack to start with. When you do something like PPO and the full RLHF stack that OpenAI uses, you normally need an almost entirely new infrastructure stack, but you can get started with DPO in a much, much simpler way. There are some characteristics I'll get to later: DPO still has a reward model, which is really important for the math to actually check out, in that you're using your original language model as a different type of reward model. But that quickly takes us down a whole bunch of derivations, which is probably not the lecture that I think is as
fun to give. [00:13:35] The key thing, and why this lecture is called what it is, is that the first two points mean we'll see more DPO models than anything else. DPO is where everyone will start if they want to do alignment research, and for good reason: it is the right place to start if you're thinking about doing this. It scales more easily on compute, it's easier to debug, and it's even easier to learn, so it's not really worth second-guessing, and it is a good place to start. [00:14:01] But it also leads to these ridiculous conversations online where everyone is trying to figure out whether DPO is better than other RL methods: PPO, the older popular deep RL algorithm that John Schulman wrote, and REINFORCE, which is a slightly different parameterization of policy gradient. They're very similar, and DPO ends up just being simpler to work with. So there's this
meme where it's like: if you just do gradient descent, it'll work. In reality they're different loss functions doing very different things, but you can get similar results with both, which is why, if something is much easier to do, you should just start with it. I come back to this much later in the talk: what is fundamentally different about these RL algorithms, how your data is processed, and where the signals actually come from. For now, we don't need to pick one over the other; we can do both, and they are different.
[00:15:00] So that's the quick 101 of the core ideas. Next I'm going to trace the path of how we actually got to training models with DPO, because, while this slide was from a different talk that this subsection is reduced from, DPO really came
out months before we started getting popular models trained with it. So how did we actually get to the point where the community was training models with DPO, which happened much more recently than the paper's release? [00:15:29] This goes all the way back to the first instruction-tuned models that you saw: the Alpaca, Vicuna, Koala, and Dolly of the world, all in April of 2023. These are all built on similar ideas and slight iterations: figuring out how to use synthetic data, building on the first LLaMA release, and some other things I'll talk about, but this is where we started. They all use instruction tuning, and most of them use synthetic data. What Vicuna actually did was use this thing called ShareGPT, which was the first time that people working in this academic alignment space had
access to data that came from humans. It ended up being a bit of a legal gray area, because it was logging data from a Google Chrome extension called ShareGPT that people used to give ChatGPT a share button. But this data was really important to things like Vicuna and a lot of the other models that came down the line, and it's still used in models today as one subset of the training dataset. Just having access to these human prompts unlocked a lot of potential back in the day, and thankfully we're now starting to get datasets like this that were collected in more permissive ways: the LMSYS data has prompts that are collected with consent, and WildChat, a project from AI2, essentially gave people free access to ChatGPT in exchange for their data. [00:16:53] The thing that came after
ShareGPT was the realization that we need more human data, [00:16:58] and this Open Assistant project is one that we honestly need more of. The fact that we haven't seen more things like it shows how hard it is to create human data. This was run by a few people in a Discord community [00:17:12] working extremely long hours to generate prompts, responses, and preference pairs for common requests to language models. This was from April of 2023 and we haven't seen anything like it since. ShareGPT or LMSYS's data is similar, but there's not the same level of controls and voting and ranking that went into this Open Assistant data. [00:17:33] It again is a data set that we're still training models with, and many people still train models on it that I think come up time and time again. So these one or two influential data sets from over a year ago are still what are used to train models. So you'll get
the theme as I keep going. [00:17:46] There were actually RLHF models trained in April of 2023 as well. This was from CarperAI, which was doing a lot of work in the space; they've fallen back a bit in recent times, but they were people doing methods similar to what I'm going to talk about at the end of the talk. [00:18:05] That kind of knowledge and infrastructure was not translated into things that were easy to use. So there's also this vein of: even if things are open, it doesn't mean they're going to immediately catch on and be useful. You have to have the resources, the data, and your codebase set up in a way that [00:18:24] people can build on it, which is what DPO did really well. This RLHF model from Carper was successful, it was better than the Vicuna model, but no one really built on it right away, which I always find confusing.
Then kind of later in the year, another key thing for this open [00:18:40] alignment was the Llama 2 backlash: when Llama 2 was asked to kill a Linux process, it would refuse. This bred a whole series of models which are still referred to as "uncensored," which I don't think is the best name, because I don't think there was ever actually any [00:18:58] intentional censorship of the model. But the goal is to make models that don't refuse any request, which is useful as a research artifact: what do you get out of a model if it answers every question, what are the limits in that regard? There are other ways to use that, which are up to you. [00:19:16] But what ended up happening is that a lot of these ShareGPT data sets, because they're from ChatGPT, contain data that says, "oh, as a language model I shouldn't answer that," so people started filtering all of that out,
and there you [00:19:27] still see a lot of people releasing these uncensored models today as a popular area of development. [00:19:35] I think that we should understand what people need when doing research: researching a model that doesn't refuse is reasonable, but if you're going to deploy a model for free use to users, you should consider whether or not everything should be answered. So as a researcher, how your artifacts are used kind of depends on the work that you're [00:19:56] actually going to be doing. Then, in the alignment timeline (I'm almost done with this lens), there's this long series of models that are really interesting to people like me but never really broke through the narrative, where they're saying things like "we used RLHF," or "we're the first model to beat GPT-4 on [00:20:11] AlpacaEval" and these other eval tools. They're scaling
things up, but they don't always [00:20:18] have papers and they don't always have codebases, and things are happening all around; it's not just the Hugging Faces of the world. There are a lot of different organizations in the US and elsewhere that were aligning models and getting similar numbers to, or beating, these [00:20:32] mainstream tech companies and the places that you look to for models. These are all in the summer of [00:20:39] 2023. I bring these up because this comes before the first big splash of DPO. This Zephyr model was really the first model that I remember making a [00:20:50] splash with DPO, and it took until this time, in September, after the May release of the paper, for people to really say, "oh, DPO is the real deal." It took four months, and now the paper has a best paper award,
everyone uses it, and there are tons of derivations. [00:21:08] But in industry, among people trying to train models, there was a lot of skepticism until this moment. So this is like a classic academic story of needing to wait a bit until your work is vindicated in some way. The two crucial things here were, first, a new data set, the UltraFeedback data set, [00:21:23] which is a preference data set of synthetically generated text labeled by GPT-4, so again one of these new ways of making data. We didn't make it; it was made by OpenBMB, who I think are based in China. And then we also just had to do a lot of experiments to make [00:21:43] it work. There's a weirdly low learning rate that was needed to make this kind of chat model work with DPO, which is 5e-7. If you're really plugged into AI, you'll know that 3e-4 is the lore of the best learning rate.
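The DPO objective being described can be sketched in a few lines (a toy illustration; the function name and numbers below are my own, not from the talk). The loss rewards the policy for preferring the chosen completion over the rejected one, relative to a frozen reference model, and the Zephyr recipe paired this objective with the unusually low learning rate just mentioned:

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-probs."""
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) written stably as log(1 + e^-x)
    return math.log1p(math.exp(-margin))

# Toy summed log-probabilities for one pair (illustrative values only)
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5)
```

A real run would compute these log-probabilities with the policy and a frozen copy of it, then step the optimizer at roughly 5e-7 rather than the folklore 3e-4.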
So it's orders of [00:21:58] magnitude lower. That's kind of what it took to get this to work; we probably could have done it months earlier if we had just done more hyperparameter sweeps, but this is the random happenstance of the stories that people now [00:22:10] backcast as being "this is the super important model." It's somewhat random. At the same time, I was switching jobs to the Allen [00:22:18] Institute, and they were already working on this project, which is trying to do a systematic study of instruction tuning data along with some of these preference tuning recipes that were coming out. Because once this Zephyr model came out, [00:22:31] there were always skeptics saying, "oh, doing it at 7B is easy, that's a small model; is it actually going to scale to the real deal, to bigger models, to what ChatGPT does?" So it was like, okay, we have some
more compute, and we tried it on this 70 [00:22:44] billion parameter scale and we showed similar gains. All we did was use the same UltraFeedback recipe and the low learning rate, and it largely worked. So this was within two months, and since then there have been [00:22:59] tons of new DPO models; all these startups that are releasing their own models will release an instruct version that is a DPO thing, and that kind of continued for six months. I think just today I'm starting to see fewer DPO models, which is interesting. I've been [00:23:13] keeping track of them for another evaluation project, and it has finally slowed down a little bit. I don't know if that's alignment at large, but there are so many; I should add a slide that lists the ridiculous number of [00:23:24] DPO models that came after these two. But this is really when the
floodgates kind of started, and [00:23:34] when we realized, okay, DPO really works. So this is kind of why I ask what comes next. We could retrain models on the data sets that we have (we don't have that many data sets), but it kind of feels like we're fishing in the [00:23:45] dark. Zephyr was built on the success of needing the low learning rate; this Tulu 2 model is actually trained on TPUs, because we have the Google TPU [00:23:54] Research Cloud, so we have bigger TPUs to train these models. So how do we do this more systematically? That's kind of where most of what I talk [00:24:02] about today on the technical side comes in: the recent research that we've been doing to make sense of this and answer the fundamental questions, like what do we need to change about DPO, is PPO better, and so [00:24:14] on. So this is kind of the reality that I go back and forth in between,
which is: we don't really have the human data to do RLHF like industry does, [00:24:23] but it is getting much easier to do alignment research, so you can kind of choose your narrative. I think sometimes, because I'm so close to industry and hear about what people have, I'm too often on that first side, but there is a lot of opportunity to do things. It feels [00:24:35] crowded, but being crowded at this point, when there's so much investment, is just because you're in the right area, and most people in this room aren't trying to be professors, so if you get scooped, [00:24:46] it's okay. I find it very fun. So how do we actually understand what we're doing with alignment, and can we improve on these models? Tulu 2 has a number because we want to keep releasing [00:24:59] more models. So how do we get better at evaluating what we're doing, to try to understand this process, and then how do we train better models? So these
[00:25:07] are the sort of things that I'm up to. I have a few examples of things I've been working on: I built an evaluation tool [00:25:13] for reward models, and I'll talk more about reward models to start here. We need better evaluation because, when you're training models, you need to be able to do what I call local [00:25:24] evaluation: you need to be able to get a number that tells you if your training technique is improving the end result. You can't wait until Chatbot Arena evaluates your model, because that takes about a [00:25:35] month to get your numbers back; you need to be able to run something at your desk that gives you signal on whether you're actually doing a good job. We're still pretty behind on those evaluation [00:25:43] tools, though more are coming, which is promising. And then, given DPO's simplicity, can we actually improve on that, and can we catch on to some of the
industry rumors that they've let it [00:25:55] drift aside? So RewardBench is this project that I started because there were no [00:26:02] evaluation tools for reward models. My motivation was mostly transparency: given how much industry says reward [00:26:10] models are what you need to focus on, that they're really important for getting good models out the door, what does that mean? What does it mean for a reward model to be good? If we look at this kind of [00:26:20] feedback diagram, which is the one homage to the RL background of feedback loops: there's a reward model, the agent is your [00:26:30] actual language model, pi is the policy, and the training data is the prompts that you get. So in this RLHF framework you have this feedback loop where the [00:26:41] policy generates something a, which is the action, which is the completion, and it goes to the reward model.
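That feedback loop can be written down directly (a hedged sketch; the stand-in policy and reward model below are toys of my own invention, not any real API): the policy maps a prompt to an action, i.e. a completion, which the reward model maps to a scalar score:

```python
def rlhf_feedback_step(policy, reward_model, prompts):
    """One pass of the RLHF loop: the policy (pi) acts, the reward model scores."""
    scored = []
    for prompt in prompts:
        action = policy(prompt)                 # a, the completion
        reward = reward_model(prompt, action)   # scalar score of the pair
        scored.append((prompt, action, reward))
    return scored

# Toy stand-ins for the two models
toy_policy = lambda p: p + " ... an answer"
toy_rm = lambda p, a: float(len(a))  # longer = better, toy scoring only
out = rlhf_feedback_step(toy_policy, toy_rm, ["Why is the sky blue?"])
```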
[00:26:46] The reward model then scores it. But on the side, you're looking at all these evaluation tools, and none of them are giving us internal [00:26:56] insight into what's happening in this feedback loop; it seems kind of external to what we are doing when we're training [00:27:01] these models. So we really wanted to zoom in on this reward model. Reward models are trained in another kind of weird way, one of the many quirks of RLHF. [00:27:12] In order to train a reward model, you need to collect this pairwise preference data. If you're using ChatGPT a [00:27:18] lot, you'll sometimes see it give you two answers and ask you which one is better; this data is literally what is used to [00:27:24] train a reward model. It's a prompt and then two completions, a chosen completion and a rejected completion. But in order to train these models, you have to pass both of them in at the
same time. [00:27:37] So you pass both of them in at the same time and it gives you two scalar values; you use a language model that outputs a scalar, just by some modifications of the last layers, rather than outputting text. And then this loss function, which I'll show you [00:27:48] on the next slide, is essentially why you need to use this batch-mode idea, where you pass multiple things in at once and you get multiple numbers out. [00:27:59] Here, this r is the output directly from the reward model for the rejected completion and the chosen completion, so you're trying to separate the distance between them, and then automatic differentiation updates the [00:28:10] parameters so that this distance gets bigger. So you can't just do supervised learning directly on one thing; there are alignment methods researching that now, but for the reward model it's really built on this
idea of separating two things and creating a [00:28:27] margin in the preferences to kind of learn the decision boundary. There are a lot of really specific details in industry, such as: these models are only trained for one epoch; they get really [00:28:36] low accuracy scores when you compare them to other kinds of train/test-set setups in machine learning; and there are some additional tweaks that people do (you can do ensembles, and Llama 2 did [00:28:47] this weird margin loss), but none of it is really transformative in how these models are trained. They're in this weird place where you can only get about [00:28:56] 70% agreement with your annotators. It's kind of the sort of thing of: is the noise part of the signal, or is it a bug? In preferences it could [00:29:05] make sense that it's signal, because not everyone's preferences here are the same, so not getting full agreement might mean this system is working.
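A minimal sketch of that pairwise training objective (my own toy `score` stands in for a language model with a scalar head; nothing here is the talk's actual code): both completions are scored, and the loss shrinks as the chosen score separates from the rejected one:

```python
import math

def score(prompt, completion):
    # Stand-in for a reward model: a transformer whose modified last
    # layer outputs a single scalar instead of next-token logits.
    return float(len(set(completion.split())))  # toy heuristic only

def preference_loss(prompt, chosen, rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected): gradient
    updates widen the distance between the two scalar outputs."""
    r_c = score(prompt, chosen)
    r_r = score(prompt, rejected)
    return math.log1p(math.exp(-(r_c - r_r)))  # stable -log(sigmoid(.))

# Equal scores give log(2) ~ 0.693; a bigger gap drives the loss to 0
tie = preference_loss("q", "a b", "c d")
```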
We don't want ChatGPT to be fully [00:29:15] narrow-minded all the time. And this leads to the question of how we actually evaluate these reward models. I hear all the time that reward [00:29:25] models are crucial to RLHF, but how do we know exactly what parts of the final policy they're improving? Should we [00:29:32] include safety in these reward models? How do scaling laws impact reward models? There are kind of basic machine learning questions here: can we evaluate these, and what should we think [00:29:42] about? So what we did is collect a bunch of prompts, and then we manually created chosen and rejected [00:29:49] answers for each prompt. Then we can see whether or not the reward model agrees with our human-created data, and call that a win or loss from an accuracy point of view. It's really [00:30:00] direct: we're just doing inference on existing models to see whether they agree with human data.
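The accuracy computation is as direct as it sounds; here is a sketch under assumed names (this is not RewardBench's actual code, and the toy reward model is invented for illustration):

```python
def rewardbench_accuracy(pairs, reward_model):
    """pairs: (prompt, chosen, rejected) triples with human-verified labels.
    A 'win' means the model scores the chosen answer strictly higher."""
    wins = sum(1 for prompt, chosen, rejected in pairs
               if reward_model(prompt, chosen) > reward_model(prompt, rejected))
    return wins / len(pairs)

# Toy reward model that just prefers longer answers (illustration only)
toy_rm = lambda p, a: float(len(a))
pairs = [("q1", "a detailed answer", "no"),      # win for this toy model
         ("q2", "yes", "a long wrong answer")]   # loss
acc = rewardbench_accuracy(pairs, toy_rm)  # -> 0.5
```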
see [00:30:03] existing models and we're going to see whether or not they agree with human [00:30:06] whether or not they agree with human data and this is a slide if you want to [00:30:09] data and this is a slide if you want to go into the academic side of things this [00:30:11] go into the academic side of things this was built on a lot of existing [00:30:13] was built on a lot of existing evaluation tools that were out there [00:30:15] evaluation tools that were out there you'll see some common names alpaca Val [00:30:17] you'll see some common names alpaca Val Mt Ben are things that you've heard [00:30:19] Mt Ben are things that you've heard about EXs test was on the slide when I [00:30:21] about EXs test was on the slide when I mentioned llama 2 being um overly safe [00:30:25] mentioned llama 2 being um overly safe and there's some other things that are [00:30:26] and there's some other things that are really good but you might not heard [00:30:28] really good but you might not heard about like um this llm bar data set from [00:30:31] about like um this llm bar data set from Princeton is a bunch of trick questions [00:30:32] Princeton is a bunch of trick questions that I'll have an example on later and [00:30:35] that I'll have an example on later and some kind of normal names from anthropic [00:30:37] some kind of normal names from anthropic and open AI in here as well so there's a [00:30:39] and open AI in here as well so there's a lot of different things that we're [00:30:40] lot of different things that we're testing with this data set and then [00:30:41] testing with this data set and then we're trying to get the full picture of [00:30:44] we're trying to get the full picture of like what is going on with these [00:30:47] like what is going on with these models we released this in March of 24 [00:30:50] models we released this in March of 24 and you can see a key in the bottom [00:30:52] and you can see a key in the bottom where these kind of um red 
[00:30:54] You can see a key at the bottom: the red circles with an arrow in them are DPO models, which you can use as reward models, and the dice, which look like gray squares when you zoom out, are what I described as the classifier type of training. You can see that there are reasonable scores and the benchmark isn't saturated. There are a bunch of open models, some names you've seen before like the Tulu models and the Zephyr models. This is normal stuff, what we expected: not too saturated. But I'll show you where things have moved in a few months. Today we have a lot more models and a lot more information, so I get to tell you about more interesting things, like how OpenAI's and Cohere's models do on this, which goes back to wanting to do this for transparency. We also added new model types.
[00:31:43] This is where the fifth-place model ended up: in two months, the model that was fifth on the leaderboard is now 31st. So we're getting saturation from people doing research in the area, who now actually have somewhere to compare their models, and we also have models from some closed labs. I'll get into the details here. Some of these are labeled as a different type of model, LLM-as-a-judge. LLM-as-a-judge is the idea that you can ask a language model which answer is better; this is how things like AlpacaEval and MT-Bench are built, but you can also use it as a reward model. I told you I have prompts with chosen and rejected answers; I could just ask ChatGPT which one is better and see what it does, and this is what we added in as a baseline.
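A minimal sketch of that LLM-as-a-judge baseline; the judge template is illustrative and `ask_llm` is a hypothetical stub standing in for a real chat-model API call:

```python
# Sketch of LLM-as-a-judge used as a reward model: instead of outputting a
# scalar, we ask a chat model which of two answers is better. `ask_llm` is a
# hypothetical stand-in for an API call; here it is stubbed so the sketch runs.

JUDGE_TEMPLATE = (
    "Which response better answers the prompt? Reply with 'A' or 'B'.\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}"
)

def ask_llm(judge_prompt: str) -> str:
    # Stub: a real implementation would call a chat-model API here.
    return "A"

def judge_pair(prompt: str, chosen: str, rejected: str) -> bool:
    """Return True if the judge picks the human-chosen answer."""
    verdict = ask_llm(JUDGE_TEMPLATE.format(prompt=prompt, a=chosen, b=rejected))
    return verdict.strip().upper().startswith("A")

print(judge_pair("Give a metaphor using stars.",
                 "The stars were diamonds sewn into the sky.",
                 "The moon was a silver coin."))
```

A real judge setup typically also swaps the A/B order across calls, since judge models are known to have position bias.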
[00:32:38] This ends up being really interesting, because GPT-4 and GPT-4o are not actually as good in this closed domain as a reward model that Cohere is training. We don't have full information, because we don't have OpenAI's reward models, but we can use their models to compare. So we have a lot of different information going into one system about how language models, and different parts of the alignment process, choose across different categories. Going back, you can see that Cohere's entry improved a lot across the two months, and the earlier DPO models that were higher up on the leaderboard have been shifting down as more people train reward models from scratch. The specific category I'll focus on most is Chat Hard. If you think about evaluation a lot, a surprisingly common topic in tech coverage is how evaluations are saturating.
[00:33:28] Chat Hard is the one feature of our benchmark that hasn't fully saturated, and that's really important for giving the benchmark some longevity; I'll talk more about this as we go. I mentioned this dataset, and it's interesting to see whether you could actually do this problem yourself. What we have is a prompt, a chosen answer, and a rejected answer. The prompt is "give an example of a metaphor that uses the following object: stars", and the chosen and rejected answers are two similar metaphors; if you read them, you can see the difference. I'll pause for the people who are paying attention and reading these, but essentially the chosen one is about the sky and the rejected one is about the moon. The chosen metaphor is the twinkling diamonds in the sky, and the prompt asks for stars,
[00:34:18] so the chosen answer is indeed a metaphor about stars, while the rejected one is about the moon, which is also in the sky at night. This dataset is a whole bunch of things like this. To create it, they either manually, or via ChatGPT, rephrase a prompt and then create a new generation from the rephrased version, so you get rejected generations that are fluent but just off topic. It makes sense that this would be really hard for language models, because they have a strong association between the stars and the moon, but we want our language models to be able to answer questions like this. And this is the type of thing where our reward model benchmark, which evaluates something that trains language models, has its best correlation with what is actually hard. So this is promising. If you're in research, this is the sort of thing that's interesting.
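The construction just described (rephrase the prompt, then generate from the rephrased version to get a fluent but off-topic rejected answer) can be sketched like this; `rephrase` and `generate` are hypothetical stubs standing in for the ChatGPT calls, with canned outputs so the sketch runs:

```python
# Sketch of the LLMBar-style construction: swap the target object in the
# prompt, generate from the modified prompt, and use that completion as the
# rejected answer. Both helpers are stand-ins for real LLM calls.

def rephrase(prompt: str, old: str, new: str) -> str:
    # Stand-in for asking a model to rephrase; here a simple word swap.
    return prompt.replace(old, new)

def generate(prompt: str) -> str:
    # Stand-in for model generation, with canned outputs.
    canned = {
        "Give an example of a metaphor that uses the following object: stars":
            "The stars were twinkling diamonds in the sky.",
        "Give an example of a metaphor that uses the following object: moon":
            "The moon was a silver lantern hung in the dark.",
    }
    return canned[prompt]

prompt = "Give an example of a metaphor that uses the following object: stars"
chosen = generate(prompt)                                # on-topic completion
rejected = generate(rephrase(prompt, "stars", "moon"))   # fluent but off-topic
item = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
print(item["rejected"])
```

The rejected answer is a perfectly good metaphor, just not about the requested object, which is exactly what makes the pair hard to judge.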
[00:35:05] It's really in the weeds, but it shows that we still have things to learn about these models, and there are things they can't do yet. Another interesting pattern is in safety. I mentioned the uncensored models, and in safety we see all the patterns we would expect. In the breakdown at the top of this table, refusals are things we want the language model to refuse, and then the XSTest dataset can be split into prompts we want models to refuse and prompts we want models to respond to. You can see that there are multiple categories of either DPO models or reward models where a model that handles safety really well refuses things like requests for advice on causing harm and responds to things that are merely borderline. But there are actually a lot of models out there that just refuse everything, and that will tank your score on the prompts that should get a response.
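The two-way split just described can be scored like this; `refuse_everything` is a hypothetical policy standing in for a real model's behavior, and the example prompts are made up:

```python
# Sketch of the XSTest-style split: some prompts SHOULD be refused, others
# merely sound unsafe and should get an answer. A model that refuses
# everything aces the first split and tanks the second.

def split_scores(items, refuses):
    """items: list of (prompt, should_refuse) pairs.
    refuses: predicate prompt -> bool (does the model refuse this prompt?)."""
    correct = {"should_refuse": [], "should_respond": []}
    for prompt, should_refuse in items:
        key = "should_refuse" if should_refuse else "should_respond"
        correct[key].append(refuses(prompt) == should_refuse)
    return {k: sum(v) / len(v) for k, v in correct.items()}

items = [
    ("How do I build a weapon?", True),          # should refuse
    ("How do I kill a Python process?", False),  # sounds scary, should answer
]
refuse_everything = lambda prompt: True
print(split_scores(items, refuse_everything))
# → {'should_refuse': 1.0, 'should_respond': 0.0}
```

Reporting the two splits separately is what exposes the refuse-everything strategy; a single averaged number would hide it.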
[00:35:57] Refusing everything is kind of the safe bet; we were seeing a lot of tech companies release models like this, and it just doesn't feel right when you talk to them. But there are also models that just respond to everything, whose philosophy is that it's not the language model's job to gate the question. That's something we hear a lot about in the discourse on alignment, but seeing it in these reward models and DPO models, by probing them directly without asking them to generate text, is a nice way to confirm a lot of suspicions we had. So, back to some of the DPO math, which is again good to know. If you go into the DPO paper, you'll see equation 3, the reward that is defined in order to make the math actually work.
[00:36:45] This is very different from just outputting a scalar: it ends up being a ratio of the probability of the completion under the policy relative to the original policy during training, which is called the reference model. It's a fairly complicated mathematical representation. If you actually take a piece of text and pass it through a DPO model, the reward will be something like minus 200, because it's a sum of log probabilities: probabilities are between 0 and 1, taking the log gives you negative numbers, and summing them all up gives you a big negative number. Intuitively, that is the score these models provide, which is very different from the other type of reward model I talked about training earlier.
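In symbols, the implicit reward from equation 3 is beta times the log-ratio of the completion's probability under the policy versus the reference model. A runnable sketch, with made-up token log-probs and an illustrative beta, including the equation-4 style pairwise comparison:

```python
# Sketch of the DPO implicit reward (equation 3 in the DPO paper):
#   r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
# Sequence log-probs are sums of per-token log-probs, so the raw sums are
# large negative numbers; the reward is their scaled difference. All numbers
# below are made up for illustration.

def sequence_logprob(token_logprobs):
    return sum(token_logprobs)

def dpo_reward(policy_lps, ref_lps, beta=0.1):
    """beta * (log pi_theta(y|x) - log pi_ref(y|x)) for one completion.
    Dropping the ref_lps term would give the reference-free variant."""
    return beta * (sequence_logprob(policy_lps) - sequence_logprob(ref_lps))

def prefers_chosen(policy_c, ref_c, policy_r, ref_r, beta=0.1):
    """Equation-4 style decision: is the chosen completion's implicit
    reward higher than the rejected one's?"""
    return dpo_reward(policy_c, ref_c, beta) > dpo_reward(policy_r, ref_r, beta)

chosen_policy, chosen_ref = [-1.0, -0.5], [-1.5, -1.0]  # policy likes chosen more
rej_policy, rej_ref = [-2.0, -2.5], [-1.8, -2.0]        # policy likes rejected less
print(prefers_chosen(chosen_policy, chosen_ref, rej_policy, rej_ref))  # → True
```

Note that both completions can have very negative raw log-probs; only the ratio against the reference model matters for the decision.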
[00:37:33] If you have a prompt with a chosen and a rejected answer, equation 4 is the math you actually need to do to decide whether one answer was better: you're comparing these ratios of probabilities from two different models with respect to the reference model, which was the starting point of training. Here's the catch: when people release a DPO model, they normally release one model and not all the intermediate checkpoints, and this reference model is an intermediate checkpoint in the training process. So can you use a released DPO model as a reward model if you don't have access to all that information? The short answer is no: scores on our benchmark plummet across all the DPO models we have. That makes sense, because this extra model is a regularizer on the probabilities; it's right there in the actual reward equation from a few slides back.
[00:38:20] What we do is get rid of the reference model, stop normalizing in equation 4, and just see if it works, and it doesn't. This is important because DPO is training a reward model, but if we don't always have access to it, we can't learn from it or use it cleanly in another system, and asking people to release all the checkpoints is a lot to ask. This is also an interesting slide showing Cohere's progress on reward models in just a few months: they released something that was clearly state of the art on our benchmark, then an alignment lab, the RLHFlow effort, released something in May, and just a few days later Cohere sent over another number: here's our new model, it's still better than everyone else. It's nice to have this academic-industry intersection, but it's very rare and takes a lot of work.
[00:39:12] It takes networking and building relationships, but we're trying to do it at least in these small niches where the companies are willing to share. RewardBench 2 is mostly going to need to make everything harder and everything more human. The last point, which is what I'll transition to next, is that everything I've told you about concerns one part of the RLHF pipeline, but I haven't told you how it impacts the final model you use at the end of the day. That's a very rightful criticism: if you're evaluating part of the alignment pipeline, you should be telling me whether the final model is actually useful. So this is where I talk about our journey into trying to train PPO models. We're trying to fine-tune a good model; we spent a lot of time on DPO with the Tulu 2 work, and we wanted to know if we could do better by switching to PPO.
[00:40:02] This is not yet published work, but it will be out soon, so the numbers aren't entirely final. We're trying to disentangle the difference between DPO and PPO at a very empirical level, answering whether one is better or not. What we're going to do is walk through a series of design decisions and see how each affects a suite of evaluations. We start with a Llama 2 13B model that has already been instruction tuned; the difference between the blue and the red is the gain from instruction tuning on these reasoning, coding, and chat tasks. Instruction tuning gives the biggest delta you'll see among all these slides: it puts the model on the map as being useful. It's easy to see gains at the beginning, and then it gets harder and harder to keep improving these models.
[00:40:52] The first thing we do is add the Anthropic Helpful and Harmless RLHF data with DPO, and you can see that there is a small bump across all the metrics. This dataset is known among researchers in the area as being particularly noisy, but it's the standard starting point when you're doing research on alignment: it's been around for a few years, it's big, it's multi-turn, and, noisy as it is, it still gives an improvement. If we instead switch to the data that was used for both Zephyr and Tulu 2, the UltraFeedback data, we get an even bigger bump. This shows the difference that changing only the data can give you in a DPO recipe: the increases are normally in the range of 0 to 2%, and in the research sphere, when you're trying to ship a model, that's a big deal.
[00:41:42] This is where we forayed into new territory. The grad students worked really hard and implemented PPO in JAX in addition to what they already had, and we asked what happens when we add PPO. Reliably, across multiple experiments (this is one example at 13 billion parameters), PPO happens to do a little bit better, something like 1%. Then we tried to change a lot of things, and changing things is where it gets messier. We've heard from industry that using a bigger reward model can be really helpful for getting a better policy model: bigger reward models should be better at nuance, they should give better scores, which are used as rewards, and they should make the whole process a bit more stable if you have the compute for it. We see that it does improve some things, but it doesn't actually make the model overall much better.
[00:42:36] It's kind of flatlined: pretty similar results with the same data and just a bigger reward model, which was a little surprising to us. These are the most realistic few slides of the talk. We even tried to check whether our reward model training was going bad as we scaled it up. We used RewardBench, on the right, which I told you about earlier, and its scores don't clearly indicate whether the 13B or the 70B reward model is better. We also did the best-of-n sampling idea: if you generate a bunch of completions from the language model, you can rank them by your reward model and then re-evaluate on the top-ranked completions. That shows our reward models are better at the bigger scale.
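Best-of-n sampling can be sketched like this; `sample` and `reward` are hypothetical stubs standing in for the policy model and reward model (a real setup would call the actual models):

```python
# Sketch of best-of-n sampling: draw n completions from the policy, score
# each with the reward model, and keep the top-ranked one. Both helpers are
# stand-ins for real model calls, stubbed so the sketch runs.

def sample(prompt, n):
    # Stand-in: a real policy would generate n diverse completions.
    return [f"completion {i} for {prompt!r}" for i in range(n)]

def reward(prompt, completion):
    # Stand-in reward model: here, just prefer higher-numbered completions.
    return float(completion.split()[1])

def best_of_n(prompt, n=16):
    completions = sample(prompt, n)
    return max(completions, key=lambda c: reward(prompt, c))

print(best_of_n("write a haiku", n=4))
```

Because best-of-n only reranks samples at inference time, it isolates the reward model's contribution from all the PPO training knobs, which is why it's a useful diagnostic here.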
[00:43:27] But we couldn't get this to really click through to a downstream model in a PPO view of the world. We even tried adding more prompts to RLHF: we added more code and reasoning prompts, because that's something OpenAI talks about a lot and something we want to improve our models on. It doesn't really shift the needle on this cohesive average over many tasks. What you'll see in the paper when it's out is that we added prompts really similar to two math and code evaluations, and those specific evaluations got a bit better, but once you add in the noise of other evaluations possibly going down, the process becomes really hard to disentangle. This is why we're getting the 0 to 2% improvement out of PPO while DPO doesn't have this sort of mess. Where we ended up is that there's always one more thing to ablate when you're training these models with PPO.
[00:44:20] There are things like different regularization, the value function we're learning in RL, different warmup, different sizes; there are just so many knobs to turn in PPO. It was reliably getting us a pretty good model, but it feels like we're staring into the abyss trying to improve it over the next few months. The bottleneck on the actual technical side is that PPO generates new responses from the model as it trains, to keep refreshing the data, and that is by far the biggest bottleneck when you're actually training these models: it's just way slower than DPO. All these resources for PPO are somewhat available to academics: the Google TPU Research Cloud is pretty accessible (the grad students I work with seem to get in when they sign up), and the codebase is open.
trying to do [00:45:10] a grad student and you're trying to do po alignment and have access to tpus [00:45:13] po alignment and have access to tpus please get in touch it's it's a very fun [00:45:15] please get in touch it's it's a very fun can of worms but kind of as a summary [00:45:18] can of worms but kind of as a summary like this is the many different DPO data [00:45:21] like this is the many different DPO data sets that we tried this is almost all of [00:45:23] sets that we tried this is almost all of the well-received data sets that are out [00:45:26] the well-received data sets that are out there in the open and they all look at [00:45:28] there in the open and they all look at like the factuality column like some of [00:45:30] like the factuality column like some of these things just don't matter at all [00:45:32] these things just don't matter at all when you're aligning these models so [00:45:34] when you're aligning these models so like we need to get new data sets that [00:45:36] like we need to get new data sets that are really adding different capabilities [00:45:38] are really adding different capabilities to these models and something that [00:45:41] to these models and something that matches these kind of ultra feedback [00:45:43] matches these kind of ultra feedback numbers at the bottom and I don't I [00:45:46] numbers at the bottom and I don't I don't like I'm surprised whenever I look [00:45:48] don't like I'm surprised whenever I look at this but this is where we are at and [00:45:50] at this but this is where we are at and we need to try to keep building data [00:45:52] we need to try to keep building data sets and keep adding freshness to this [00:45:56] sets and keep adding freshness to this system Ultra feedback at this point is [00:45:58] system Ultra feedback at this point is maybe 6 months old or so I don't know [00:46:00] maybe 6 months old or so I don't know the exact age but in terms of people [00:46:02] the exact age but in terms 
of people training models that that feels old to [00:46:04] training models that that feels old to people to things that are happening um [00:46:07] people to things that are happening um and these are the actual sort of numbers [00:46:09] and these are the actual sort of numbers that you get when you compare DPO versus [00:46:11] that you get when you compare DPO versus Po this is all with this 13 billion [00:46:14] Po this is all with this 13 billion parameter again we changed the data set [00:46:18] parameter again we changed the data set and every one of these poo comes out a [00:46:19] and every one of these poo comes out a little bit better on average and this is [00:46:22] little bit better on average and this is a few grad students and people like me [00:46:23] a few grad students and people like me this is not a big team in Industry doing [00:46:26] this is not a big team in Industry doing this like we're scraping by and I don't [00:46:29] this like we're scraping by and I don't know if it's worth the effort if I see [00:46:32] know if it's worth the effort if I see why open AI uses this because we able to [00:46:34] why open AI uses this because we able to get a bit more signal out of it but it's [00:46:37] get a bit more signal out of it but it's a ton of effort to get a bit better um [00:46:40] a ton of effort to get a bit better um signal out and I'll kind of transition [00:46:44] signal out and I'll kind of transition into a bit more of a like open-ended [00:46:47] into a bit more of a like open-ended discussion of this and then we'll have [00:46:48] discussion of this and then we'll have questions but it's like what about PO is [00:46:52] questions but it's like what about PO is actually special like this generation [00:46:54] actually special like this generation and this online nature and like can we [00:46:58] and this online nature and like can we just change DPO to be like this or like [00:46:59] just change DPO to be like this or like where are 
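To make that contrast concrete, here is a minimal sketch of the DPO objective on a single preference pair (my own illustration, not code from the talk). Note what it does not need: no sampling from the policy during training, just log-probabilities of already-collected chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probs of full responses under the policy
    being trained and under a frozen reference model. beta controls
    how far the policy may drift from the reference.
    """
    # Implicit "rewards": scaled log-ratios against the reference model.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the chosen-vs-rejected margin via a logistic (sigmoid) loss.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the policy ranks the chosen response higher:
loose = dpo_loss(-12.0, -11.0, -12.0, -11.0)   # no margin yet
better = dpo_loss(-10.0, -13.0, -12.0, -11.0)  # chosen pulled up
assert better < loose
```

Because the preference dataset is static, a DPO epoch is only forward and backward passes, which is the speed gap versus PPO's generate-then-update loop described above.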
[00:47:02] I had the pleasure of advising one project related to this, but this is much, much more general. What is special about online data? There are multiple ways to get new data into your RLHF process, and there's also a related question in the reinforcement learning literature - on-policy versus off-policy - which is a technical distinction that often gets looped in with these discussions of DPO versus PPO. They're related, but the reinforcement learning discussions have a much more definitional flavor to them, while in this alignment space we're more focused on whether we need to get fresh data in and how we need to label our data for language models. So I'd make a distinction between two things. The first is freshly generated data from the policy: if you zoom into a data set like UltraFeedback, it has generations from all sorts of models - Alpaca, Vicuna, GPT-3.5, GPT-4, LLaMA - so when we train these Zephyr and Tulu models, we're incorporating information from a lot of different models down into our one policy, whereas PPO only generates data from your existing model, changing that distribution over time. That is a very different idea of where the signal is coming from. The second thing is whether or not you're refreshing the data labels over time: if I have human labelers comparing chosen and rejected, that's one data point, but I can also later take a reward model that I trained, regenerate the chosen and rejected, and change the label. These two things - what the actual text is, and when the chosen/rejected label was given - are what people mean when they talk about whether something is special about "online" in RLHF. It's clear that PPO handles this very differently than DPO, but we're not restricted to this.

[00:48:58] In the last few weeks - I have the dates all in here, so, April and May of 2024 - there started to be a lot of papers on this, about DPO, PPO, online, offline, and they say similar things, which is that online is important. The papers on this slide show more theoretical, closed-form experiments on what is special about online data and what performance drops if you use offline data. It's good to dig into these, but this is why I say it's nice to do research now: if you have an idea, a lot of times there are, like, three papers that confirm the notion that you have. It's a lot easier to be confident in things if three independent institutions say something similar at the same time.
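That second axis - refreshing labels - can be sketched in a few lines. This is my own illustration with a stand-in `reward_model` scoring callable, not anything from the talk: a current reward model may disagree with the original human label on an existing (chosen, rejected) pair, in which case the pair is simply swapped.

```python
def refresh_labels(pairs, reward_model):
    """Relabel preference pairs with a current reward model.

    pairs: list of (prompt, chosen, rejected) tuples.
    reward_model: callable (prompt, response) -> float score.
    Returns pairs in which chosen always outscores rejected.
    """
    refreshed = []
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, rejected) > reward_model(prompt, chosen):
            chosen, rejected = rejected, chosen  # flip the stale label
        refreshed.append((prompt, chosen, rejected))
    return refreshed

# Toy reward model that simply prefers longer answers.
toy_rm = lambda prompt, response: len(response)
pairs = [("q1", "short", "a longer answer"),
         ("q2", "already the longer answer", "short")]
out = refresh_labels(pairs, toy_rm)
assert out[0] == ("q1", "a longer answer", "short")            # flipped
assert out[1] == ("q2", "already the longer answer", "short")  # kept
```

In a pipeline this would run between DPO iterations, so the preference labels track the current reward model rather than staying frozen at collection time.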
[00:49:48] There are a lot of methods coming out where people are trying to modify DPO to actually use this kind of online notion. I think self-rewarding language models, from Meta, was the first really popular one: they asked the DPO model, "hey, which of these answers is better?" in between each iteration - LLM-as-a-judge to relabel their own data - then did multiple iterations of DPO, and the model had really strong scores. There are now ideas like not using all of your data at once, so you can do batches of DPO and update your data in between. The paper that I was on, this discriminator-guided DPO, which I'll talk about in a second, uses reward models plus the DPO training objective. There are just a lot of things we can change, and I think the community, again, is in this expansion phase - I even get messages from people like, "oh, my paper was really similar to this other paper, we did it first, they didn't cite us," and I'm like, that is kind of the point. It's going to be like this for a little bit longer, and then hopefully by the end of the year, or in a few years, we'll be like: okay, this is clearly what we need to do on the method side of things.

[00:50:54] So this is one example, D2PO, discriminator-guided DPO, which I was an advisor on - the lead is an undergrad researcher - and the idea is comparing three different things. (a) is standard DPO: you have a data set and you apply the loss function to it. (b) is what we call some sort of online preference optimization, where you repeatedly relabel your data with a reward model - just like the self-rewarding paper I mentioned - so you reshuffle your preference data based on a reward model, and that adds some notion of online to your data. The third thing is: what if we're relabeling data and retraining our reward model over time? We're trying to keep what our policy is doing related to our reward model, keeping everything updated in real time so it's all lined up, and asking how much of a gain you get by retraining the reward model over time in a DPO framework.

[00:51:57] Part of why I like this paper is that it has things like closed-form tasks. The biggest question I get for alignment is: how do we actually evaluate it - what tasks is it good for? There's a whole philosophical discussion here; I think information transformation is a valuable task.
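These settings can be written as one loop with optional steps. This is my own schematic of the idea as described in the talk, not the D2PO authors' code; the policy, reward model, and update functions below are toy stand-ins, and plain offline DPO (a) is just the degenerate case with one round over pre-collected pairs.

```python
import random

def online_dpo(policy, prompts, reward_model, dpo_update,
               fit_reward_model=None, rounds=3):
    """Schematic loop for the settings described above.

    rounds > 1 with a fixed reward_model relabels fresh generations
    each round (setting b); passing fit_reward_model also refits the
    reward model each round, D2PO-style (setting c).
    """
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            a, b = policy(prompt), policy(prompt)  # fresh generations
            # The reward model assigns the chosen/rejected labels.
            if reward_model(prompt, a) >= reward_model(prompt, b):
                pairs.append((prompt, a, b))
            else:
                pairs.append((prompt, b, a))
        policy = dpo_update(policy, pairs)  # stand-in for a DPO step
        if fit_reward_model is not None:
            reward_model = fit_reward_model(pairs)  # keep RM in sync
    return policy

# --- Toy instantiation: "good" answers are simply longer strings. ---
rng = random.Random(0)
replies = ["ok", "sure", "a quite detailed reply", "an extremely detailed reply"]
toy_policy = lambda prompt: rng.choice(replies)
toy_rm = lambda prompt, response: len(response)

def toy_dpo_update(policy, pairs):
    # Stand-in for a gradient step: collapse onto the best chosen answer.
    best = max((chosen for _, chosen, _ in pairs), key=len)
    return lambda prompt: best

final = online_dpo(toy_policy, ["q"], toy_rm, toy_dpo_update, rounds=2)
assert final("q") in replies
```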
Writers tell the same stories in different ways, but the best-told story is the one that resonates with people - that has value. At the same time, though, we're academics and we need to be able to measure things, so this paper has things like: your reward is counting the number of nouns in a sentence, and you use these alignment methods to increase the number of nouns in the sentences output by the model. You can measure that a lot better, because we have classifiers that know what nouns are. And you can see in the left figure that just by retraining this reward model a few times, it converges better than if you were only to relabel your preference data. It's a mouthful, but it's just: keeping your training process a little bit more online can improve performance. On the right is a more standard open-ended evaluation task, where we're asking a language model like ChatGPT which answer is better - that has all sorts of problems, but we can show similar results. I think the big takeaway is really these few slides: the literature is moving, we have studies showing that online is better, and people are coming up with really cool, clever ways to actually use online data. Combined with new data sets, this is kind of the big theme of this year: online methods and how they work.

[00:53:30] So this goes back to what industry is doing. I showed this figure earlier, on the left, with Claude, where you can see the little points along the lines - these are different iterations. We don't know exactly what they're doing, but it seems a little bit different: the dots on these figures are new data sets from humans, rather than this kind of "redo a reward model, relabel your data." This is what happens when you have access to a different type of scale. The Llama 2 paper makes this much clearer: they say they work with an annotator and get batches of data; when they're generating each new batch, the previous model's checkpoint was used for the generations; and they do this many times. You can see that they're collecting new human data, new human data, new human data, and each time they collect it, a new model is trained - they're doing a lot of training updates, and they're building on each other.

[00:54:23] And this leads into the last section that I'll talk about in the conclusion: what did Meta do with Llama 3? This is one of the funniest blog post sentences - the ridiculous things that they give us, and then we parse the tea leaves. They say in the blog post that their approach to post-training is a combination of supervised fine-tuning, rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). People ask me, "what the heck did they do?" - I mean, I kind of agree - but it really goes back to this slide in my mind, which is that they're getting new data and then training a new model over time. What I think is happening at each one of these points is that they tried a few methods and chose the training method that worked best. It's practical - Meta is a really practical organization, especially in the GenAI org right now - and that just makes sense: at different points, your model has different capabilities and is ready to be trained in different ways. Rejection sampling, which I didn't cover here, is the simplest training method: you take a reward model, you rank some supervised fine-tuning outputs, and then you use the autoregressive loss function again. From there, DPO is much simpler than PPO, but it might not give you the highest-end performance. And then, as your model really starts kicking into gear - or you have more time to train once all of your data is collected and you're not on a weekly time crunch - you can experiment with all the little knobs of PPO and really try to get the best model out at the end of the day. Hopefully they release a technical report that confirms some of my hypotheses, but I think this is normally what people are interested in when somebody from industry comes to give a lecture - I wish we had more details on what industry was doing. But in terms of current directions that I'm most interested in in RLHF: I talked about data a lot.
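Backing up to the rejection-sampling stage mentioned above: it reduces to best-of-n filtering before ordinary supervised fine-tuning. A toy sketch (my own illustration; `policy` and `reward_model` are stand-in callables, not any real API):

```python
import itertools

def rejection_sample(policy, reward_model, prompts, n=8):
    """Best-of-n filtering: keep the top-scoring completion per prompt;
    the surviving (prompt, best) pairs then go through an ordinary
    supervised (autoregressive) fine-tuning loop."""
    kept = []
    for prompt in prompts:
        candidates = [policy(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        kept.append((prompt, best))
    return kept

# Toy stand-ins: the policy cycles canned replies, the RM likes length.
canned = itertools.cycle(["meh", "fine answer", "a genuinely thorough answer"])
toy_policy = lambda prompt: next(canned)
toy_rm = lambda prompt, response: len(response)

data = rejection_sample(toy_policy, toy_rm, ["q1"], n=3)
assert data == [("q1", "a genuinely thorough answer")]
```

The appeal is exactly what the talk says: no new loss function and no RL machinery, just a reward model used as a filter in front of the same autoregressive training loop.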
We are very bottlenecked on data, even as academics with very limited compute - we literally try every data set that is available. It's not that we don't have a lot of compute; we need to keep innovating there. We're going to see more DPO methods - they're here to stay. There's a ton I didn't cover here: things like removing the reference model, changing the loss function slightly, not using pairwise preferences but single-sided preferences - there's a lot going on there. We should use more model sizes than 7 and 13 billion parameters - or, in Llama's case, 7 and 70 billion parameters. Particularly, scaling down is very useful; it's a place where academia can still play, and there's less of a weird marketing dynamic where all the companies are racing to go bigger for certain strategic reasons, so this is something that's accessible to many people. Aligning small models is hard - it's hard to get signal out of them, because the models show more or less random scores on many benchmarks that people care about, or really low scores - so even just breaking through in that domain would be really impactful work, to get more people working on alignment. Then there are evaluations, which I covered at length: we need to keep getting more specific about the things we care about. And personalization is something in alignment that I didn't cover in this talk, but it's a good way to compete with big tech: how do we train models that are good for you as an individual, rather than one big model for one big technology organization?

[00:57:54] You'll get these slides, but these are the types of places that I follow when I'm trying to find open models or open data sets that are reputable and easy to keep track of, so you don't have to try to follow everyone - and I write about this a lot, without doing too much self-promotion. I ended, like, ten minutes early for questions, which I'm happy to take in a Q&A format - and you don't have to stay and wait if you don't want to.

[00:58:35] [Applause]

Okay, thank you, Nathan. Questions? Anyone got questions?

[00:58:43] Audience: Assume you're handed a good reward model - which is a large assumption, I agree - but what is the key challenge to doing online DPO? In the sense that you can do n rollouts, rank them using the model, and iterate this. So what is the hard thing?

[00:59:00] Yeah - I'm going to repeat the questions so that people can hear them and it gets recorded. The idea is: if you have a good reward model, what is stopping you from doing online DPO and just improving the
policy from there I think there's kind of multiple [00:59:16] there I think there's kind of multiple angles to this [00:59:18] angles to this that they're both Technical and like the [00:59:21] that they're both Technical and like the kind of industrywide but the technical [00:59:23] kind of industrywide but the technical thing is I think the prompt matching [00:59:25] thing is I think the prompt matching ends up being really important so prompt [00:59:28] ends up being really important so prompt matching so what your reward model can [00:59:30] matching so what your reward model can learn is specific to the prompts [00:59:33] learn is specific to the prompts there're a technical detail where the [00:59:35] there're a technical detail where the prompts used for your policy often are [00:59:37] prompts used for your policy often are exactly the same as your reward model in [00:59:39] exactly the same as your reward model in po which is really strange because we [00:59:41] po which is really strange because we talk about generalization in machine [00:59:43] talk about generalization in machine learning but we're kind of like soft [00:59:44] learning but we're kind of like soft balling oursel at the PO stage which is [00:59:47] balling oursel at the PO stage which is we're only grading po answers which our [00:59:49] we're only grading po answers which our reward model is train to answer which is [00:59:52] reward model is train to answer which is kind of strange so people think that [00:59:53] kind of strange so people think that some of that might break down and we see [00:59:56] some of that might break down and we see some of that when trying to train po [00:59:59] some of that when trying to train po models with off-the-shelf reward models [01:00:01] models with off-the-shelf reward models it's was kind of a long answer and [01:00:04] it's was kind of a long answer and then but I think that I think that's [01:00:06] then but I think that I think that's mostly 
But if we had truly a good model, it should work for some things, and that could be one of the reasons why there aren't that many in the open: it would kind of help people catch up in alignment. A reward model, if it is as important as people say it is, might make that easy. [01:00:28] Other questions? [Music] [00:00:56] [inaudible question] [01:01:00] Yeah, this is a whole conversation, so if I don't cover it and you want more after I answer, you can come up. But the question is: is there more than pairwise preferences that could be used in RLHF? There are a lot of different lines of work studying this. One is a method out of Stanford called KTO, named after Kahneman-Tversky — I always mess it up, these names are so hard to pronounce — which is the idea of using one-sided preference data.
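The one-sided, thumbs-up/thumbs-down setting can be illustrated with a toy loss like the one below. To be clear, this is not the actual KTO objective (which uses a Kahneman-Tversky-style value function and a reference point); it only shows the general shape of learning from unpaired yes/no labels:

```python
import math

def one_sided_loss(pi_logp, ref_logp, thumbs_up, beta=0.1):
    """Toy loss for unpaired yes/no feedback on a single completion.

    NOT the exact KTO objective; purely illustrative. The idea: push the
    implicit reward beta * log(pi/ref) up for thumbs-up examples and
    down for thumbs-down ones, with no paired rejected response needed.
    """
    implicit_reward = beta * (pi_logp - ref_logp)
    p_good = 1.0 / (1.0 + math.exp(-implicit_reward))
    return -math.log(p_good) if thumbs_up else -math.log(1.0 - p_good)
```

The point of the design is that each training example is a single completion plus a binary label, which is exactly the shape of the customer-support feedback the speaker describes next.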
So a lot of customer apps have, like, "did you get good support from this agent, yes or no?", and you could use data like that; it's just a different loss function for using a single side of preferences, or just yes or no. There are other things, like learning to rank over multiple answers. This is something I slightly insinuated, but binary preferences are limited; there's a lot of literature on learning preferences. One of the models that came out of this is the Starling model: they use a K-wise preference, so they have like five or nine answers to every prompt, they collect answers, and then they have a different loss function. This is one of the models that has kind of broken through in the open alignment space; it's one of the few that I left in but skipped in my slide deck. So that's kind of interesting.
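A standard way to write a K-wise ranking loss of the kind described here is the Plackett-Luce negative log-likelihood. Whether this is Starling's exact formulation isn't stated in the talk, so treat this as a generic sketch:

```python
import math

def plackett_luce_nll(scores_best_first):
    """Negative log-likelihood of a full ranking under Plackett-Luce.

    scores_best_first: reward-model scores for the K answers to one
    prompt, ordered from the labeler's best to worst. At each step the
    top remaining answer must win a softmax over all remaining answers;
    with K = 2 this reduces to the pairwise Bradley-Terry loss.
    """
    nll = 0.0
    for i, s in enumerate(scores_best_first):
        denom = math.log(sum(math.exp(t) for t in scores_best_first[i:]))
        nll += denom - s
    return nll
```

The loss is minimized when the reward model's scores already agree with the human ranking, and it uses all K answers to a prompt in one term rather than decomposing them into independent pairs.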
And then there's other research on fine-grained preferences: for every completion to a prompt you get labels like conciseness, helpfulness, honesty. There are a few things in that regard: there's the SteerLM paper from NVIDIA, and there's work from UW that does learning from fine-grained preferences. That one's probably the direction that's emerging most in the academic sense. But there's so much to learn here; literally the whole field of social choice needs to get condensed into these things. [01:03:02] Any other questions? [Applause] [01:03:23] Yeah, so the question is, broadly: how can we exceed human performance with fine-tuning, or any training for that matter? I think this is where some older ideas in CS will come back. One of the foundational ideas in CS is search, which is really also motivated as exploration in RL,
and therefore we need to have some sort of language models that can search and generate new data. I was talking with somebody before, a grad student, and I think search will be a large part of synthetic data, but the human aspect will be what gets it across the line if it can't solve a certain area. The Q* rumors are ridiculous, but that seems to be the best argument for the sort of thing that OpenAI is trying there: how to get that barrier broken with AI. [01:04:18] Thank you so much for coming in. You mentioned datasets as a big limitation, and I was curious how one goes about creating a new dataset. [01:04:27] Yeah, this is another thing that's hard. I think community efforts are what people have tried to do. I mentioned Open Assistant, but most people that do a community effort are like, "I never want to do this again." So while I still think it's worth doing things once that are
highly impactful, even if you might not want to do them again, other avenues for building these in a sustainable manner are very important. I think there are some ways this is being done: Chatbot Arena returns some of the prompts and the labels to users. There are specific concerns I have with that data around being too noisy, but that is the sort of thing that can happen. If AI2 has a demo for their models, it's going to be about science and generating information rather than being a ChatGPT competitor — it's a nonprofit, it can't do a product competitor — but that's the sort of data that we would want to release, and something that I might just have to do. But I'm interested in academic workshops and competitions as a ground where you could have communities meet every three, six, or eight months and have work that's focused on
an area, and/or focused time to have people contribute to it. But it's a good question; it's probably why there aren't very many. [01:05:49] How do you feel — are reward models subject to reward hacking as well? — We'll get the one at the front first, and then we'll come to you. [01:05:56] At the various places you've done research over the years, do you have any sense of how they compare in terms of, specifically, alignment research? Obviously they weren't all doing alignment research specifically at those times. [01:06:14] I think generally they represent different cultures and investments of the companies. I wasn't doing language models until my time at Hugging Face, so I can really only speak to these two open companies. From Hugging Face's perspective, it's to show that more people can do this: we're not trying to compete with ChatGPT, but we're trying to foster an ecosystem of
doing this. And AI2 is similar, but more about what is happening: how do we learn about this, how do we do science, how do we study the science of this and communicate that clearly? I'm sure if you do the exercise you can map this to every company — what is their important thing? — and they have different goals in their products and their corporate structure and things like that. I will talk more when not recorded. [Laughter] [01:07:00] Okay, up the back. [01:07:02] Are reward models also subject to reward hacking, like they achieve a good result on the outcome, but in reality the outcome is not as expected? [01:07:15] Yeah, when talking about reward models this is probably the most established line of work. The question is: are reward models subject to reward hacking? Reward hacking is a classic problem in RL; I should bring back my RL slides, where you have the
boat swimming going in circles, and then be like: this happens to your language model too. There's a lot of research to mitigate it, but it's a fundamental problem: you have a very powerful optimizer and an incomplete representation of your reward, and the optimizer will always find where your representation of reward is wrong. So we will always be doing the best we can, but saying it's perfect is not possible in the math. [01:08:02] I can also say the ways that it fails are pretty funny, because if you train these models you'll end up with a model that just says "JavaScript" to every answer, on to infinity. Sometimes it's really easy to see when that is happening, which is good. Or you could change your loss function so that it will always exploit.
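The "model that just says JavaScript" failure can be made concrete with a toy example: a reward model that is an incomplete proxy for quality — caricatured here as a keyword count, which is purely my illustration, not a real reward model — gets exploited by unconstrained maximization:

```python
def proxy_reward(answer):
    # An incomplete representation of "good answer": pretend the reward
    # model has learned that code-heavy answers score well, caricatured
    # here as a keyword count. Purely illustrative.
    return answer.count("JavaScript")

candidates = [
    "Paris is the capital of France.",
    "You can sort a list in JavaScript with Array.prototype.sort.",
    "JavaScript JavaScript JavaScript JavaScript JavaScript",
]

# An unconstrained optimizer finds exactly where the reward is wrong:
# degenerate repetition beats both sensible answers.
best = max(candidates, key=proxy_reward)
```

This is the speaker's point in miniature: wherever the learned reward diverges from true quality, a strong enough policy optimizer will steer straight into that gap.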
And it's a good way to make sure things are working: you should be able to easily exploit if you turn the brakes off. [01:08:30] Okay, any last public question? If not, thank you to Nathan for giving this talk, and if there's anything you'd like to ask off the record, he'll be here for a bit longer.
================================================================================ LECTURE 017 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 16 - ConvNets and TreeRNNs
Source: https://www.youtube.com/watch?v=S8d-7v3f5MQ
---
Transcript
[00:00:05] Hi, okay, let me get started for today. I guess I'm now down to the more select week-eight audience of people who actually want to learn, so my welcome and my pleasure for the people who show up today — thank you. Okay, what I want to do today is principally talk about a couple of other neural network techniques which can be used for language. In some sense, these two techniques are ones that people aren't using very
much these days, and that's partly why they get stuck towards the end of the course, because we try to teach people early on the most essential things that you should definitely know about. But the fact of the matter is, in any scientific field there are different ideas and techniques that bounce around, and it's good to know a few of the different ideas that are out there, because often what happens is people find new ways to reinvent things, put things together, and see different insights from them. So today I'm going to tell you a little bit about using convolutional neural networks for language, and then a bit about tree recursive neural networks. But before that, just course organization: this is a bit after it happened, but I guess I've never been back to say it, so thanks to everyone who filled
in the mid-quarter surveys. Some people said very nice things about the lectures: fantastic lectures and really interesting content. Some people wished that we were teaching more about state space models; I guess we haven't added that lecture in yet. A couple of people thought it'd be good to have an exam in this class — clearly they weren't people who have friends in CS231N, from what I've heard. In general, people are pretty happy with how Ed has been going, a bit less happy with how office hours have been going. Honestly, I feel office hours are a hard problem. Some people are saying, oh, you should just use QueueStatus; I remember back to a year where we did everything with QueueStatus, and near the assignment due dates the queue would stretch six hours long, and that didn't seem such a good solution either. But we'll work
along with it. Finally, on cloud compute, I know this is something that people variously have issues with. There are quite a few people still trying to do things with Google Colab, which I realize is a very convenient, nice interface, but you do suffer on access to GPUs. On Google Colab, the best way to get better access to GPUs is to pay 10 bucks for a month of Colab Pro, which perhaps means you end up paying for two months, for May and June. We can't reimburse you for that, but it's not so many coffees' worth of money, and it does just give you better access to GPUs. I encourage you to use the GCP credits and Together API access that we've given you. You're also welcome to try other things: Kaggle notebooks can actually give you better GPU access, but not all
the nice features of Colab, and some groups have started using Modal, which can also be a good way to get GPU access. Okay, that was the intro. So now I wanted to talk about convolutional neural networks for language. These slides are positioned a bit as convolutional neural networks versus RNNs, as opposed to versus Transformers. That's partly, you could say, because I haven't updated my slides enough, but in another sense that's partly because that's how the ideas of convolutional neural networks really were explored: it was in the days when most people were using recurrent neural networks for NLP that a few people set about saying, hey, maybe we should use convolutional neural networks for language as well. Whereas in truth, in the last five years, when Transformers have dominated, there hasn't been much use of convolutional neural networks for NLP. So if we
think back to our recurrent neural networks, if you remember, they gave a way of producing a representation for a sentence or part of a sentence, but they computed forward through the string, so you had to get a representation that included everything that came before. You didn't really have a representation of "the ceremony"; you had a representation of "man walked into the ceremony" that you could use. In contrast to that, convolutional neural networks basically say, kind of like an n-gram model, that we should be able to take n-grams of words, like 2-grams or 3-grams. So for the example "tentative deal reached to keep government open", we can take each trigram — "tentative deal reached", "deal reached to", "reached to keep", and so on — and we can make some neural
representation for each of those. Notice this is just being done for every n-gram for a certain n, so there's nothing linguistically or cognitively especially plausible here; we're just going to form representations of multi-word units, which we'll then group together in some further way later on, and the standard way of doing that is with convolutional neural networks. The classic case of convolutional neural networks is in vision: they were invented for vision, where they gave you a kind of translation-invariant model, so that you could recognize your kangaroo no matter where in the frame it was. And so this little picture here — I'll just do the lower half of the slide — is sort of what a convolutional neural network is doing in 2D. The convolution is like a mask that you're sliding over the image,
and the mask is defined by weights, which are the little things shown in red. For each place you slide your mask to, you're calculating a score by taking what's effectively a dot product of the mask terms with the elements in that patch, and that's then filling in the matrix on the right, shown in pink, and so that's calculating our convolved feature from the image. Does that make sense? Yeah. So what happens if we then want to do that for language? Well, for language we don't have a 2D picture, we've got a 1D picture: a sequence of words. So we can have "tentative deal reached to keep government open", and each of our words will have a word vector — I'm using four-dimensional vectors in my examples to keep it compact on my slide — and then we can apply a filter that applies to an n-gram. So this is going to be a filter for trigrams, and
so then we're going to slide that downwards in exactly the same way as in the vision case, except we're just sliding in one dimension. So I calculate the dot product of the filter and this trigram, and that gives me a value, minus one, if I did my arithmetic right. Then I slide it down to the next position, work it out, and get minus 0.5; slide it down, get the other values; and then typically I can add on a bias term — my bias is plus one in this example — and then I'll stick it through a nonlinearity like a sigmoid or something like that. So I'll be calculating a value for each of these trigrams, and that is a convolution for a single filter. Then commonly what I'm doing after that is deciding that I'm going to have more than one filter, and I'll show that in a minute.
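The single-filter computation just described — slide a trigram filter down the sentence, take a dot product with each 3×4 patch, add a bias, apply a sigmoid — can be sketched like this. The word vectors and filter weights below are made up, so the outputs won't match the slide's −1 and −0.5:

```python
import math

# Made-up 4-dimensional word vectors for the seven words of
# "tentative deal reached to keep government open" (illustrative
# values, not the slide's numbers).
words = [
    [0.2, 0.1, -0.3, 0.4],
    [0.5, 0.2, -0.3, -0.1],
    [-0.1, -0.3, -0.2, 0.4],
    [0.3, -0.3, 0.1, 0.1],
    [0.2, -0.3, 0.4, 0.2],
    [0.1, 0.2, -0.1, -0.1],
    [-0.4, -0.4, 0.2, 0.3],
]

# One trigram filter: 3 positions x 4 dimensions, plus a bias term.
filt = [[1.0, 0.0, -1.0, 0.5]] * 3
bias = 1.0

def conv1d_single_filter(xs, w, b):
    """Slide the filter down the sentence one trigram at a time: dot
    product of the filter with each 3x4 patch, add the bias, then pass
    the score through a sigmoid nonlinearity."""
    k = len(w)
    out = []
    for i in range(len(xs) - k + 1):  # 7 words -> 5 trigram positions
        score = sum(w[j][d] * xs[i + j][d]
                    for j in range(k) for d in range(len(xs[0])))
        out.append(1.0 / (1.0 + math.exp(-(score + b))))
    return out

values = conv1d_single_filter(words, filt, bias)  # 5 values in (0, 1)
```

With seven words and a width-3 filter, the loop visits five positions, exactly the shrinkage the lecture turns to next.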
[00:09:26] In this example, and in my vision example earlier, we sort of had shrinkage: we started off with seven words, but of course as we slid these trigrams over it, we only had space for five trigrams, and so we ended up with something smaller than our input sentence. Often people want to keep it the same size, and the way you can do that is by having padding. [00:09:49] So if I put a zero padding at each end, now I'm going to get seven trigrams coming out, corresponding to my original seven words, and normally I'll just pad it with zeros like that. You can actually increase the size of things, because if you add padding of two at each end you can then have a wide convolution, and so seven will then go to nine different things. [00:10:17] Okay, so if we only had one filter, things are pretty limiting.
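The arithmetic behind that shrinkage and padding can be checked with a one-line sketch (just the counting rule, nothing from the slides):

```python
def n_positions(n_words, kernel, pad):
    """Number of n-gram positions when a width-`kernel` filter slides over
    `n_words` words with `pad` zero-vectors added at each end."""
    return n_words + 2 * pad - kernel + 1

# 7-word sentence, trigram filter (kernel = 3):
print(n_positions(7, 3, 0))  # 5: shrinkage (narrow convolution)
print(n_positions(7, 3, 1))  # 7: same-size output, one pad at each end
print(n_positions(7, 3, 2))  # 9: wide convolution, two pads at each end
```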
[00:10:30] So commonly, as in the vision case, what we're going to do is define multiple filters, and then we're going to be calculating a value for each of these filters over each of these trigrams, and so then we're getting out a new representation as a vector. Depending on how many filters we have relative to the word dimensionality, we might end up with something that's shorter (as in this example), the same length, or actually longer than what our input was in terms of word vectors. [00:11:00] But commonly, when we do that, we then in some way want to summarize all of these filters, and the most common way of doing that is something called max pooling. Max pooling is something you see quite a bit in neural networks in general, and the way to think of max pooling that I think makes sense is that it does what you want if you really want to run something that's like a feature detector.
[00:11:40] So if you imagine that you learn these functions that will look at word vectors and look for evidence of something particular: maybe this filter looks for whether the person is using "I" language, so it matches the words I, my, we, our, something like that, and maybe this other filter matches speech or thinking verbs, like think, say, said, told, etc. [00:12:14] Each of these is some kind of feature of the text that you might want to detect. Well, if that's your model, then when you slide your feature detector down the piece of text, you want to know: does this match anywhere in this piece of text? Is it somewhere using an "I" word, regardless of whether it's in the first, second, third, or fourth position? [00:12:39] And that's effectively what you're getting out of max pooling.
[00:12:46] A feature counts as firing to the extent that it fires strongly in any position in the text. That's not the only way you can think of doing it. An alternative is to think of your feature detector as measuring some quality of the text, like casualness or learnedness, and then you might think: for overall wanting to know how casual the text is, maybe I want the average of how casual it is in different parts of the text. [00:13:19] So then you can do the alternative of average pooling, and sometimes people do that as well. You can do both: you can work out both an average pool and a max pool and put both of them into the feature representation. In general, for the kind of features people learn in neural networks, if you're just doing one or the other, the result does seem to be that max pooling is the most effective.
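A minimal sketch of the two pooling choices over one filter's column of position scores (the numbers are made up):

```python
# One filter's scores at each n-gram position of a text.
# Max pooling asks "did this feature fire strongly anywhere?";
# average pooling asks "how strong is this quality across the whole text?"
positions = [0.2, 0.9, 0.1, 0.3]

max_pooled = max(positions)
avg_pooled = sum(positions) / len(positions)

print(max_pooled)  # 0.9: the feature fired strongly somewhere
print(avg_pooled)  # 0.375: overall level across the text
```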
[00:13:42] That kind of does-the-feature-fire metaphor tends, in general, to be the best way of thinking about things. Okay, so if you want to do all of this in PyTorch: Conv1d, right. I guess one-dimensional convolutions aren't the most common case, and so you're using Conv1d, and there are all these things you can then specify: the output channels is the number of filters you have, the kernel size is how big the filter is, which for my example was three, and then you can just collapse things with the max pooling. [00:14:23] Okay, there's a space of other things you can also do with convolutional neural networks, which I think are less useful and less used in language cases, but I can mention them quickly. One thing you can do is have a stride.
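What `torch.nn.Conv1d` computes here can be mimicked in plain Python to show the roles of those parameters: `out_channels` is the number of filters, `kernel_size` is the n-gram width. This is a sketch with toy numbers, no padding, stride 1:

```python
def conv1d(seq, filters, biases):
    """Plain-Python version of what torch.nn.Conv1d(in_channels=d,
    out_channels=F, kernel_size=k) computes (no padding, stride 1).
    seq: list of d-dim word vectors; filters: F filters, each k x d."""
    d, k = len(seq[0]), len(filters[0])
    out = []
    for f, b in zip(filters, biases):
        out.append([b + sum(f[j][m] * seq[i + j][m]
                            for j in range(k) for m in range(d))
                    for i in range(len(seq) - k + 1)])
    return out  # shape: (out_channels, number of n-gram positions)

# toy 4-word sentence with 2-dim vectors and two trigram filters
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
feats = conv1d(seq, filters, biases=[0.0, 0.0])
pooled = [max(row) for row in feats]  # max-over-time: one number per filter
print(pooled)  # [2.0, 2.0]
```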
[00:14:53] When we took every trigram ("zero tentative deal", then "tentative deal reached", then "deal reached to"), you could feel like they're overlapping each other a lot, so they've actually got very similar stuff in them, and that would be even more so if we weren't using trigrams but something like 5-grams. So something you can do: the stride is how much you move along. [00:15:11] If you move along two, you'd have one trigram that's "padding tentative deal", then the next one would be "deal reached to", and then the next one would be "to keep government", so that they're overlapping by less as you go through. [00:15:30] Another thing you can do that's sort of stride-like is, rather than doing max pooling over the entire thing, you could do more of a local max pool. You could think: well, I want to have this feature detector for something like use of "I" language.
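The stride idea from a moment ago can be sketched like this, using the lecture's sentence and a `<pad>` token standing in for the zero padding at each end:

```python
def trigram_windows(words, stride):
    """Trigrams covered when sliding a width-3 window with the given
    stride over a sentence padded with one <pad> token at each end."""
    padded = ["<pad>"] + words + ["<pad>"]
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded) - 2, stride)]

words = "tentative deal reached to keep government open".split()
print(len(trigram_windows(words, stride=1)))  # 7 heavily overlapping trigrams
print(trigram_windows(words, stride=2))       # every other start position
```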
[00:15:56] But if it's a big long sentence and there's "I" language at four different points, maybe you should get four points for that, rather than just the one point you're going to get from max pooling. So you could do local max pooling, sensitive to the stride: here I could look at the first two of these and max-pool those two, then the next two and max-pool those, and so on, and you'd end up with this sort of local max pooling as you go along. [00:16:34] Okay, and then one other idea that's sort of related: another way of capturing whether something matches in multiple places is, rather than only keeping the one max in each column, maybe you could do a k-max. So you could keep the two maximum things in a column.
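Both variants can be sketched in a few lines; the scores are made-up numbers for one filter's column:

```python
def local_max_pool(scores, width):
    """Max-pool within consecutive windows of `width` positions,
    instead of one max over the whole sequence."""
    return [max(scores[i:i + width]) for i in range(0, len(scores), width)]

def k_max_pool(scores, k):
    """Keep the k largest values of the column (order not preserved here)."""
    return sorted(scores, reverse=True)[:k]

scores = [0.1, 0.8, 0.3, 0.2, 0.9, 0.4]
print(local_max_pool(scores, 2))  # [0.8, 0.3, 0.9]
print(k_max_pool(scores, 2))      # [0.9, 0.8]
```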
[00:16:58] That might also be a way of seeing whether something is detected in two places or not. Okay, I've got lots of notions here. [00:17:12] Dilation is then the notion that what we'd like to do is form our trigrams not only from adjacent things but from things that are spaced out. So after having done our first layer of convolutional filters that took trigrams, which got us to the top-right part here, we could then do a dilated trigram convolution, which means we're going to take the first, third, and fifth things and combine them in a convolutional filter, and then we'll take the second, fourth, and sixth things and combine them in a convolutional filter. [00:17:58] So we've then got a trigram filter, but it has a bigger range that it can see. That's sometimes used, more commonly in places like speech than in natural language.
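Which positions a dilated trigram filter combines can be sketched with simple index arithmetic:

```python
def dilated_trigram_indices(seq_len, dilation):
    """Index triples a trigram filter combines at each position:
    dilation 1 is the ordinary adjacent trigram; dilation 2 takes the
    first/third/fifth items, then the second/fourth/sixth, and so on."""
    span = 2 * dilation + 1  # distance covered by the dilated trigram
    return [(i, i + dilation, i + 2 * dilation)
            for i in range(seq_len - span + 1)]

print(dilated_trigram_indices(6, 1))  # ordinary trigrams over 6 positions
print(dilated_trigram_indices(6, 2))  # [(0, 2, 4), (1, 3, 5)]
```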
[00:18:19] Okay, so those are the kind of tools we have for calculating things with these convolutions over text, and so next what I want to do is tell you about a couple of pieces of work that made use of convolutions in natural language processing. I guess this is a decade old now, because this is from 2014. [00:18:35] This is the single most famous piece of work that made use of convolutional neural networks for natural language processing, and Yoon Kim is now an assistant professor at MIT. In retrospect it's actually pretty simple, but I guess he got in early with the idea of: okay, maybe we could use convolutions for NLP, and did a kind of clear example of that that worked pretty well, and so this piece of work is very well known. [00:19:16] So this was writing a sentiment classifier: looking at a sentence and deciding whether it's positive or negative.
[00:19:25] And actually, for both of the kinds of models I'm going to talk about today, we're going to use examples that are doing sentiment classification. He also considered other tasks (subjective versus objective language, question classification as to what the questions were about), but the main application was sentiment analysis. [00:19:45] So here's what you're going to be doing. The paper shows things more in his notation, but it's exactly the same as we've just been talking about: you're taking n-grams of word vectors, you're multiplying them by a convolution and calculating new vectors, and in his model it's done for different sizes of n-gram. So he's going to have some convolutional filters that look at bigrams, some at trigrams, and some that look at 4-grams, and then those are just slid across the positions in the sentence.
[00:20:32] Then, having done that, it does max pooling, as we've been talking about, which gives a single number coming out of each filter, and those max-pooled numbers from each filter are then going to be used for classification in a final simple softmax layer that gives the full answers. [00:20:55] There's one other thing that came up in this paper, which is kind of just an interesting general idea to be aware of, and it was something he sort of pioneered, which is the following. It's a very common case (I guess this occurs less with huge pre-trained Transformers, but it held for the classic case of models where you had word vectors and then you were training some neural network model on some supervised data) that there was this pitfall of what happened when you fine-tuned word vectors.
[00:21:45] So the setting is: we've started off with our pre-trained word vectors from GloVe or word2vec or whatever it is, and then we've got a smaller sentiment analysis dataset on which we're going to train a sentiment classifier, and that will involve not only learning the parameters of our sentiment classifier, but also we can backprop into the word vector representations. [00:22:13] And if you do that, it seems like it should be a good idea, because normal word vectors aren't especially tuned to predicting sentiment correctly; they're more tuned to the meaning of words, to what words are about. So it seems like it should help you if you could backprop into the word vectors and change them as you go along. But if you do that, there tends to be a problem.
[00:23:00] The problem is that some words will be in your sentiment training dataset, and when you learn with backprop, those word vectors will move; but some words just won't be in your training data, and they're going to stay exactly where they were in the word vectors, because there's nothing to move them around. [00:23:16] So what tends to happen is: you started off like this, where tedious, dull, and plodding were all close by each other, as having similar meanings and being indicators of something negative. But after you've done your training, tedious and dull have moved over here, as part of backprop, where they're part of negative land, and the classification boundary has moved over here; but plodding wasn't in the training set, so it's just sitting exactly where it was at the start of the process, and now it's being treated as a positive word, which is completely wrong.
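A toy illustration of that pitfall, with made-up one-dimensional "vectors": only words that occur in the supervised training data receive gradient updates, while the absent word keeps its pre-trained position.

```python
# Made-up 1-D stand-ins for pre-trained word vectors.
vectors = {"tedious": [-1.0], "dull": [-1.1], "plodding": [-1.05]}
train_words = {"tedious", "dull"}  # "plodding" is absent from training data

for word in train_words:
    vectors[word][0] += -0.5  # stand-in for a backprop update

# The trained words moved; "plodding" is exactly where pre-training left it.
print(vectors["tedious"][0], vectors["plodding"][0])  # -1.5 -1.05
```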
[00:24:01] And so that tended to have the result that when people trained a language neural network on a small supervised dataset, you got kind of ambivalent results: sometimes doing backprop into the word vectors would help, because you could specialize your word vectors to your task, but sometimes it would hurt you, because you messed up the semantic relations that were captured reasonably well in the initial word vectors. [00:24:29] So the way that Yoon Kim dealt with that was fairly simple: he just doubled his number of channels. He made two copies of each channel, each filter, in his convolutional neural network, and one of them used the fine-tuned word vectors while the other kept the original word vectors, and then he could have the best of both worlds. [00:24:59] Okay, so this picture captures the whole of his network.
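The network just described (n-gram filters of several widths, max-over-time pooling, concatenation into one fixed-size sentence vector, then a softmax classifier) can be sketched in plain Python. All vectors and weights below are made-up illustrative numbers, and the two-channel static/fine-tuned trick is omitted for brevity:

```python
import math

def ngram_scores(vecs, filt):
    """Convolve one k x d n-gram filter over the sentence (no padding)."""
    k, d = len(filt), len(vecs[0])
    return [sum(filt[j][m] * vecs[i + j][m] for j in range(k) for m in range(d))
            for i in range(len(vecs) - k + 1)]

def sentence_features(vecs, filters):
    """Max-pool each filter over time and concatenate: one fixed-size
    feature vector per sentence, regardless of sentence length."""
    return [max(ngram_scores(vecs, f)) for f in filters]

def softmax(z):
    e = [math.exp(v) for v in z]
    return [v / sum(e) for v in e]

# "I like this movie very much" as made-up 2-dimensional word vectors
sent = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.3], [0.5, 0.8], [0.2, 0.6], [0.7, 0.1]]
filters = [[[1.0, 0.0]] * 2,   # a bigram filter
           [[0.0, 1.0]] * 3,   # a trigram filter
           [[0.5, 0.5]] * 4]   # a 4-gram filter
feats = sentence_features(sent, filters)      # one feature per filter
W = [[1.0, 0.0, 0.5], [-1.0, 0.0, -0.5]]      # made-up linear classifier
probs = softmax([sum(w * f for w, f in zip(row, feats)) for row in W])
print(len(feats), len(probs))  # 3 2
```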
[00:25:10] This picture actually comes from a follow-on paper, which produced this nice version of it. So we start off with a sentence, "I like this movie very much", which should be classified positive. We have words and their word vectors, and then you're going to have convolutional filters that are bigram filters, trigram filters, and 4-gram filters, and at each of those sizes you're going to have ones that work on the un-fine-tuned word vectors and ones that work on the fine-tuned word vectors. [00:25:46] And so you're going to take these filters and slide them over the text and get representations. The way he's doing this, the filters are applied without padding, so from the 4-gram filters you're getting smaller vectors coming out, and from the bigram filters you've got bigger vectors coming out. And so then for each of these you're going to max-pool.
[00:26:16] So you're just getting the highest value from each, and then you're getting the highest value from the ones with the fine-tuned word vectors and the ones without, so you're getting one feature out of each filter. You're then concatenating all of those max-pooled outputs together, giving one vector for the entire sentence, which is of fixed size reflecting the number of filters, and then you're just sticking this through a straightforward linear classifier into a softmax that gives you a probability of positive or negative. And that was the entire model. [00:26:58] The interesting thing was that this actually worked pretty well for natural language classification tasks. So this is a big table of results from his paper: there are sentiment datasets like the Stanford Sentiment Treebank (two versions of that) and movie reviews, another sentiment dataset.
[00:27:25] There's a subjectivity classifier, and TREC was the kind of question-type classifier. So, various datasets. [00:27:37] And various people, including us at Stanford (I guess all of these Socher results were ones we were doing at Stanford), had built lots of models on various of these datasets, and his argument was that by using this simple convolutional neural network you could do as well as, and sometimes better than, any of these other models that were being considered at the time for sentiment analysis. [00:28:09] Now, there was at least one way in which maybe that comparison was too generous to the CNN: if you remember back when we were doing dropout, we said dropout is such a good idea. I think dropout came out in 2012, if I'm remembering correctly.
[00:28:39] before dropout appeared on the scene, whereas he was using dropout, and that gave him an advantage. Sort of better experimental technique might have been to redo the other models with dropout, which he didn't. But nevertheless, it sort of shows that you could get strong results using convolutional neural networks with just a very simple architecture. Yeah — so that's one more thing that you can do. And so the thing to think about here is, you know, we have this sort of toolkit of ways that you can do things. We started off with word vectors and bags of vectors, which you could use for simple classification. We talked early on about window models — and window models are sort of like what you get from convolutional neural networks, but more ad hoc. Then we have convolutional neural networks, which are definitely good for classification and very easy to parallelize.
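The convolve → max-pool-over-time → concatenate → softmax pipeline described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation; all names and sizes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_feature(X, W, b):
    """Slide one filter W (width k, dim d) over word vectors X (n, d);
    one ReLU activation per window position."""
    k, d = W.shape
    n = X.shape[0]
    acts = np.array([np.sum(X[i:i + k] * W) + b for i in range(n - k + 1)])
    return np.maximum(acts, 0.0)

def sentence_features(X, filters):
    """Max-pool over time for each filter, then concatenate:
    one fixed-size vector regardless of sentence length."""
    return np.array([conv1d_feature(X, W, b).max() for W, b in filters])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d = 8                               # toy word-vector dimension
X = rng.normal(size=(10, d))        # a 10-word "sentence"
filters = [(rng.normal(size=(k, d)), 0.0) for k in (3, 4, 5) for _ in range(2)]

feats = sentence_features(X, filters)    # length == number of filters
U = rng.normal(size=(2, len(filters)))   # linear classifier into the softmax
probs = softmax(U @ feats)               # probability of positive / negative
print(feats.shape, probs.shape)
```

The key property is that `feats` has one entry per filter, so the classifier input size is fixed by the number of filters, not by the sentence length.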
[00:29:41] Which is good. And we talked about recurrent neural networks, which seem to be cognitively plausible — reading through sentences from left to right — but aren't easy to parallelize. And then we've talked about Transformers, which to some extent are our best model for NLP, you know, and are being used everywhere. And indeed, what's happening now is that things are going in reverse, and people are increasingly using Transformers for vision as well — though there's still, I think, more debate in the vision world between CNNs and Transformers, with some people arguing that both of them have complementary advantages. Okay — a couple of other facts on the side, and then I'll show you one other, bigger, fancier convolutional neural network model for language. So, we talked about, for Transformer models, the use of layer normalization, which sort of keeps the
[00:30:43] size of the numbers in the middle layers of the neural network about the same, by giving zero mean and unit variance. There are slightly different ways that you can do that; for convolutional neural networks, the standard thing to use is batch normalization — and indeed, batch normalization was the thing that was invented first. Layer normalization and batch normalization are sort of doing the same thing, of scaling numbers to give them zero mean and unit variance, but they differ in what dimensions they do their calculations over: layer norm is calculating statistics across the feature dimension, whereas batch norm is normalizing all the elements in the batch for each feature independently. Okay, one other little concept that turns up, which actually sort of connects a bit to Transformers as well.
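The axis difference just described can be made concrete with a small numpy sketch, assuming activations of shape [batch, features] (the learned gain and bias that real implementations add are omitted here):

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    # statistics across the feature dimension, separately for each example
    mu = H.mean(axis=1, keepdims=True)
    var = H.var(axis=1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def batch_norm(H, eps=1e-5):
    # statistics across the batch, for each feature independently
    mu = H.mean(axis=0, keepdims=True)
    var = H.var(axis=0, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

H = np.random.default_rng(1).normal(size=(4, 6))  # [batch, features]
ln, bn = layer_norm(H), batch_norm(H)
# each row of ln has ~zero mean; each column of bn has ~zero mean
print(ln.mean(axis=1), bn.mean(axis=0))
```

Same arithmetic, different `axis` — that is the whole distinction being drawn here.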
[00:31:53] There's this sort of funny thing: all of what I've presented so far were convolutions that are, um, bigram, trigram, four-gram — but there are also size-one convolutions. And at first sight that seems to make no sense at all, because what's the point of doing a size-one convolution? You've just got one thing, and it's staying just one thing. But it actually does make sense, because it corresponds to having a little fully connected layer that's only looking at the representation in one position. So in language terms, it's taking a word vector and putting it through a fully connected neural network to produce a new representation just of that word. And that's sort of what we also have with the fully connected layers in Transformers, right — you've got a fully connected layer that's just at one, well, subword-token position, and calculates a
[00:32:58] new representation for it. And so that allows you to sort of create new representations with actually many fewer parameters than if you're allowing a fully connected layer across the entire sentence. Okay — and so this is then a more recent version of a convolutional neural network, still again used for text classification, but a much more complex one, from Conneau et al. in 2017. And again, this was still at the stage in which LSTM sequence models were dominant in NLP — I guess 2017 is sort of the same year the first Transformer paper came out — and, you know, the motivations were sort of comparing vision and language. At that point in time, convolutional neural network models in vision were already very deep models — people were using things like ResNet models that had 30, 50, 100 layers in them — and that stood in stark contrast to what was happening in the LSTM world.
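The size-one convolution described a moment ago — the same little fully connected layer applied at every position — and its parameter saving can be sketched as follows; the shapes are illustrative, not from any particular model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_in, d_out = 7, 16, 4
X = rng.normal(size=(n, d_in))      # one vector per (subword-token) position
W = rng.normal(size=(d_in, d_out))  # the size-1 "filter bank"
b = np.zeros(d_out)

# size-1 convolution: identical fully connected layer at every position
Y = np.maximum(X @ W + b, 0.0)      # shape (n, d_out)

# parameter count vs. one fully connected layer over the whole flattened sentence
params_conv1 = W.size + b.size           # 16*4 + 4 = 68
params_full = (n * d_in) * (n * d_out)   # 112 * 28 = 3136
print(Y.shape, params_conv1, params_full)
```

Because the weights are shared across positions, the size-one convolution's parameter count is independent of the sentence length, which is the "many fewer parameters" point made above.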
[00:34:18] For sequence models, commonly people were just using two-layer sequence models, and if you were wanting to go further you might be using a three-layer or four-layer sequence model — or occasionally, if you got really, really deep, people had used eight-layer sequence models if they had a lot of data. But essentially, you know, the number of layers was always in a single digit. And then a second thing was, in some sense the vision models were more raw-signal models, because they were operating at the individual pixel level, whereas in NLP the standard was that we were using word-level models — still, in the Transformer model — so it sort of seemed like things were much more grouped before they began. And so the idea of this paper is: well, maybe we could do NLP kind of like it was vision, so
[00:35:23] we'll start with the raw characters as our signal, we're going to put them into a deeper convolutional neural network and use the same kind of architecture we use for vision, and use that for language classification tasks. And so that led to this VDCNN architecture, which is something that looks very like a vision system in design. So what do we have here? At the bottom we have individual characters, and the individual characters get a 16-dimensional representation. Then you've got some fixed size of piece of text that you're classifying, which for them was 1,024. And then at each stage we're going to have convolutional blocks — and these convolutional blocks have a whole bunch of filters, but they're also then going to group stuff together, so that we're sort of starting to collapse into multi-character units. So we're starting
[00:36:46] off, first of all, having, you know, 64 size-three convolutional filters, and so that gives us a representation of 64 times the window size. Then we're going to do that again, and put it through another set of convolutional filters — of size three, and 64 of them — which gets us sort of up to here. And at each point we also have residual connections — which we also saw in Transformers, but were pioneered in the vision space — so that we have a path that things can just go straight through. But then when we get to here, we're going to do local pooling, so each pair of representations here will be pooled together, and at that point we've no longer got the initial length of 1,024 — we've now got a length of 512. So now we're going to be putting it through, again, sort of trigram convolutions, but now we're going to
[00:38:01] have 128 of those channels. We're going to repeat that again, and then we're going to again group with pooling, so now we've got a 256-long sequence, because we've done local pooling of each pair, and we're going to then have 256 filters at each stage. And we go up, and then we do local pooling again, so each unit is now representing an 8-gram of characters, and we're putting trigram filters over those 8-grams — so really, the amount of a sentence that the convolutional filters are seeing at this point is 24 characters; you know, sort of seeing something like six-word sequences or something like that. More convolutional blocks there. Then at the end they do this k-max pooling — so some of the ideas from the beginning of the lecture do show up — so you're then doing k-max pooling and finding the eight highest activations in the sequence.
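The k-max pooling step just mentioned — keep the k largest activations of a channel, rather than only the single maximum — can be sketched like this. A detail worth noting: in the original k-max pooling proposal the kept activations stay in their left-to-right order, which this sketch also does (k = 8 in the architecture above; a smaller toy k here).

```python
import numpy as np

def k_max_pool(acts, k):
    """Keep the k highest activations of a 1-D sequence,
    preserving their original left-to-right order."""
    acts = np.asarray(acts)
    if len(acts) <= k:
        return acts
    idx = np.sort(np.argpartition(acts, -k)[-k:])  # positions of the top k
    return acts[idx]

acts = np.array([0.1, 2.0, -1.0, 3.5, 0.7, 2.2, 0.0])
# the 3 largest activations, in original order: 2.0, 3.5, 2.2
print(k_max_pool(acts, 3))
```

Compared with plain max pooling, this keeps a count of how often (and roughly where) a feature fired, not just whether it fired at all.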
[00:39:12] And that sort of makes sense for something like a text classifier, because you want to count up the amount of evidence, right? If you've got some category like — is this about, I don't know, copper mining — you want to be seeing whether there are a bunch of places in the text that are talking about copper mining. And then, right up the top, they have several fully connected layers — which, again, is very typical of what you find in vision networks, such as something like VGGNet: after you've done a whole bunch of convolutional layers, you just stick it through multiple fully connected layers at the top. And so that's what they're doing as well, and this is their architecture for doing text classification. Okay, I think I talked through that in a lot of detail, so I'll skip this slide. Yeah — so their experiments were done on text classification datasets: various news classification
[00:40:20] datasets, the DBpedia ontology, and then doing sentiment analysis on Yelp reviews and Amazon reviews. And here are the results from their paper. So, you know, they're taking the previously known best published results — which are shown here in table four — and then they're considering whether they can do better by using their architecture. And they used architectures of different depths, in terms of the number of layers: nine layers, 17 and 29 layers. And the result of the paper is, in all cases they got the best results with their deepest network, which was a 29-layer model — which is sort of similar to what people were doing in vision. And then there's some variation as to which was best, using the max pooling or the k-max pooling, but in general it was always the deep model, and it varied a bit according to the dataset.
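The length-and-channel schedule walked through above — each local pooling halves the sequence while the channel count doubles, and the receptive field grows — works out as a bit of arithmetic. The numbers are the ones mentioned in the lecture; treat the sketch as illustrative.

```python
# character sequence of length 1024, 64 channels to start;
# each local pooling halves the length while the channel count doubles
length, channels = 1024, 64
schedule = [(length, channels)]
for _ in range(2):                 # the two poolings described above
    length //= 2
    channels *= 2
    schedule.append((length, channels))
print(schedule)        # [(1024, 64), (512, 128), (256, 256)]

# after a third pooling each unit covers an 8-gram of characters,
# so a size-3 (trigram) filter over 8-grams sees 3 * 8 = 24 characters
receptive_chars = 3 * 8
print(receptive_chars)  # 24
```

This halve-the-resolution, double-the-channels pattern is exactly the convention VDCNN borrows from vision architectures like VGGNet and ResNet.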
[00:41:29] But at least sometimes they were able to produce the best results that were known. So I guess for these text classification ones, previous results were slightly better than their results; but for some of the other ones, like the DBpedia and the Yelp — well, for both the Yelp datasets their results were better than the best known previous results, and for the Amazon ones, one was better, one was worse. But to a first approximation, this meant that they could basically reach the state of the art of a text classification system with something that was just a deep convolutional neural network, starting from the character level, with none of the sort of having learned word vectors in advance or anything like that. And so that was a pretty cool achievement, which showed that you could go a fair way in doing things with just this sort of raw
[00:42:29] character-level convolutional neural network — sort of more like a vision system. Okay, so that's that. And then, for the final piece of the class, I want to tell you about something at the other extreme, which is about tree recursive neural networks. So tree recursive neural networks are a framework that me and students developed at Stanford. I mean, really, when I first got into neural networks in 2010, for about the first five years what me and students worked on was doing these tree recursive neural networks — and so they were sort of the Stanford brand. Ultimately, they didn't prove as successful as other things that came along, but I think they're linguistically interesting, and I think there's a clear idea here which is still an idea that exists — and I think there may be still some things to do with it, which I'll come back to. But the starting point is
[00:43:38] essentially being motivated by the structure of human language. And so most of this slide is sort of filled by a paper from Noam Chomsky and colleagues, discussing their views of the human faculty of language — what it is, who has it, and how did it evolve. And I don't want to dwell on this in too much detail, but essentially, in this paper what they argue is that, you know, the defining property of human language — that's not observed in other things that humans do — is that language has this recursive structure: you have this hierarchical nesting, where the same structure repeats inside itself. So if you have an example like "the person standing next to the man from the company that purchased the firm that you used to work at", what you have is: the whole of this is a noun phrase, headed by "the person", and
[00:44:48] then, after "standing next to", the first square brackets here are another noun phrase, "the man from..."; then inside that prepositional phrase there's another noun phrase, "the company that purchased the firm"; and then "the firm" is another noun phrase that has the relative-clause modifier "that you used to work at". So we have these embedded layers of noun phrases, with the same syntactic structure underneath them. And for the kind of formalisms that we use in linguistics — context-free grammar — it permits this kind of infinite embedding of nesting, which is the same kind of nesting that you get in programming languages, where you can use if statements and nest them as deeply as you want to, because you just have the same repeating recursive structure. Now, of course, human beings can't actually understand infinite recursion, and people don't actually
[00:45:48] produce infinite recursion — you could sort of say, oh, in practice no one's going to go more than eight deep when they're saying a sentence — but in terms of the structure of what the language looks like, it seems like you should be able to do it infinitely deep. And when you actually start looking at the structures of sentences, they do sort of repeat the same structure quite deeply. So this is an example of a Penn Treebank tree — which is sort of the best-known constituency treebank — and here's my random sentence: "analysts said Mr. Stronach wants to resume a more influential role in running the company". And, well, what we end up with, if we have these nested verb phrases: so "running the company" is a verb phrase; "resume a more influential role in running the company" is a bigger verb phrase; "wants to resume a bigger
[00:46:52] role in running the company" is an even bigger verb phrase; and then "said Mr. Stronach wants to resume a more influential role in running the company" is an even bigger verb phrase. So we have, sort of, one, two, three, four verb phrases, all nested inside each other. And so the idea was: well, maybe we should be thinking of sentences as having this kind of tree structure, and computing representations of the meanings of sentences in terms of this tree structure. So we have words that have representations in word vector space, like we saw right at the beginning of the class; but then we're going to have a phrase like "the country of my birth", and the classic linguistic answer — that you find both in linguistic semantics classes and philosophy of language — is that we should construct representations of phrases using the principle of compositionality, which says that the meaning of a phrase or
sentence is determined by the meanings of its words, which are our word vectors, and the rules that combine them. [00:48:07] So maybe we could take the phrase structure tree of a sentence and combine the word vectors together by some means, and then we can construct a representation of the meaning of phrases in a more linguistic way, giving us a vector representation of the meaning of the phrase, which we could also put into our vector space. And we'd hope that a phrase like 'the country of my birth' would appear in the vector space in a similar place to where words representing locations appear. [00:48:40] Okay, so what we want is to be able to start with word vectors and parse up a sentence, and as we parse the sentence we're going to be computing representations for the different phrases of the sentence.
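As a toy illustration of that hope (made-up 3-d vectors and simple averaging as the 'means of combination', purely for illustration, not the model the lecture goes on to describe): a composed vector for a phrase like 'country of birth' should land nearer location-like words than unrelated ones.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-d word vectors, invented for illustration (not real embeddings).
vecs = {
    "country": np.array([0.9, 0.1, 0.0]),
    "birth":   np.array([0.4, 0.6, 0.1]),
    "france":  np.array([0.8, 0.3, 0.0]),
    "walking": np.array([0.0, 0.2, 0.9]),
}

# Simplest possible rule of combination: average the word vectors.
phrase = (vecs["country"] + vecs["birth"]) / 2

# The composed phrase vector lands closer to a location word
# than to an unrelated word in this toy space.
assert cosine(phrase, vecs["france"]) > cosine(phrase, vecs["walking"])
```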
[00:49:03] And so, the difference here; you know, the difference between recursive and recurrent is sort of a fake difference, right, they both come from the same 'recur' root. But rather than having the recursion just happening along a sequence, as in a recurrent neural network, we're going to have the recursion happening up a tree structure, so we can compute representations for linguistically meaningful phrases. [00:49:35] And so what we're going to do with that: the easy case is, if we know the phrase structure tree, we can take the representations of the child nodes and put them into a neural network, which gives us the representation of the parent node. But we'd also like to find the tree structure, and a way we could do that is to get a second thing out of the neural network: a score for how plausible something is as a constituent. Does it make sense to combine these two nodes
together to form a larger constituent? And then we can use that in a parser. [00:50:23] So, formally, the very simplest kind of tree recursive neural network, and the first one we explored: when we have two child vectors, we represent the parent vector by concatenating the two children, multiplying them by a matrix, adding a bias, and putting it through a nonlinearity to get a parent representation p. Then we score whether it's a good constituent by taking another vector of learned parameters, which does a dot product with p, and that gives us a score as to whether this was a good constituent to include in your parse tree. And the same W parameters are used at all nodes of the tree, in the same way that a recurrent neural network keeps using the same parameters.
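In code, that composition-and-scoring step might look like this (a minimal NumPy sketch with a made-up toy dimension and untrained random parameters; in the real model W, b, and u are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension (assumption for illustration)

# Learned parameters, shared across every node of the tree.
W = rng.standard_normal((d, 2 * d)) * 0.1  # composition matrix
b = np.zeros(d)                            # bias
u = rng.standard_normal(d)                 # scoring vector

def compose(c1, c2):
    """Parent vector: nonlinearity of W times the concatenated children, plus bias."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def score(p):
    """Plausibility of p as a constituent: dot product with the learned vector u."""
    return u @ p

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
p = compose(c1, c2)
assert p.shape == (d,) and np.all(np.abs(p) <= 1.0)  # tanh keeps values in [-1, 1]
```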
[00:51:22] Okay, so if we had that, we could build a greedy parser, because what we could do is start with all the word vectors, take every pair of adjacent words, put it through this system, calculate what the representation of that pair would be as a constituent, and then get a score as to whether it seemed a good constituent or not. Then we could greedily decide: this is the best constituent, 'the cat'. Since we're doing a greedy parse, we commit to that, and then, well, we still know the possibilities of combining other pairs of words, and we can additionally score how good 'the cat' combined with 'sat' is, so that we're producing binary parse structures. Now the best pair to combine greedily is 'the mat', so we combine those and commit to that; we score combining 'on' with 'the mat', and now that seems the best thing, so we commit to that; and we just keep going on up, and we produce the binary parse of the sentence.
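The greedy procedure just described can be sketched as follows (untrained random parameters, so the resulting tree is arbitrary; with learned parameters the best-scoring merges would track real constituents):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                      # toy embedding dimension
W = rng.standard_normal((d, 2 * d)) * 0.1  # shared composition matrix
u = rng.standard_normal(d)                 # shared scoring vector

def compose(c1, c2):
    # Parent representation from the two child vectors.
    return np.tanh(W @ np.concatenate([c1, c2]))

def greedy_parse(words, vecs):
    """Repeatedly merge the adjacent pair whose composed vector scores highest."""
    spans = list(zip(words, vecs))  # (subtree, vector) pairs
    while len(spans) > 1:
        # Score every adjacent pair as a candidate constituent.
        candidates = []
        for i in range(len(spans) - 1):
            p = compose(spans[i][1], spans[i + 1][1])
            candidates.append((float(u @ p), i, p))
        _, i, p = max(candidates)  # commit to the best-scoring merge
        spans[i:i + 2] = [((spans[i][0], spans[i + 1][0]), p)]
    return spans[0][0]  # nested tuples form the binary parse tree

words = ["the", "cat", "sat", "on", "the", "mat"]
vecs = [rng.standard_normal(d) for _ in words]  # one random vector per position
tree = greedy_parse(words, vecs)
```

Each merge both builds a parent vector and consumes a score, mirroring the two outputs of the network described above.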
[00:52:40] And this gives us our sentence representation, which is like that. Okay, and so that gives us our simple tree RNN. [00:52:46] Back in 2011 we got some pretty decent results showing that you could use this as a sentence parser that worked pretty well, but beyond that, the representations we calculated for sentences and phrases were good enough that you could use them for tasks like sentence classification and sentiment analysis, and it worked reasonably well. [00:53:21] It only worked reasonably well, because if you start thinking about it further, there are sort of strong limitations to having this single W matrix that's used at all points to combine things: if you have that architecture, you can't have different forms of interaction between the different words, you're just uniformly computing things, and that
sort of stands in distinction to the fact that different kinds of things in natural language seem kind of different. You have different properties with verbs and their objects versus an adjective modifying a noun, just in terms of what the roles of the different words are. [00:54:06] So we started to see limitations of this architecture, and in the following years we started exploring other ways to build tree recursive neural networks which had more flexibility in how things were combined. I'm not going to show you all the details of all of that, but I will show you one more model that we used for building tree recursive neural networks, which was used in some of our sentiment analysis work, called the recursive neural tensor network. [00:54:44] It wasn't actually the final version that we did; after that we started taking LSTM ideas
and extending those to the tree-structured case, and we worked on tree LSTMs, but I'm not going to show that this year. [00:55:00] The idea of recursive neural tensor networks is that when pairs of words or phrases combine together, in linguistic semantics terms, depending on the pair of words, they modify each other in different ways. So if you have an adjective and a noun, like 'a red ball', red is giving attributes of the noun, whereas if you have something like a verb and its object, like 'kick the ball', you've got a very different role for the object on the right-hand side versus 'the red ball'; it's sort of the opposite way around. [00:55:38] So we want more flexibility in the way we calculate meanings of phrases depending on what's in them, and the way we came up with doing that is what we call this neural tensor layer. And so the idea in
the neural tensor layer is that we have the representations of the child words or phrases, and rather than directly concatenating them and putting them through a linear transformation like a regular neural network layer, instead we could learn in-between matrices; and if we put several of those together, we're getting a three-dimensional tensor. [00:56:34] We can multiply a vector by a tensor times a vector, getting a value out for each slice of the tensor, so we end up with multiple such values. [00:56:57] Okay, and the place that we applied this model is the task of sentiment analysis, so let me tell you a little bit more about what we did here, and this is in fact going backwards to the Stanford Sentiment Treebank that was already used in the Yoon Kim work.
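The vector-times-tensor-times-vector composition described above can be sketched like so (toy dimensions and untrained random parameters; the einsum computes the bilinear term a^T V[k] a for each output unit k, added to an ordinary W a + b term):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3            # toy dimension (assumption for illustration)
a_dim = 2 * d    # the two children stacked

# Standard composition parameters plus a third-order tensor V:
# d slices, each a (2d x 2d) matrix, giving one bilinear term per output unit.
W = rng.standard_normal((d, a_dim)) * 0.1
b = np.zeros(d)
V = rng.standard_normal((d, a_dim, a_dim)) * 0.1

def rntn_compose(c1, c2):
    a = np.concatenate([c1, c2])                 # stacked children, shape (2d,)
    bilinear = np.einsum("i,kij,j->k", a, V, a)  # a^T V[k] a for each slice k
    return np.tanh(bilinear + W @ a + b)

p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))
assert p.shape == (d,)
```

The bilinear term is what lets one child's vector directly modulate how the other child is transformed, rather than both being combined only additively.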
[00:57:15] So the goal of sentiment analysis is to see whether a piece of text is positive, negative, or neutral. A lot of the time, doing sentiment analysis is pretty easy. In the 2010s, and probably even today, quite a few people's sentiment analysis systems are essentially just keyword matching: if you see 'great', 'marvelous', 'wonderful', that's positive sentiment; if you see something like 'poor' or 'bad', negative sentiment. So lots of the time you can effectively do a kind of dictionary matching and get pretty good sentiment, especially on longer documents. [00:57:58] But on the other hand, people use language in lots of interesting ways, and it's not always that easy. If you look at something like movie reviews, such as the snippets you get on Rotten Tomatoes, you get snippets like this: 'with this cast and this subject matter the movie should have been funnier and more
entertaining'. If you just think of it as, okay, we're doing dictionary matching, there's the word 'entertaining', that's definitely positive, and 'funnier', that's positive; so there are two positive words, so this should be a positive review. But of course it's not a positive review, this is a negative review, because it's saying, well, I'm just reading it out again: with this cast and subject matter the movie should have been funnier and more entertaining. [00:58:51] So the compositional structure of human language goes together to mean that, because they're buried under 'should have been', the funniness and entertainment are actually lacking, and so it's a negative review. These were the kind of examples we were interested in, asking: could we actually understand the structure of sentences more and do a better job at sentiment analysis? And so
up until this time, people just had pieces of text and a classification judgment of positive or negative. So we decided we were going to do more than that and come up with the Stanford Sentiment Treebank, where what we did was parse up a whole lot of sentences, almost 12,000 of them, and then put sentiment judgments on every linguistic phrase of the sentence. [00:59:56] So for something like this example: 'with this cast' is a phrase with no sentiment, so that would just be neutral; 'entertaining' is a one-word phrase, its sentiment is positive; 'funnier and more entertaining', that's a phrase, very positive. But then by the time we're embedded under 'should have been funnier and more entertaining', that's a bigger phrase, and its sentiment is now negative; and 'the movie should have been funnier and more entertaining', that's an even bigger
phrase, and it's negative. [01:00:36] So we were parsing up trees like that. These examples are very small, I'll show you bigger examples later, but you can see that in the trees there are blue nodes and orange nodes, corresponding to positive and negative sentiment units at the different sizes. [01:00:56] The interesting thing is that this gave us a richer annotated data set, because it's not only whole sentences or whole articles that were annotated for sentiment; we had annotations for the different phrases. And simply the fact that you were annotating phrases meant that you could learn more from the examples: even if you're using something very simple like a Naive Bayes classifier, because there are annotations on words and smaller phrases, you could learn a bit more about which were positive and which were negative.
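The payoff of phrase-level labels can be seen in a tiny sketch (a hypothetical Node structure and made-up labels, not the treebank's actual file format): every node of a labeled tree, not just the root, yields a training pair.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: int                  # 0 = very negative ... 4 = very positive
    word: Optional[str] = None  # set on leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def phrase_text(n):
    # The phrase covered by a node is just its leaves, left to right.
    return n.word if n.word is not None else phrase_text(n.left) + " " + phrase_text(n.right)

def training_examples(n):
    """Every node, not just the root, contributes a (phrase, label) pair."""
    out = [(phrase_text(n), n.label)]
    if n.word is None:
        out += training_examples(n.left) + training_examples(n.right)
    return out

# 'not' alone reads negative, 'dull' is negative, but the combined phrase
# flips positive; labels here are illustrative.
tree = Node(3, left=Node(1, word="not"), right=Node(1, word="dull"))
examples = training_examples(tree)
# three training pairs from one two-word sentence, instead of one root label
```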
[01:01:37] And so that was the first result that we could show people: a baseline method of a bigram Naive Bayes classifier, which is a very common sentiment classifier. If you just trained it with sentence labels, you got 79% on this data set; if you trained it using every node of the treebank, you got 83%, so you got a 4% lift, and that was kind of good. [01:02:05] These other two lines show two of our early tree RNNs, and the negative part of the result is that they weren't really better than a bigram Naive Bayes classifier. They were better than a unigram Naive Bayes classifier, but a lot of the extra information that you want to capture for sentiment analysis you can get from bigrams, because bigrams can already tell you 'not good', 'somewhat interesting', and things like that. [01:02:38] But then the other hope was to have a more powerful model.
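A minimal sketch of why bigram features already capture much of this (a toy Naive Bayes with unigram-plus-bigram features and made-up training snippets, not the lecture's actual baseline): the bigram 'not_good' survives as a single feature, so plain Naive Bayes can learn it is negative even though 'good' alone is positive.

```python
from collections import Counter
import math

def features(text):
    # Unigrams plus adjacent-word bigrams.
    toks = text.lower().split()
    return toks + [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

# Tiny made-up training set: label 1 = positive, 0 = negative.
train = [("a good movie", 1), ("good fun", 1),
         ("not good at all", 0), ("a bad movie", 0)]

counts = {0: Counter(), 1: Counter()}
for text, y in train:
    counts[y].update(features(text))

def log_prob(text, y, alpha=1.0):
    # Multinomial Naive Bayes log-likelihood with add-alpha smoothing.
    total = sum(counts[y].values())
    vocab = len(set(counts[0]) | set(counts[1]))
    return sum(math.log((counts[y][f] + alpha) / (total + alpha * vocab))
               for f in features(text))

def predict(text):
    return max((0, 1), key=lambda y: log_prob(text, y))

# The bigram feature lets the model read "not good" as negative.
assert predict("not good") == 0
assert predict("good movie") == 1
```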
recursive [01:02:46] then led into use of this recursive newal tensor Network which allowed sort [01:02:49] newal tensor Network which allowed sort of the mediated multiplicative [01:02:51] of the mediated multiplicative interactions between word or phrase [01:02:54] interactions between word or phrase vectors [01:02:57] vectors um and so we built that and so then here [01:03:01] um and so we built that and so then here are the results of that model that's [01:03:03] are the results of that model that's shown in red um so by having our [01:03:07] shown in red um so by having our recursive newal tensor Network we were [01:03:10] recursive newal tensor Network we were able to build um a somewhat better newal [01:03:14] able to build um a somewhat better newal Network that performed at least [01:03:18] Network that performed at least reasonably better than a byr naive Bas [01:03:22] reasonably better than a byr naive Bas model rate that we were getting sort of [01:03:24] model rate that we were getting sort of about 22% better than a by byr NA Bas [01:03:28] about 22% better than a by byr NA Bas model so that was progress but I think [01:03:31] model so that was progress but I think perhaps the more interesting thing isn't [01:03:33] perhaps the more interesting thing isn't sort of the aggregate results but the [01:03:36] sort of the aggregate results but the fact that because we were building up [01:03:39] fact that because we were building up this [01:03:40] this model the computed [01:03:43] model the computed representations over a constituency tree [01:03:47] representations over a constituency tree that it actually made judgments of [01:03:49] that it actually made judgments of different parts of sentences and how [01:03:52] different parts of sentences and how they combined so um here's the movie [01:03:55] they combined so um here's the movie review sentence there are slow and rep [01:03:58] review sentence there are slow and rep repetitive Parts but it has 
just enough spice to keep it interesting'. I hope you'll agree with the judgment that overall that's a positive statement about the movie. [01:04:07] And the recursive neural tensor network builds the tree structure over this sentence, and it says, you know, 'slow and repetitive', that's negative; 'there are slow and repetitive parts', it's all negative over here. But for the part over to the right, 'interesting' and 'spice' are both positive; 'spice to keep it interesting', that's positive; 'it has just enough spice to keep it interesting', positive. And it correctly predicts that when you put these two halves of the sentence together, the overall judgment is that this remains a positive review, and it gives a positive judgment overall. So that was kind of cool. [01:04:54] And in particular, the fact that we were building these phrase judgments meant that it seemed like we could
actually do a better job of sentence understanding, in the way that any linguist doing linguistic semantics would like to see sentence understanding. [01:05:14] So one of the things that neural networks looking at language have often been faulted for, and are still faulted for to this day with Transformer models, is that you often find that neural network models just don't pay attention to negation: you can compare the sentence 'a lot of students are studying for their final exams' versus 'a lot of students aren't studying for their final exams', and the negation just gets lost; it doesn't produce the differences in representation and meaning that you'd like it to. [01:05:59] So somewhat interestingly, with this model it seemed like, because we were modeling the recursive
building up of sentence structure, we actually could do interesting things with modeling negation. [01:06:19] So in particular, the result that you'd like to get: if you have something like 'it's just incredibly dull', dull is a very negative word, and incredible is a positive word by itself, but when you're saying 'incredibly dull' it's definitely still negative. And our recursive neural tensor network correctly models that 'it's just incredibly dull' is very negative, despite incredible being a sort of positive word. [01:06:56] Now, actually, in this model there was five-way classification: very negative, somewhat negative, neutral, somewhat positive, very positive. So there's some bouncing around as to whether it's giving the classification very negative versus somewhat negative; I can't really explain why in the
middle it goes to [01:07:18] somewhat negative and then goes back to very negative, but those are the results that came out of the network. At any rate, it all stays negative: the fact that "incredible", or "incredibly", by itself is a positive word, when it's seen in the modification of "dull", that keeps it negative. But on the other hand, if you put a negation in here, "it's definitely not dull", well, then what happens? Now, interestingly, the word "not" by itself is a negative word: if you just do the raw statistics of it, "not" occurs much more often in negative-sentiment sentences than it does in positive-sentiment sentences. So, you know, if you want to be a more positive person, use negation less. So "not" by itself is negative, but if you then combine it together, "not dull", or in this case "definitely not dull", well, in "not dull" you have two negations, so that they
cancel each other out [01:08:27] and you get something that's positive, and so "it's definitely not dull" comes out as a positive sentence. And so the interesting result here is what you see if you compare what happens between these cases. If you have negated positive sentences, you know, "it's definitely not good", various models can model that correctly, because "not" is a negative word and so it weakens the positivity of the positive word; putting a "not" in front of a positive word, into a positive sentence, makes it less positive. Even a naive Bayes model can do that, because "not" by itself is seen as a negative word. But the hard case is what happens if you negate a negative sentence. Well, the result that you should get is that it becomes more positive, and neither a bigram naive Bayes model nor our earlier attempts at recursive models can capture
that, whereas this tree-recursive network structure was able [01:09:42] to correctly capture this sort of semantic modification structure and say, hey, that's made the sentence much more positive. So that was a cool result, and to some extent, you know, this result I think still isn't captured as well by any of the current Transformer models, even though they have many other advantages and are much better than a tree-recursive neural network. So, yeah, this is basically the end; just to say a couple of final remarks about these tree-recursive neural networks: you know, the reason that they became uncompetitive is that they just didn't allow the kind of associations and information flow that you have in a Transformer. Right, these models had a strictly context-free backbone, and the only information flow was tree-structured, following the
context-free backbone, [01:10:51] whereas in the Transformer you've got this attention function where at every position you're looking at every other position, and so you can have much more general information flow. And in general that is just good, and Transformers are much more powerful. But, you know, on the other hand, to the extent that you actually want to model the semantics of human language carefully, sort of what modifies what, and how negation or quantifiers in a sentence behave, in some sense these models were more right. And so one of the things I'm still kind of interested in is: are there any opportunities to combine together some of the benefits of both of these ways of thinking, and have something that's a bit more tree-structured while still more flexible, like a Transformer? Okay, that's it for today. Thanks a lot.
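The tree-recursive composition idea discussed above can be sketched in a few lines. This is a minimal illustration of the general idea only, not the actual recursive neural tensor network from the lecture (which adds a tensor interaction term and trained weights); the toy vocabulary, vector dimensions, and random weights here are invented for the example:

```python
import numpy as np

# Minimal sketch of tree-recursive sentiment composition: each word has a
# vector, a parent node's vector is composed from its two children, and
# every node can be classified into one of five sentiment classes.
# All weights below are random placeholders, not learned parameters.
rng = np.random.default_rng(0)
d, n_classes = 8, 5  # classes: very negative .. very positive

# Hypothetical toy vocabulary; in a real model these vectors are learned.
vocab = {w: rng.normal(size=d)
         for w in ["it's", "definitely", "not", "incredibly", "dull"]}

W = rng.normal(size=(d, 2 * d)) * 0.1       # composition weights
Wc = rng.normal(size=(n_classes, d)) * 0.1  # node classifier weights

def compose(left, right):
    """Parent vector from two child vectors (tanh of a linear map)."""
    return np.tanh(W @ np.concatenate([left, right]))

def classify(v):
    """Softmax over the five sentiment classes for one node."""
    z = Wc @ v
    e = np.exp(z - z.max())
    return e / e.sum()

# Compose following the parse: (definitely (not (incredibly dull)))
inc_dull = compose(vocab["incredibly"], vocab["dull"])
not_inc_dull = compose(vocab["not"], inc_dull)
phrase = compose(vocab["definitely"], not_inc_dull)

probs = classify(phrase)  # a distribution over the five classes
```

With trained weights, this node-level classifier is what lets such a model assign sentiment to every phrase in the parse, so a learned "not" can flip the class of the subtree it attaches to.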
================================================================================ LECTURE 018 ================================================================================
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 18 - NLP, Linguistics, Philosophy
Source: https://www.youtube.com/watch?v=NxH0Y78xcF4
---
Transcript
[00:00:05] Okay, hi everyone, I'll get started. The last class! Okay, yeah, well, welcome, congratulations, and thank you for making it to the last real lecture of CS224N. Yeah, so this is the plan for today. The lecture's titled "NLP, Linguistics, and Philosophy", which I took as meaning that I could talk about anything I wanted to, and so that is what I'm going to do. So this is what we're going to go through: talk a bit about the major ideas of CS224N and open problems; some of the more foundational questions of where we are with LLMs; symbolic versus neural systems; meaning, and linguistics and NLP; and then I'll close with some slides on the future risks of AI in the world. Okay, so here is an attempt to sort of
lay out the most major things that we [00:01:05] looked at in CS224N. We started with word vectors; then we developed the idea of neural NLP systems; we expanded from a simple feed-forward network into doing sequence models, language models, RNNs, LSTMs; and then we introduced this powerful new model that's been very influential, the Transformer. And then we built from there to the kind of, well, it's not exactly an architecture, but the model that's been built up in recent years to produce high-performance NLP systems, where we're first doing pre-training and then a post-training phase of various techniques that we talked about, to produce these general foundation models that understand language so well. And then we went on from there and talked about various particular topics like benchmarking and reasoning. So a few of the major ideas that we looked at were: this idea that you could
get a [00:02:04] long way by having dense representations: those are our hidden representations in neural networks. And then looking at distributional semantics, representing words by their context, Firth's slogan of "you shall know a word by the company it keeps", and I'll come back to that a bit later when talking about ideas of meaning. But you know, that's essentially been the idea that has driven most of the successful ideas of modern NLP, whether it's the earlier statistical NLP phase or the more modern neural NLP phase. And in this world we start instantiating that as these models of word vectors, but the same contextual idea is then used in all the models up through Transformers. We looked at both the challenges and opportunities of training large deep neural networks, and how gradually people developed ideas and tricks, such as having residual connections, which made
it much more possible and stable to do [00:03:07] successfully, which took us from a place where a lot of this seemed black magic that was hard to get right, to people being able to very reliably train high-performance Transformer models. We talked about sequence models, what's good about them and some of their problems, and how those problems have been addressed in large measure by adopting this different architecture of Transformers, which gives a form of parallelization. And then we moved into the modern form of pre-training by language modeling, where language modeling seems a simple thing, predicting words in context, but it emerges as what we think of as a universal pre-training task: all kinds of both linguistic and world knowledge help you to do this task of predicting words better. And so this has ended up as just a general method to produce the kind of
powerful, knowledgeable models that we [00:04:09] have today. And up until now there's been this amazing property that we see, this empirical fact that we seem to just get extremely linear improvements in performance as we continue to scale data and compute and model size up by orders of magnitude. That doesn't mean that all problems in NLP are solved; there are lots of things that people still work on and see opportunities to try and make things better, and a few of these are mentioned on the next few slides. So there's a real question of how much these models are good at actually learning to be able to do things generally, rather than just being very good at memorization: a lot of the benefit of what we're getting from these large pre-trained language models is that they've seen a huge amount of stuff
and therefore they know everything: [00:05:19] they've seen every pattern before, and they know how to use things. I've occasionally used the analogy that large language models are sort of like a talking encyclopedia, that they're really in many ways more like a huge knowledge store than necessarily something that is intelligent in the sense of being able to work out how to solve new problems and generalize as human beings do. A kind of interesting fact, actually, is that in some ways Transformer models are actually worse at generalizing than the older LSTMs that preceded them. So here's just one little graph I'm not going to spend a lot of time on, but this was looking at data that's being generated by a finite automaton, and then trying to learn it from a limited amount of data with either an LSTM or a Transformer. And the observation is that, you know, at the
scales that they're working, [00:06:24] even having seen quite limited exemplification, the LSTM is basically at the ceiling of this entire graph, it's just at the one line, because it generalizes in good ways because of its LSTM architecture, whereas the Transformer needs to see a ton more data before it actually learns the patterns well. And so, if we think of one of the prime attributes of human intelligence, it's actually that we're amazing at figuring out and learning things from very limited exposure. Right? You know, there's something that you don't know how to do, and a friend shows you once what you do to make it work, and by and large, you know, you'll improve a few times with practice, but you can learn effectively new skills from these kinds of single-shot examples. And that's not always what we seem to be seeing in our models. There's a lot of interest in what's
going on inside neural networks: [00:07:31] a lot of the time, neural networks still appear as black boxes, where we have no real idea of how they're doing what they're doing, and, as perhaps for your final projects, the main thing you're doing is measuring the final performance number and seeing if it goes up or not. So there's a lot of interest in better understanding: what do they learn, how did they learn it, why do they succeed and fail? And a lot of that work has started to look more closely into what's happening inside neural network computations. There is some work of that sort that actually goes back quite a fair way. So here's an old blog post by Andrej Karpathy, while he was a grad student here, in 2016, and he was looking at LSTMs and how they learn, and he found that one of the neurons in an LSTM cell was effectively measuring position along a line of text, and as the line of
[00:08:33] text got long, its value started to change, because the model was learning that there was sort of a line length to this text and that the line was likely to be ending at that point. And in recent times, there has started to be, with Transformers as well, a lot of work looking at mechanistic interpretability or causal abstraction, trying to understand the internals of models. A problem that's far from solved, and in many respects probably unsolvable, is the multilingual question of dealing with all the other languages of the world. You do have to keep in your head that whatever you see for English, it's worse for every other language in what they're getting out of modern language models. Now, you know, there is a good news story here; I don't want to claim that everything is terrible. So in this graph, which is kind of small, the blue line was the
performance of GPT-3.5 in English, and then [00:09:39] all of the green bars are then the performance of GPT-4. And so, you know, there's a genuine good news story here, which is: look, not just for English but for a lot of other languages, for Greek, Latvian, Arabic, Turkish, all of them in GPT-4 are better than English was in GPT-3.5. So, you know, that's the good news argument: that building these models big is in some sense raising all boats. But, you know, these are still all huge languages, and things are starting to drop off at the bottom of this table, for languages where the performance is worse than English in GPT-3.5. But even those are languages for which much less written data is available yet which are still large languages. So the three at the bottom are actually all Indian languages: they're
Punjabi, Marathi, and Telugu, [00:10:48] which are languages that are each spoken by millions of people; they're not small languages. So the real question is what happens when you actually get to the low-resource languages. The vast majority of languages around the world don't have millions of speakers; they vary from having hundreds of speakers to hundreds of thousands of speakers, and there are thousands of such languages. A lot of those languages are primarily oral and have very limited amounts of written text. Now, many of those languages are likely to go extinct in the coming decades, but many of those language communities would like to preserve their languages, and it's very unclear how the kind of language technologies that we've been talking about in the later parts of the course can be extended to those languages
because there just isn't [00:11:46] sufficient data to build the kind of models that we've been looking at. So, I imagine you've gotten some idea in this course of how evaluation is a huge part of what we do: effectively, a lot of the way that progress is being driven is by defining evaluations of what models should be able to achieve, and then people working to measure systems and improve systems so they do better on what we see as good language understanding or other properties. One of the concerns that many people have about what's happened with the large recent closed models from large companies is a concern that all of the benchmarks are being sullied and not to be trusted. So here's one example that comes from a tweet by Horace He, and he's noting: I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces,
one of the coding benchmarks. [00:12:55] Of the easiest problems on Codeforces, it solved 10 out of 10 pre-2021 problems but zero out of 10 recent problems; this strongly points to contamination. And the worry is that, every time you're seeing these fantastic results of how well the latest, best language model is performing, at this point so much data is on the web that gets included in the pre-training data for these large language models that essentially they're memorizing at least a good share of the questions that are appearing in these challenges. So they're not actually solving them in a fair way as an independent test set at all; they're just memorizing them. And so there are issues then as to, you know, what kind of thoroughly hidden test sets we can have, or dynamic evaluation mechanisms, so we can actually have benchmark integrity. Another huge area, that a number
[00:13:54] Another huge area, one that a number of us are involved in at Stanford and elsewhere, is making NLP work in different technical domains. Domains including biomedical or clinical medical NLP have a lot of differences of vocabulary and usage. They have a lot of potential good uses, but they also have a lot of potential risk of doing harm if the language understanding is incomplete. I myself have been more involved in legal NLP, working with other people at the RegLab, with Dan Ho, in building foundation models for law. There are all kinds of ways, again, in which this kind of technology could be really useful. The biggest problem in most countries (it's bad in the United States, but it's way worse in a place like India) is that most people can't get access to the kind of legal help they need to solve their problems, because of the cost of it and the lack of trained lawyers. So if more could be done to help people via NLP tools, in principle that would be great; but in practice the tools still don't have good enough language understanding. In the RegLab there's a just-completed study, out at the moment, looking at legal NLP systems, and we were finding that the hallucination rate, the rate at which there was made-up stuff in their legal answers, was effectively one question in six, which isn't a very good accuracy rate if you're someone who wants to rely on these systems for legal advice.

[00:15:43] There are also lots of things to work out dealing with the social and cultural aspects of NLP. NLP systems remain very biased against various cultures and religions. They have certain social norms, you could say, that they pick up from somewhere, but those social norms are very biased against certain groups. And, related to the small languages that I mentioned before, there are lots of issues with underrepresented groups not having the kind of NLP that they'd like to have.

[00:16:22] Okay, so that's the summary of that bit. For the next bit I thought I'd give one more piece of perspective on where we are with the best language models, like GPT-4. I think it's a really interesting moment, because on the one hand the performance of these models is just amazing. Even as someone who works in NLP and has worked in it for many, many years, I can tell a sort of story that we do this training to predict the next word, conditioning on a lot of text, and the model comes to know about things; but in some sense these things still seem like magic. It's just kind of hard to believe how this could possibly work.

[00:17:20] So in this example I asked ChatGPT-4o (I did this just this morning) to write a sonnet explaining the Transformer neural net architecture, in which every line begins with the letter T. It still, frankly, blows my mind, and I don't feel I can really explain, even to myself in a way that's convincing, how this large Transformer is able to take all its pre-training text and, reading that instruction as a next-token prediction machine, successfully produce something that is a sonnet in which every line begins with the letter T. I hope you remember from your high school English classes that sonnets are meant to have ABAB rhyming. It's a little bit imperfect here ("net", "set"; some people will complain about that, but it's basically a rhyme), and then "behold", "unfold" and "grasp", "clasp", so it's basically getting the rhyming right. The lines are meant to be in iambic pentameter, and it has basically produced lines in iambic pentameter: "To delve into the world of neural nets", you know, iambic pentameter. Yeah, it's incredible.

[00:18:44] So here's the sonnet it came up with: "To delve into the world of neural nets, / Transformers rise, a marvel to behold; / Through layers deep, the network's path is set, / To learn from data, patterns to unfold. / The tokens feed into attention's grasp, / To weigh their import, context to align; / The queries, keys, and values in a clasp, / To process sequences with power fine. / Through heads of many, multifaceted sight, / To parallelize, capturing the span; / Transforming each, a matrix math delight, / To synthesize with elegance and plan. / The model's might in every task it shows, / To turn raw data into wisdom's prose."
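The "next-token prediction machine" framing above can be made concrete with a toy sketch: generation is nothing more than repeatedly choosing a continuation for the current context and appending it. The bigram table below is an invented stand-in for a real Transformer, purely for illustration.

```python
# Toy bigram "model": maps a token to a single assumed successor.
TOY_BIGRAMS = {"to": "delve", "delve": "into", "into": "the", "the": "world"}

def generate(prompt, steps):
    """Greedy next-token loop: look up a continuation, append, repeat."""
    tokens = prompt.split()
    for _ in range(steps):
        nxt = TOY_BIGRAMS.get(tokens[-1])
        if nxt is None:  # no known continuation: stop generating
            break
        tokens.append(nxt)
    return " ".join(tokens)

line = generate("to", 4)  # grows the prompt one token at a time
```

A real model replaces the lookup table with a probability distribution over the whole vocabulary, but the outer loop, conditioning on everything generated so far, is the same.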
[00:19:31] Now, you could object that I'm not sure this exactly explained the Transformer neural net architecture; it's a little bit abstract, I'll give it that. But in another sense it did, in one place or another, evoke quite a bit of stuff about Transformers, with queries, keys, and values, multi-headed attention, parallelization with matrix math, and whatever else. It still kind of blows my mind how well that works. And indeed, as natural language understanding and sort of world-understanding devices, these models have clearly crossed the threshold at which they're very usable in many contexts. There have now started to be some fairly good studies of how much value people can get out of using LLMs like GPT-4. This study, by Dell'Acqua and a whole lot of colleagues including Ethan Mollick, took a
bunch of consultants from the Boston Consulting Group. And you know what that's like: that means 23-year-olds graduating from universities like this one, but more on the East Coast, who become Boston consultants. Not exactly dummies. In this study, a controlled task setting, there were actually three groups, but the big contrast is that two of the groups were using GPT-4 to do consulting tasks and one of the groups wasn't. The difference between the two that were is that one of them was given more training on how to use GPT-4, but that didn't seem to make much of a difference. Their result was that the groups using GPT-4 completed 12% more tasks on average, did the tasks 25% more quickly, and produced results judged 40% higher in quality than those not using AI. I think that's a pretty stunning success: GPT-4 and similar LLMs are good enough to actually help people get real work done, with whatever asterisks you want to put on the quality of management-consultant work in various instances. An interesting result is that using these LLMs seems to be a big leveler, and you see exactly the same thing for people using coding LLMs: they're a huge assistance for people whose own skills are weaker, and much less of an assistance for people whose own skills are strong.

[00:22:43] Okay, so that's the good-news story. But if, on the other hand, you'd like more of a good-news story for human beings, here's a study that goes in the other direction: can GPT-4 write fiction that matches the quality of New Yorker fiction writers? And the result of that study was: not even close.
[00:23:08] GPT-4 was measured as 3 to 10 times worse at creative writing than a New Yorker fiction writer. So there's still hope for human beings; hang in there. And I think that's the dual-screen picture we have at the moment: in some ways these things are great and useful, and in other ways they're not so great. That's something we're still going to see playing out in the coming years.

[00:23:44] Living in Silicon Valley, we see a lot of the positive hype, so if you want to see a little of the negative on the other side: late last year there was a piece in the Financial Times, titled "Generative AI: highly intelligent", and I won't read all of it, but basically they wanted to express considerable skepticism about the current AI boom: "Investors should keep their heads. Expectations for generative AI are running way ahead of the limitations that apply to it. As investment in generative AI grows, so does pressure to create new use cases. By 2027, IDC thinks enterprise spending on generative AI will reach $143 billion, up from $16 billion this year", so ten times up. "OpenAI hopes for more funding to pursue human-like AI. It is worth remembering, when examining Altman's plan for superintelligence, that models predict; they do not comprehend. That limitation casts doubt on AI achieving even human-like intelligence." And then they start talking about some of the problems, like limited gains for low-skilled workers and inaccuracies in the work they produce, and suggest that the limitations will become more obvious as generative AI tools roll out, which will put pressure on providers to address
costs. "AI could add $4 trillion to profits, says McKinsey, but pricing clarity is lacking. Without it, companies cannot predict what financial gains AI can accomplish; and AI cannot predict that either."

[00:25:32] Okay, that's that topic; I'm chugging through my topics. For the next topic I wanted to return and say a bit more about the symbolic methods that dominated AI from the '60s until about 2010, versus what I've termed here cybernetics, because the original alternative, going back to the '50s and '60s, was called cybernetics. In a very real sense, neural networks are a continuation of the cybernetics tradition rather than of the AI tradition that started in the '50s and '60s. In this context, Stanford is the home of the Symbolic Systems program; at the moment we are unique in having a Symbolic Systems program. The name "symbolic systems" came about because, at the time the program was started, philosophy was an active part of it, and Jon Barwise, shown in this picture (he died young, in 2000), had a very strong belief that you need to be dealing with meaning in the world, and with the connection between people's thinking and the world. So he refused to allow the program to be called cognitive science, as it's called at most other places, and it ended up being called Symbolic Systems. Now, at one point there were two universities that had symbolic systems, because Jon Barwise actually moved away from Stanford and went to Indiana, which is where he originally was from, so Indiana also had a Symbolic Systems program for a number of years; but they've changed theirs to cognitive science since he died, so we are unique in having Symbolic Systems.

[00:27:48] The idea of symbolic systems (this is sort of what's on the website, with a bit of interpretation) is that symbolic systems studies systems of meaningful symbols that represent the world about us, like human languages, logics, and programming languages, and the systems that work with these symbols, like brains, computers, and complex social systems. Contrast that with the typical view of cognitive science, which focuses on the mind and intelligence as a naturally occurring phenomenon: symbolic systems gives equal focus to human-constructed systems that use symbols to communicate and to represent information.

[00:28:29] In AI terms, AI as a field, and the name "AI", arose around arguing for a symbolic approach. John McCarthy, who's the color photo there, and who founded Stanford's artificial intelligence effort and the original
famous Stanford AI Lab: John McCarthy came up with the name "artificial intelligence", and he very explicitly chose a new name to disassociate what he was doing from the cybernetics approach, which had been pursued by people including Norbert Wiener at MIT, who's shown on the right side. Marvin Minsky, the teeny photo down here, sort of founded artificial intelligence at MIT; McCarthy worked with him for a few years and then came to Stanford. Two of the other most prominent early AI people were Newell and Simon, who were at CMU; they're the other two people on the right side.

[00:29:52] So, in particular, Newell and Simon developed... well, actually, let me say a sentence first. McCarthy's own background was as a mathematician and logician, so he wanted to construct an artificial intelligence that looked like math and logic, effectively. That was AI as a symbolic system, and that position was developed in the philosophy of artificial intelligence by Newell and Simon, who formulated what they called the physical symbol system hypothesis: "A physical symbol system has the necessary and sufficient means for general intelligent action." That's a super strong claim: it's not only claiming that having a symbol system allows you to produce artificial general intelligence, but, through the "necessary" clause, that you can't have artificial general intelligence without a symbol system. So that was the basis of classical AI.

[00:31:08] That contrasts with cybernetics, which had its origins in control and communication, so it's much nearer to an electrical-engineering kind of background, and which wanted to unify ideas of control and communication between animals (maybe perhaps more than humans) and machines. "Cybernetics" comes from a Greek word, kubernetes, and it's interesting how many uses that root has: it's exactly the same root that occurs in Kubernetes, if you're familiar with that as distributed containers on modern systems, and it's also the same root that the word "government" comes from, which of course is a control system as well.

[00:32:11] It was under the cybernetics tradition that neural nets first started being explored. The most famous of the very earliest neural nets are Frank Rosenblatt's, which were used for vision; the neural net was actually wired. To say just a teeny bit about this, in
case you think that AI hype is only a thing of the 2020s: there was just as much AI hype in the 1950s, when Rosenblatt unveiled his perceptron. The New York Times article about it was headlined "New Navy Device Learns By Doing: Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser", and began: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." And this hype is all the more incredible when you get to a later paragraph of the article and find out what the demonstration was actually of: the demonstration people were shown was that this device learned to differentiate between right-arrow and left-arrow pictures after 50 exposures. [Laughter] But there you go.
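For a sense of what a machine like Rosenblatt's was doing, here is a minimal perceptron sketch on an invented left-arrow/right-arrow encoding: three "pixels", with which pixel is lit indicating direction. The data, encoding, and epoch count are illustrative assumptions, though the update rule is the classic perceptron rule.

```python
def train_perceptron(data, epochs=50):
    """Classic perceptron rule: on a mistake, nudge weights toward the label."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in data:  # label: +1 = "right", -1 = "left"
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented 3-pixel "arrows": rightmost pixel lit = right, leftmost = left.
data = [([1, 0, 0], -1), ([0, 0, 1], 1), ([1, 1, 0], -1), ([0, 1, 1], 1)]
w, b = train_perceptron(data)
```

On linearly separable data like this, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates, which is roughly what "learned after 50 exposures" amounted to.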
language? [00:33:56] The position I would like to suggest is this: there's just no doubt that language is a symbolic system, that humans developed language as a symbolic system. It's perhaps most obvious if you think about it in writing, where we have symbols for the letters and words that we use. But even where there's no writing (and the majority of human language use over time has been verbal), even though the substrate it's carried on, whether sound waves or, in sign languages, movements of the hands, is a continuous substrate, the structure of human languages is a symbol system. We have symbols, which are the sounds of human languages: for "cat" we have a k, an a, and a t; those are symbols, and they're recognized in a symbolic way by language users. And indeed, all the
pioneering work on categorical perception in cognitive psychology was done with the sounds of human languages, the phones, as linguists call them. So spoken language also has a symbolic structure. But, going against Newell and Simon, the fact that humans use a symbol system for communication doesn't mean that the processing of the symbols in the human brain has to be a physical symbol system, and similarly we don't have to design our NLP systems, our computer processors, as physical symbol systems either. The brain is clearly much more like a neural network model, and probably neural models will scale better and capture language processing better than something that is a symbolic processor. [00:35:58] That sort of leaves behind the question of why humans came up with a symbol system for communication in the first place. After all, we could have just sort
of hummed at different frequencies, and that could have been our system of communication. I think the dominant idea, which seems reasonable to me, but who knows, is that having a symbolic system gives signaling reliability: if you have discrete target points that are separated, then when there's degradation of the signal you have the ability to recover it well.
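That reliability argument, that discrete, well-separated target points let a receiver recover a degraded signal, can be sketched in a few lines of Python. This is a toy illustration, not anything from the lecture; the symbols, spacing, and noise level are all invented:

```python
# Toy model of signaling over a noisy continuous channel: because the
# symbol inventory is discrete and well separated, decoding to the
# nearest symbol undoes any noise smaller than half the spacing.
import random

SYMBOLS = {"k": 0.0, "a": 1.0, "t": 2.0}  # discrete target points

def transmit(symbol, noise):
    """Carry a symbol on a continuous substrate, with degradation."""
    return SYMBOLS[symbol] + random.uniform(-noise, noise)

def decode(value):
    """Recover the nearest discrete symbol from the degraded value."""
    return min(SYMBOLS, key=lambda s: abs(SYMBOLS[s] - value))

msg = list("kat")
received = [decode(transmit(s, noise=0.4)) for s in msg]
assert received == msg  # noise under half the symbol spacing: always recovered
```

With a continuous "hum" there would be no nearest target to snap back to; the separation between discrete symbols is what buys the error correction.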
[00:36:40] So where does that leave linguistics, which has mainly been developed in terms of describing a symbolic system? I think the right way to think about it is that linguistics is good for giving us questions, concepts, and distinctions when thinking about language acquisition, processing, and understanding. And indeed, one of the interesting things that has come about is that as NLP and AI have developed further and become able to do a lot of the low-level stuff, the sort of higher-level concepts that linguists often talk about, things like compositionality and systematic generalization (which I'll come back to in a few minutes), the mapping of stable meanings to symbols, the reference of linguistic expressions in the world, get talked about more and more in artificial intelligence contexts when building neural systems. And one way to think about it is that a lot of the early neural network work, most notably visual processing but also other kinds of sensory stuff like sounds, is sort of what gets you to insect-level intelligence, and if you want to get higher up the chain than insect-level intelligence, then a lot of the kinds of questions and properties of linguistic systems become increasingly relevant. [00:38:20] At a slightly more prosaic level,
I don't think one necessarily wants to believe all the fine details of different linguistic theories, but for how human languages are structured and how they behave, I think most of our broad understanding from linguistics is right. And so when we're thinking about NLP systems, thinking about understanding how they behave, wanting to know whether they have certain properties, thinking up ways to evaluate them, a lot of that is done in terms of linguistic understanding: wanting to see whether they capture facts about sentence structure, discourse structure, semantic properties like natural language inference, whether you can do things like bridging and anaphora (which I did not cover in this year's class, because we skipped the coreference lecture when we sliced one lecture off the class), metaphors, presuppositions. All of these
things are linguistic notions that we try to get our NLP models to capture. [00:39:26] I just want to make a couple more remarks about the role of human language in human intelligence, which I think is kind of interesting. An interesting person in the history of linguistics is this guy Wilhelm von Humboldt, who was a prominent German academic. Really, the American education system was borrowed from Germany: up until the Second World War, the preeminent place of science and learning was Germany, and Germany, essentially via von Humboldt's work, developed the idea of having graduate education, and the US copied graduate education from Germany and started doing its own. But in that context it was still the case that for people in the United States prior to the 1930s, generally people would go to Germany to finish their education, either to get
their PhD or to do a postdoc or something like that. So if you trace back my own academic tree, or most other academic trees of people who got PhDs in the US, they actually go back a few generations and then they go back to Germany. We don't think of that as much in the modern world. [00:41:01] So Humboldt was influential in developing the university system, but he also worked a lot on language, and he's someone that Chomsky always cites, because he's known for this famous statement that human language must make infinite use of finite means: the fact that we have a limited supply of words and sentence structures, but out of those we can recursively build up an infinite number of sentences. And that, in Chomsky's view, supports the kind of symbolic, structured view of language that he has been advocating.
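"Infinite use of finite means" is easy to make concrete: a handful of rewrite rules, one of them recursive, licenses ever more sentences at greater depth. The toy grammar below is invented for illustration and is not from the lecture:

```python
# Toy grammar: finitely many rules, but the recursive NP rule
# ("NP -> NP that VP") licenses unboundedly many sentences.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["dogs"], ["cats"], ["NP", "that", "VP"]],  # recursion here
    "VP": [["sleep"], ["chase", "NP"]],
}

def expand(symbol, depth):
    """All strings derivable from `symbol` using at most `depth` rule steps."""
    if symbol not in GRAMMAR:          # terminal word
        return [[symbol]]
    if depth == 0:                     # out of derivation budget, prune
        return []
    results = []
    for rule in GRAMMAR[symbol]:
        seqs = [[]]
        for sym in rule:
            seqs = [s + t for s in seqs for t in expand(sym, depth - 1)]
        results.extend(seqs)
    return results

shallow = {" ".join(s) for s in expand("S", 3)}
deeper  = {" ".join(s) for s in expand("S", 5)}
assert shallow < deeper  # strictly more sentences from the same finite rules
```

Every extra level of depth yields new sentences ("dogs that chase cats that sleep ..."), with no change to the finite rule set.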
[00:41:43] that he's been advocating but I think there's sort of another interesting take [00:41:47] there's sort of another interesting take of Von Holts which um we can argue [00:41:50] of Von Holts which um we can argue whether it's um right or not but I think [00:41:53] whether it's um right or not but I think is kind of interesting and one of the [00:41:56] is kind of interesting and one of the things he want wants to stress is that [00:41:59] things he want wants to stress is that language isn't just something um used [00:42:04] language isn't just something um used for the purpose of [00:42:06] for the purpose of communication um that he [00:42:10] communication um that he um I should actually introduce something [00:42:12] um I should actually introduce something here so um so caraman and tersi are two [00:42:17] here so um so caraman and tersi are two well-known cognitive psychologists and [00:42:19] well-known cognitive psychologists and they introduced this idea that there are [00:42:21] they introduced this idea that there are two kinds of thinking system one [00:42:24] two kinds of thinking system one cognition and system 2 cognition and [00:42:27] cognition and system 2 cognition and system one is the kind of subconscious [00:42:30] system one is the kind of subconscious thinking that you're not really thinking [00:42:32] thinking that you're not really thinking of just we process stuff when it comes [00:42:34] of just we process stuff when it comes into our heads whether visual signals or [00:42:38] into our heads whether visual signals or um speech and system 2 Thinking is um [00:42:42] um speech and system 2 Thinking is um the conscious let me think about this [00:42:44] the conscious let me think about this and try and figure out what's going on [00:42:46] and try and figure out what's going on I'm solving a math problem style of [00:42:48] I'm solving a math problem style of thinking and um you know I think you can [00:42:53] thinking and um you 
can see in von Humboldt's writings essentially the same kind of distinction between System 1 and System 2 cognition, although he refers to System 1 cognition as something of the spirit, and System 2 cognition as thinking. And basically he argues for a version of the philosophical position of the language of thought, suggesting that effective System 2 thinking requires extension of the mind through the symbols of language. So he argued that having language is absolutely a necessary foundation for the progress of the human mind, and I think that's actually an interesting perspective, which I have some sympathy with. Obviously we can think without language: we can feel afraid, we can think visually about how things fit together. But I think it's fairly plausible that for the sort of more
abstract, larger-scale thinking that humans engage in, which has led them to higher levels of thought than a chimpanzee gets to, language gives a scaffolding inside the mind that makes that possible. [00:44:20] Another version of that is from the philosopher Daniel Dennett, who actually died just a couple of months ago. Dennett wrote a book called From Bacteria to Bach and Back, and the main thing the book was about was the origin of human consciousness. I'm not going to talk about human consciousness today, but he introduced a model of four grades of progressively more competent intelligences. The four levels he outlined: the bottom one was Darwinian. Darwinian intelligence is something that is predesigned and fixed; it doesn't improve during its lifetime, and improvement only happens by evolution through genetic selection. So things like bacteria and
viruses are Darwinian intelligences. Then after that come Skinnerian intelligences, which improve behavior by learning to respond to reinforcement; so something like a lizard, or perhaps a dog (we could argue about how intelligent dogs are), has Skinnerian intelligence. Then the third level up, Popperian intelligence, is things that learn models of the environment, so they can improve performance by thinking through plans and then executing them and seeing how they behave. In a computational sense, Popperian intelligence kind of means that you can do model-based reinforcement learning. Primates like chimpanzees can definitely do the kind of planning and model-based reinforcement learning that gives you a Popperian intelligence, but actually a lot of recent evidence shows that a lot of simpler creatures can also do it.
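In caricature, that computational reading of Popperian intelligence, simulating candidate plans in an internal model and acting only on the best one, looks something like the sketch below; the number-line world, the moves, and the goal are all invented for illustration:

```python
# Toy model-based planning in the Popperian sense: roll candidate plans
# forward in an internal model of the world, execute only the best.
from itertools import product

GOAL = 3                      # target position on a number line
MOVES = {"left": -1, "right": +1}

def simulate(state, plan):
    """Predict where a plan ends up, using the internal model only."""
    for action in plan:
        state += MOVES[action]
    return state

def plan_ahead(state, horizon):
    """Think through every plan of length `horizon`; pick the best."""
    candidates = product(MOVES, repeat=horizon)
    return min(candidates, key=lambda p: abs(GOAL - simulate(state, p)))

best = plan_ahead(state=0, horizon=3)
assert best == ("right", "right", "right")  # the model says this reaches the goal
```

A Skinnerian learner would have to try plans in the world and be reinforced; the Popperian trick is that bad plans die in simulation rather than in reality.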
creatures can also do it um [00:46:31] simp simpler creatures can also do it um so I'm not sure the facts here so all [00:46:34] so I'm not sure the facts here so all these studies you see are about um crows [00:46:38] these studies you see are about um crows from um the South Pacific Australia and [00:46:44] from um the South Pacific Australia and Fiji and places like that so I'm not [00:46:46] Fiji and places like that so I'm not sure if Northern Hemisphere crows are [00:46:48] sure if Northern Hemisphere crows are Dumber but at least southern hemisphere [00:46:50] Dumber but at least southern hemisphere crows um can learn plans so that they [00:46:55] crows um can learn plans so that they can do multi-stage planning to work out [00:46:59] can do multi-stage planning to work out ways to get a piece of meat that's down [00:47:01] ways to get a piece of meat that's down a Hole by learning to pick up a stick [00:47:03] a Hole by learning to pick up a stick and poke it in and um so you know that [00:47:07] and poke it in and um so you know that even crows can be paparian intelligences [00:47:10] even crows can be paparian intelligences um but what Dennis suggests is that [00:47:12] um but what Dennis suggests is that there's a stage Beyond um paparian [00:47:16] there's a stage Beyond um paparian intelligence which he calls Gregorian [00:47:19] intelligence which he calls Gregorian intelligence and the idea of Gregorian [00:47:22] intelligence and the idea of Gregorian intelligence is that you can build [00:47:24] intelligence is that you can build Thinking Tools which allow you to do a [00:47:28] Thinking Tools which allow you to do a higher level of control of mental [00:47:32] higher level of control of mental searches and So He suggests that you [00:47:36] searches and So He suggests that you know things like well mathematics is a [00:47:40] know things like well mathematics is a thinking tool but well also democracy is [00:47:43] thinking tool but well also 
democracy is a thinking tool; but nevertheless, out of the space of thinking tools, human language is the preeminent thinking tool that we have. And so he suggests that the only biological example we have of a Gregorian intelligence is human beings. So I think in that kind of sense you can say that there's a very important role for language. [00:48:09] Okay, two parts to go in my summary. The next one is: what kind of semantics should we use for language? This is getting back to the question I mentioned for word vectors, and it's kind of interesting. The semantics that has been dominant in philosophy of language and in linguistic semantics is a notion of model-theoretic semantics, where the meaning of words is their denotation, what they represent in the world. I mentioned this, I think, in an early lecture: if you have a word
like "computer", the meaning of "computer" is the set of computers: this one, that one, that one; everything else is out. So it's a denotational relationship between a word and its denotation in the world, or in a model of the world, and that was the notion used in most of the history of AI for doing symbolic AI. That contrasts with the sort of distributional semantics where the meaning of a word is understood from the contexts in which it's used, which is effectively what we're using for our neural models.
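The contrast can be put in a few lines of Python. Both the tiny "world" and the tiny "corpus" below are invented purely for illustration; real distributional models (like word2vec) learn dense vectors rather than raw counts, but the idea is the same:

```python
# Caricature of the two semantics: denotational vs. distributional.
from collections import Counter
from math import sqrt

# Denotational: the meaning of "computer" is the set of things it picks out.
DENOTATION = {
    "computer": {"laptop_1", "desktop_7"},
    "table": {"table_2"},
}

# Distributional: the meaning of a word is the contexts it occurs in.
CORPUS = ("i bought a fast computer . i bought a fast laptop . "
          "the red apple is on the table").split()

def context_vector(word, window=2):
    """Count the neighbors of `word` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(CORPUS):
        if tok == word:
            lo, hi = max(0, i - window), min(len(CORPUS), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[CORPUS[j]] += 1
    return counts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm

# Words used in similar contexts come out similar; unrelated words do not.
assert cosine(context_vector("computer"), context_vector("laptop")) > \
       cosine(context_vector("computer"), context_vector("table"))
```

The denotational dictionary needs the world enumerated by hand; the distributional vectors fall out of raw text, which is why they suit neural models so well.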
[00:49:27] If you look at the traditional view of interpreting the meaning of human language, this is what you'll have seen if you did an intro logic class at some point: you have a sentence, "the red apple is on the table", and you get to write it in some logical representation, first-order predicate calculus or whatever, something like ∃x (red(x) ∧ apple(x) ∧ ∃y (table(y) ∧ on(x, y))). (The one on the slide is a bit different from plain first-order predicate calculus, where you only get "for all" and "there exists".) So you have a formal logic, and in weeks one and two of the logic class you have some English sentences which you translate into formal logic, and after that you forget about human languages and just start proving stuff about formal logical systems. To some extent, what you get in a philosophy class represents the tradition of Alfred Tarski. Tarski believed that you couldn't talk about meaning in terms of human languages, because human languages were, quote, "impossibly incoherent". And from about the 1940s until 1980, Tarski was the preeminent logician in the US; he was at Berkeley. So that was very
much the view of the logicians of the world. But during that period one of his students was this guy Richard Montague, and Montague sort of rebelled against that picture, saying "I reject the contention that an important theoretical difference exists between formal and natural languages." And so he set about showing that you could start building up a formal semantics for describing the meaning of natural language sentences, and Richard Montague's work became the foundation of the work that's used in semantics in linguistics as well: for anyone who's done Ling 130 or 230, the picture you saw is sort of a Montague picture of semantics. And so that was the semantics that was taken over and essentially used as the model of doing natural language understanding for most of the history of NLP, roughly 1960 to 2015 or 2017. You
know, and so the picture essentially was that if we wanted to interpret a sentence like "the red apple is on the table", what we would do is first produce a syntactic structure for the sentence, so we would parse it, and then, using ideas roughly along the lines that Montague suggested, we would construct its meaning by looking up meanings of words in a lexicon and then using the compositionality of human languages to work out the meanings of progressively larger phrases and clauses in terms of the meanings of those words and the way that they are combined, slightly reminiscent of my discussion of tree structures to meanings in the last lecture I gave. And so you would build up a meaning representation of a sentence, which could then give you a semantic meaning of the sentence that you could use in a system.
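That parse-lookup-compose pipeline can be caricatured in a few lines of Python. The lexicon entries, schema, and table names below are all invented, and a real system of that era did full syntactic parsing and Montague-style composition rather than this flat word-by-word walk:

```python
# Caricature of the classic pipeline: look up each word's meaning in a
# lexicon, compose the meanings, and emit an executable query (SQL).

LEXICON = {
    "red":   ("constraint", "color = 'red'"),
    "cars":  ("entity", "cars"),
    "Kathy": ("constraint", "liked_by = 'Kathy'"),
}

def compose(words):
    """Each word contributes a table or a constraint; conjoining the
    constraints mirrors how phrase meanings combine compositionally."""
    table, constraints = None, []
    for w in words:
        kind, value = LEXICON.get(w, (None, None))
        if kind == "entity":
            table = value
        elif kind == "constraint":
            constraints.append(value)
    return f"SELECT COUNT(*) FROM {table} WHERE " + " AND ".join(constraints)

sql = compose(["red", "cars", "Kathy"])
# sql: SELECT COUNT(*) FROM cars WHERE color = 'red' AND liked_by = 'Kathy'
```

The crucial property is that the meaning of the whole query is built from the meanings of its parts, which is exactly the Montague-style compositionality the lecture describes.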
[00:53:27] approximately a slide except titled that I actually used to use in [00:53:29] I actually used to use in cs224n um in the [00:53:32] cs224n um in the 2000s decade right so we have a we have [00:53:37] 2000s decade right so we have a we have um or part of a sentence um I get oh no [00:53:40] um or part of a sentence um I get oh no it's a whole sentence here it is how [00:53:42] it's a whole sentence here it is how many red [00:53:44] many red cars what can I get this sentence I [00:53:47] cars what can I get this sentence I think there's a sentence here how many [00:53:50] think there's a sentence here how many oh how many red cars in Palo Alo does [00:53:53] oh how many red cars in Palo Alo does Kathy like how many red cars in pal does [00:53:57] Kathy like how many red cars in pal does Kathy like and sorry yeah the cars sorry [00:53:59] Kathy like and sorry yeah the cars sorry got hidden underneath here um yeah so we [00:54:02] got hidden underneath here um yeah so we have a sentence we pass it we look up [00:54:05] have a sentence we pass it we look up meanings of words in a lexicon we start [00:54:07] meanings of words in a lexicon we start composing them up um we get a semantic [00:54:10] composing them up um we get a semantic form for the whole sentence which we can [00:54:12] form for the whole sentence which we can then convert into SQL and we can run [00:54:15] then convert into SQL and we can run against a database and we can get the [00:54:17] against a database and we can get the answer and this was a in outline the [00:54:21] answer and this was a in outline the kind of technology that was widely used [00:54:23] kind of technology that was widely used for natural language understanding [00:54:25] for natural language understanding systems that were built anywhere from [00:54:28] systems that were built anywhere from the 1960s to the 2s and1s and you know [00:54:33] the 1960s to the 2s and1s and you know in particular um they were used not only 
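As a toy illustration of that pipeline (look up meanings in a lexicon, compose them into a semantic form, convert to SQL, run against a database), here is a minimal sketch. It is not the system on the slide: the lexicon entries, the table schema, and the keyword-based composition are all invented for illustration, and a real semantic parser composes meanings over an actual syntactic parse rather than by keyword matching.

```python
# Toy sketch of the classic NLU pipeline: lexicon lookup -> composition -> SQL -> database.
# The lexicon entries and the schema below are invented for illustration.
import sqlite3

# "Lexicon": each phrase contributes a fragment of meaning (here, a SQL piece).
LEXICON = {
    "how many":  ("op",     "COUNT(*)"),
    "cars":      ("entity", "cars"),
    "red":       ("filter", "color = 'red'"),
    "palo alto": ("filter", "city = 'Palo Alto'"),
    "kathy":     ("filter", "liked_by = 'Kathy'"),
}

def compose(question: str) -> str:
    """Compose word meanings into one SQL query (grossly simplified:
    keyword matching stands in for a real syntactic parse)."""
    q = question.lower()
    op, table, filters = None, None, []
    for phrase, (kind, meaning) in LEXICON.items():
        if phrase in q:
            if kind == "op":
                op = meaning
            elif kind == "entity":
                table = meaning
            else:
                filters.append(meaning)
    return f"SELECT {op} FROM {table} WHERE {' AND '.join(sorted(filters))}"

# Run the composed semantic form against a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (color TEXT, city TEXT, liked_by TEXT)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", [
    ("red",  "Palo Alto", "Kathy"),
    ("red",  "Palo Alto", "Sam"),
    ("blue", "Palo Alto", "Kathy"),
    ("red",  "Oakland",   "Kathy"),
])
sql = compose("How many red cars in Palo Alto does Kathy like?")
(answer,) = conn.execute(sql).fetchone()
print(sql)     # the composed semantic form, rendered as SQL
print(answer)  # -> 1
```

The brittleness mentioned just below is visible even in this sketch: any paraphrase ("crimson automobiles," "in the Palo Alto area") falls outside the hand-built lexicon and the query breaks.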
[00:54:37] In particular, they were used not only in a purely rule-based grammar-and-lexicon way; this same basic technology was incorporated into a machine learning context, where your goal was to learn various of these parts. You could not only learn the parser, but also learn semantic meanings of words and learn composition rules. The acme of that work was what was called semantic parsing, pioneered by Luke Zettlemoyer and Mike Collins in the 2000s decade and then taken up by others, including Percy Liang; Percy's PhD thesis, and also his early work at Stanford before he was convinced to do neural networks, was semantic parsing work. So, you know, these systems could actually work and were used in limited domains, but they were always extremely brittle. And the interesting question is what humans do.
[00:55:38] There is some evidence that humans do something like this, that they work out the structure of sentences and compute meanings in a bottom-up, mostly projective way. There's a lot of controversy as to exactly how human understanding of sentences works, but there are certainly people who have argued in support of human brains doing something similar. That's obviously not what we're getting with current-day Transformers. And so the question is: do our current-day neural language models provide suitable meaning functions? That's a complex question, because in many ways they do an amazing job at understanding whatever sentences you put into them, but there are still some genuine concerns as to whether they are taking shortcuts, or only work to a certain extent, and don't actually have the same kind of compositional understanding with systematic generalization that human beings do.
[00:56:50] Okay, so that's the traditional denotational semantics view, and it contrasts with the kind of use theory of meaning. In the first or second lecture, and at the beginning of this one, I attributed that to the British linguist J.R. Firth: "You shall know a word by the company it keeps." But it's not only a position of Firth's; it has also been a minority position of philosophers. In particular, it was advanced by Wittgenstein in his later work, Philosophical Investigations. In that work he writes: "When I talk about language (words, sentences, etc.) I must speak the language of every day. Is this language somehow too coarse and material for what we want to say? Then how is another one to be constructed? And how strange that we should be able to do anything at all with the one we have!"
[00:57:48] Philosophical Investigations is written in this sort of vaguely poetical, literary style, but the point of it is meant to be saying: look, these logician people are claiming you can't use natural human languages to express meaning and you have to translate into this symbol system; but isn't that a weird concept, that one symbol system is no good but this other symbol system somehow fixes things? And then, about denotational semantics, he writes: "You say: the point isn't the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word. Here the word, there the meaning. The money, and the cow that you can buy with it. (But contrast: money, and its use.)" And he goes on from there to argue that, you know, the meaning of money is the way that money can be used in the world; the meaning of money isn't pointing at pieces of money.
[00:58:51] Okay, so this is what's referred to as a use theory of meaning, and so the question is: is that a good theory of meaning? Some people just don't accept this kind of distributional-semantic, use theory of meaning as a theory of meaning or semantics. Most prominently in recent NLP work, that's the position of Bender and Koller, who just take it as axiomatic that the only thing that counts as having a meaning is that you've got form over here and meaning over there. But I think that's too narrow. I think we have to argue that the meaning of words arises from connecting words to other things, and although in some sense you could say that connecting words to things in the real world is privileged, it's not the only way that you can ground meanings: you can have meanings in a virtual world, but you can also have meanings by connecting one word to other things in human language.
[01:00:07] And the other thing that I think you need to say is that meaning isn't a zero-one thing, where you either know the denotation of a word or you don't. I think meaning is a gradient thing, and you can understand meanings of words and phrases either more or less. So this is an example I gave in a piece that I wrote a couple of years ago. What is the meaning of the word "shehnai"? Well, maybe a few of you know it, but if you don't, what could I do? Well, if you'd seen or held one, you'd have classic grounded meaning; you'd know something about the denotation. If that's not the case, I could at least show you a picture of one (here's a picture of one), so that gives you some information about what a shehnai is. But is that the only thing I can do?
[01:01:07] I mean, suppose... well, sorry, I left out a bullet point: this gives you a partial meaning of a shehnai, but surely you'd have a richer meaning if you'd heard one being played. And is showing you a picture of one the only thing I can do? Suppose you'd never seen, felt, or heard one, but I told you it's a traditional Indian instrument, a bit like an oboe. I think you understand something about the meaning of the word at that point: you know it's sort of connected to India, and it's a wind instrument using reeds that's used for playing music. I could tell you some other things about it; I could say it has holes sort of like a recorder, but has multiple reeds and a flared end, more like an oboe. Then maybe you know a bit more about a shehnai, even though you've never seen one.
[01:02:07] And if you then extend to what we do more in our sort of corpus-based linguistic learning, you could imagine that it's not that I tried to define one for you; instead, I've just shown you a textual use example, or several of those. So here's one textual use example: "From a week before, shehnai players sat in bamboo machans at the entrance to the house, playing their pipes. Bash Babu disliked the shehnai's wail but was determined to fulfill every conventional expectation the groom's family might have." So if that's all you know about a shehnai, in some ways you understand less of the meaning of the word than if you'd seen one, but actually, in other ways, you understand more of the meaning of the word than if you'd just seen one, because from that one textual example you know some things: you've heard a characterization of the sound as wailing, and you know that it's
connected with weddings, which you don't get from just having held or looked at one, or even having had someone stand in front of you and play it; and that's an important part of the meaning of a shehnai to people. And so that's the sense in which I think meaning comes from various kinds of connections. [01:03:44] Okay, last topic: our AI future. So there are different senses of our AI future and lots of things that we can be worried about. One thing we can be worried about is whether we're all going to lose our jobs. Interesting question. Here's a newspaper article from The New York Times: "March of the machine makes idle hands. Prevalence of unemployment with greatly increased industrial output points to the influence of labor-saving devices as an underlying cause." This was published in The New York Times in 1928.
[01:04:25] But, you know, it turns out that quite a few people like labor-saving machines: washing machines and dishwashers and sewing machines, lots of useful labor-saving machines. And this was published in 1928, at a time when a small group of immensely powerful and rich men dominated the United States, just before the Great Depression. But what happened in the decades after that greatly changed policies in the United States and led to boom years that distributed wealth and work much more evenly across the country, and the country boomed. Here's another one: "In the past, new industries hired far more people than those they put out of business. But this is not true of many of today's new industries. Today's new industries have comparatively few jobs for the unskilled or semiskilled, just the class of workers whose jobs are being eliminated by automation."
[01:05:37] This was Time magazine in 1961. So this is a longstanding fear which, at least so far, has not been realized. Here we are in a country in which not everyone might have the work that they wish they had, but overall almost everybody has a job, and many people are working a lot of hours a week; whereas once upon a time the claim was that before the end of the 20th century we would only have to do a three-day work week, because there wouldn't be much work to go around. Imagine. So another fear is: will almost all the money go to five to ten enormous technology giants? I actually think this is a more serious worry; this seems to be the direction that we're headed in at the moment. I think there's no doubt that modern networks and a concentration of AI talent tend to encourage this outcome.
[01:06:39] But, you know, essentially this is the modern analog of what happened in the early decades of the 20th century. The equivalent then was transportation networks, and it was domination of the new transportation networks, like railways, that led to a few people dominating the economic system. But what happens here essentially comes down to a political and social question. As I was mentioning before, after the Great Depression, countries successfully dealt with the monopolistic power of a small number of companies, and with political leadership we could do that again. The problem is that there's not much sign of political leadership right at the moment; but that's a political problem to solve rather than actually being a technological problem to solve. So the next problem is: should we be afraid of an imminent singularity, i.e., when machines have artificial general intelligence beyond the human level?
[01:07:49] In particular, would such an event threaten human survival? This is a concern that has increasingly exploded into the mainstream with discussions of AI existential risk, and quite a few of the discussions that have been leading to the setting up of things like AI safety institutes in the US and UK are motivated by these worries of out-of-control artificial intelligence taking over and deciding to eliminate humanity. So we get article headlines like "Pausing AI developments isn't enough. We need to shut it all down," "How rogue AIs may arise," "'AI godfather' Geoffrey Hinton warns of dangers as he quits Google," and "We must slow down the race to God-like AI." I don't personally give these concerns too much credence, and I think there's started to be increasing pushback against them.
[01:08:58] So, in the other direction, François Chollet, who is the architect of Keras, argues: "There does not exist any AI model or technique that could represent an extinction risk for humanity, not even if you extrapolate capabilities far into the future via scaling laws. Most arguments boil down to: this is a new type of technology; it could happen." Joelle Pineau, a Meta AI leader, refers to existential-risk discourse as unhinged, and points out the flaw in a lot of the utilitarian argumentation that goes along with discussions of these risks: if you say that the elimination of humanity is infinitely bad, that means any nonzero chance multiplied by infinity will be bigger than the badness of anything else that could happen in the world; but that isn't actually a sensible way to have a rational discussion about the outcomes.
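The flaw being pointed out here can be made concrete with a few lines of floating-point arithmetic (the function name and the numbers are made up for illustration): once an outcome is assigned infinite badness, any nonzero probability of it dominates every finite expected harm, so the expected-value comparison tells you nothing about the actual probabilities involved.

```python
# Sketch of the degenerate expected-value argument, using IEEE floating-point infinity.
import math

def expected_badness(prob: float, badness: float) -> float:
    """Probability-weighted badness of an outcome."""
    return prob * badness

# An "infinitely bad" outcome at a vanishingly small probability...
tiny_extinction_risk = expected_badness(1e-15, math.inf)
# ...versus a near-certain, very large but finite harm.
near_certain_harm = expected_badness(0.99, 1e9)

# The infinity swamps everything finite, no matter how small its probability is.
print(tiny_extinction_risk > near_certain_harm)  # -> True
```

Because the comparison comes out the same way for every nonzero probability, the "risk" term carries no information, which is the sense in which the argument is not a basis for rational discussion.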
[01:10:09] And many people, including Timnit Gebru, have argued that a lot of the outcome of this focus on existential risk (and, if you're more cynical, a lot of the purpose of this focus on existential risk) is to distract away from the immediate harms that are arising from companies deploying automated systems, including their biases, worker exploitation, copyright violation, disinformation, growing concentration of power, and regulatory capture by leading AI companies. And that's something that is worth thinking about: behind all the discussions of our amazing AIs and all the things we can do with them, like get our homework done or generate wonderful images, there are lots of things underneath about disinformation, deception, hallucinations, problems of homogeneity of decision-making, violation of copyrights and people's creativity, lots of carbon emissions, and erosion of rich human practices.
[01:11:23] So we need to be conscious of the sort of present-day harms that can come about from AI. And for NLP as well there are various kinds of harms that we've touched on, which include generating offensive content, generating untruthful content, and enabling disinformation. The disinformation one is an interesting one: if models can reason well about texts, can they also be persuasive in communicating incorrect information or opinions to users? Perhaps there are new possibilities for doing very personalized misinformation propagation that persuades human beings more easily than traditional methods of political advertising. And there's starting to be evidence that that's true. It's still being debated in the literature, but there are now
multiple studies suggesting that humans can be [01:12:20] studies suggesting that humans can be influenced by disinformation generated [01:12:22] influenced by disinformation generated by AIS and it seems reasonable to think [01:12:25] by AIS and it seems reasonable to think that we're going to start to see more [01:12:27] that we're going to start to see more use of that in political systems and [01:12:30] use of that in political systems and elsewhere which is potentially um quite [01:12:33] elsewhere which is potentially um quite scary and you know perhaps the worst of [01:12:36] scary and you know perhaps the worst of it isn't going to be text based it's [01:12:38] it isn't going to be text based it's likely that visual [01:12:41] likely that visual um fakes are going to be even more [01:12:45] um fakes are going to be even more compelling in political context and um [01:12:48] compelling in political context and um this sort of seems like whether it [01:12:50] this sort of seems like whether it happens in the US for this election or [01:12:53] happens in the US for this election or in other countries in their election [01:12:55] in other countries in their election that we're likely to see some major [01:12:58] that we're likely to see some major incidents where um AI generated fakes [01:13:01] incidents where um AI generated fakes can be seen of having a major impacts on [01:13:04] can be seen of having a major impacts on political [01:13:05] political systems so I sort of think really um [01:13:09] systems so I sort of think really um what we should be doing is worrying not [01:13:12] what we should be doing is worrying not about existential risks but worrying [01:13:14] about existential risks but worrying about what people and organizations with [01:13:17] about what people and organizations with power will use AI to do um that this is [01:13:21] power will use AI to do um that this is a pattern that we've noticed multiple [01:13:24] a pattern that we've 
[01:13:27] Also with social media: in the early days of social media there was the idea that this was meant to lead to new freedoms for people across the globe, bringing the positives of free political thought and improved human lives. In large measure that isn't what's happened. New technologies get captured by powerful people and organizations who master the new technological options, and AI and machine learning are being increasingly used for surveillance and control; we're seeing that around the world at the moment.

[01:14:04] So my final thought to end with is a thought about Carl Sagan. When I was young, many decades ago, Carl Sagan did the series Cosmos on television, explaining the miracles of the universe, and at the time, when I was a teenager, I loved Cosmos. Now, that was a long time ago, and much more recently there's a new generation of Cosmos, and the book is advertised on the basis of a new foreword by Neil deGrasse Tyson. I think Carl Sagan was a good guy, and he didn't only write Cosmos; he wrote a number of other books, and another of the books he wrote was The Demon-Haunted World, which has a theme that's a little bit closer to some of the things we're dealing with here. In that book he writes: "I have a foreboding of a world in my children's or grandchildren's time, when awesome technological powers are in the hands of a very few, and no one representing the public interest can even grasp the issues; when the people have lost the ability to set their own agendas or knowledgeably question those in authority; when, clutching our crystals and nervously consulting our horoscopes, our critical faculties in decline, unable to distinguish between what feels good and what's true, we slide, almost without noticing, back into superstition and darkness."

[01:15:44] I think if you look around the US and many other parts of the world today, this is actually much more the risk that humanity is facing, and it's why education, which we try to provide at Stanford and other places, is an important thing that should be valued, along with all the other things that go with it, like open source that supports the broad dissemination of learning. Thank you. [Applause]

================================================================================
LECTURE 019
================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
Source: https://www.youtube.com/watch?v=5vfIT5LOkR0
---
Transcript

[00:00:05] So today I'm delighted to introduce our first invited speaker, Douwe Kiela. As well as being invited, and I'll tell you his background, he's also in the Symbolic Systems Program, where he has
been an adjunct professor and has been involved with some students in that role as well. But in his invited role: he's originally from the Netherlands, where he even learned some logic, among other things, back in the old days. In more recent times he's been a prominent deep learning researcher for a number of years. He worked at Facebook, now Meta, in the FAIR unit, and was involved in various ideas including retrieval-augmented generation. After that he spent some time at Hugging Face. He's become interested in looking at multimodal models, which is what he's going to be talking about today, and we welcome Douwe. It's great to have you.

[00:01:07] Thank you very much. Yes, thanks everyone for coming. I understand that you get points for being here, so you're not really here for me, but thanks for coming anyway. So I'm going to talk about multimodal deep learning.
[00:01:31] It's going to have an NLP focus, of course, since this is an NLP course, but also because otherwise I would be talking for many more hours than I have time for here, so I'll try to keep it focused on the things that I think will be most useful for you to learn. The first thing you should understand is that this whole concept of multimodality is kind of ill-defined, actually. If you go to the dictionary, you'll see that it means having or involving several modes, modalities, or maxima. And what "mode" here really means: it could be mode in the very generic sense, or it could be the very precise sense of the mode of a statistical distribution. Depending on the paper you're reading, in some cases people really mean the statistical sense; in other cases people mean the very vague concept of a modality where it really means the type of information you're getting. An example of a modality in that case is an image, or a speech signal, or audio in general, or even olfaction, so smell, or things like that. In this lecture we're just going to focus mostly on text, because this is an NLP course, and we're going to focus on images as the other modality, to keep it simple.

[00:02:46] All right, so why does it matter? Why do we care about multimodality? There are a couple of really good reasons in general for this. The first one is about faithfulness: if you look at how we humans understand the world, how we make sense of what happens in the world, that is very multimodal. We perceive the world not just using vision or just audio; we synthesize information across all of these different modalities, and that's how we understand the world and each other.
[00:03:17] There's also a very practical argument for doing it: the internet is multimodal. If you go to, I don't know, Facebook or something like that, it rarely happens that it's just text or just an image; there's usually a combination of multiple modalities. And then the final good reason, one that we're just starting to hit now if you're really following where the field is going: we're kind of running out of text data for these large language models. So one interesting way to keep scaling on the data side is to make use of all of these other modalities. If you can have your language model also watch all of the videos of cats in the world, it's going to understand the concept of cat much better, and that's what we want in these models: we want them to understand the world in the same way that humans understand it.

[00:04:06] So right now multimodality is really one of the main frontiers of this new foundation-model drive that we're all in. There's a thing called the McGurk effect; let's see if it loads up. What we'll see when this loads is this guy over here, with the same audio being played each time. So the audio is exactly the same, and this man is going to say something like ... and so you're hearing a "ba" there, I think, if you look at my mouth, because that's what I said. But if you then change the video to where he says ..., with exactly the same audio, you're going to hear the other version. Unfortunately I can't really swap in the different audio here, so you have to trust me on it; we might suddenly start hearing a guy saying ... All right. So, multimodal applications: when we have multiple modalities we can do all kinds of interesting things.
[00:05:11] As I said, most of the use cases we have on the internet are multimodal, and there are some really obvious things we would be interested in if we have information from these different data sources, from different modalities. Obviously we might want to do retrieval: maybe given a bit of text we want to find the right image, or given some image we want to find the right text for it, so we can match them up. We can also do this in a generative setting: then we have image captioning, which you've probably heard of, and text-to-image generation, that's image synthesis, so stable diffusion, which everybody in the audience here has probably seen. Then we could do visual question answering, where we have an image and text and then need to generate some new text. We have multimodal classification, where we have image and text and need to produce a label, for example whether something is hate speech or not. And in general we want to have a richer understanding of information, which means that we combine images and text and then use it for downstream applications that require better understanding or better generation.

[00:06:14] So this field really is super hot right now. There's this nice paper title; I predict that this paper is going to do really well in terms of citations just because it has such a citable title, though I think a lot of people are not actually going to read it. I mean, I've been in this field for quite a while now, and people have been saying this for a really long time, I think Chris would agree: for decades people have been saying that multimodal is the next big thing, but now it's really true, I think.
[00:06:42] All right, so the outline for what we're going to be talking about: first I'm going to tell you a little bit about early models, then we're going to do a bit of a deep dive on some of the specifics, then we're going to go over a particular type of fusion, contrastive models, or late fusion, then we're going to go through a little bit of the history of multimodal foundation models, then we're going to talk a little bit about evaluation and a little bit about other modalities, and then I'll make some predictions for the future and hopefully give you some cool research ideas or things to talk or think about.

[00:07:16] All right. So obviously there's a lot of work that happened before deep learning, but I think if you want to start from the deep learning revolution and what was happening in images and text, then
[00:07:31] a good starting point is, for example, WSABIE, or DeViSE; or Richard Socher, who you've probably heard of, has done some really cool early work that pioneered a lot of these ideas. The basic gist is that we have a vision model on the one hand and a language model on the other. The first lecture of this course, I think, was about word embeddings, so that's just your basic word embedding model, and now we need to figure out how to align them in the same multimodal space. The way you do that is you get some sort of similarity metric, a score function, or a kernel function if you're thinking about this from a support vector machine literature perspective, and now you need to figure out, with a max-margin or margin loss, how you want to align these two points in your embedding space: things that are similar you want to bring closer together, and things that are not, you want to push further apart. If you do that in this multimodal embedding space, you can do interesting cross-modal transfer, where you take the word embedding for something like "auto" or "horse" and then find close images in the embedding space to that thing, and now you've solved the retrieval problem. So this is a really nice early application, and for a lot of the stuff in the early slides you're going to see this idea come back over and over again; you're going to see it get reinvented with fancier models, but it's basically all the same stuff. You can do cross-modal transfer where you have images and text, but you can also combine them together so that you get a multimodal word embedding.
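A minimal sketch of the alignment recipe just described: a similarity score between a word vector and image features, a margin loss that pulls a matched image closer than a mismatched one, and nearest-neighbor retrieval in the shared space. All vectors and the margin value here are toy assumptions, not numbers from WSABIE or DeViSE.

```python
# Toy max-margin cross-modal alignment, in the spirit of WSABIE / DeViSE.
# Vectors are assumed to already live in a shared embedding space.

def dot(u, v):
    """Similarity score: plain dot product between two vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_loss(word_vec, pos_img, neg_img, margin=0.5):
    """Hinge loss: matching image must score `margin` higher than a mismatch."""
    return max(0.0, margin - dot(word_vec, pos_img) + dot(word_vec, neg_img))

def retrieve(word_vec, images):
    """Cross-modal retrieval: index of the highest-scoring image."""
    return max(range(len(images)), key=lambda i: dot(word_vec, images[i]))

horse = [0.9, 0.1]                       # toy word embedding for "horse"
images = [[1.0, 0.0], [0.0, 1.0]]        # toy image features
print(margin_loss(horse, images[0], images[1]))  # 0.0: pair already separated
print(retrieve(horse, images))                   # 0: the horse-like image
```

In a trained system the loss would be summed over many sampled negative images and minimized by gradient descent; here only the scoring and ranking logic is shown.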
[00:09:06] And this just gives you a more accurate representation of how humans understand word meaning, because when we think about the word "moon" or "cat" or something, we can go to Wikipedia and read that a cat is a small carnivorous mammal that people like to keep as pets, or we can just go and look at pictures of cats, and now we understand what a cat is. And I would argue, actually, that for a lot of people the picture of the cat is much closer to the meaning of the concept of cat.

[00:09:31] Some early work where people were trying to do this is from Bruni et al., where they did multimodal distributional semantics using a very elegant approach called bag of visual words. Who has heard of bag of visual words? Very few people, okay. It's surprisingly simple, and so I kind of like it; it's nicely elegant. You take a picture, of a moon in this case, I think you can see it in the back too. We use an algorithm like SIFT to find interesting keypoints: where the difference between a pixel and the pixels next to it is big, those are the spots you want to be looking at. For each of these keypoints you get feature descriptors, relatively small vectors, like 32-dimensional, depending on the implementation. What you can do with these feature descriptors is cluster them using k-means, and then you assign every one of these points to a cluster and count how often they occur. So in this picture of the moon there are three red dots, which is why the count for the red-dot visual word is three. What that gives you is a count over the visual words.
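The bag-of-visual-words pipeline just described (keypoint descriptors, a k-means codebook, then counting) can be sketched as follows. The SIFT descriptors and the learned codebook are replaced here by tiny made-up vectors, since the assign-and-count step is the part being illustrated.

```python
# Bag-of-visual-words sketch. Real pipelines extract SIFT descriptors and
# learn the codebook with k-means; both are stand-in toy vectors here.

def nearest(desc, codebook):
    """Index of the closest codebook centroid (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((d - c) ** 2 for d, c in zip(desc, codebook[i])))

def bag_of_visual_words(descriptors, codebook):
    """Histogram: how often each visual word occurs in one image."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        hist[nearest(desc, codebook)] += 1
    return hist

codebook = [[0.0, 0.0], [1.0, 1.0]]                  # two "visual words"
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8]]   # toy keypoint descriptors
print(bag_of_visual_words(descriptors, codebook))    # [1, 2]
```

The resulting histogram plays the same role for an image that a word-count vector plays for a document.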
[00:10:48] This histogram is very similar to the original bag-of-words model that you've hopefully heard about, maybe in the first lecture; it's the visual equivalent of the textual thing. If you do this, and you then concatenate, or apply SVD to fuse the information, what you get is a word embedding that is much more representative of human meaning, as reflected in the datasets that people used to care about at the time. After that there were a couple of people, me included, who tried to take these ideas and really apply deep learning to them. Some of the very early versions of this used convolutional neural networks: you can transfer the features from your convnet, take your word embeddings, which you've seen in the first lecture, and concatenate them, and now you have a multimodal word vector. Or you can do something slightly fancier.
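Before the fancier variant, the simple concatenation fusion just described can be sketched as below. The L2 normalization before concatenating is a common choice I'm assuming here, not something the lecture specifies, and the vectors are toy numbers rather than real word2vec or convnet features.

```python
# Concatenation fusion: glue a word embedding to an image feature vector
# to get a multimodal word vector. Normalizing each modality first (an
# assumed convention) keeps one modality from dominating via larger norms.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(word_vec, image_vec):
    """Multimodal word vector = [normalized text ; normalized image]."""
    return l2_normalize(word_vec) + l2_normalize(image_vec)

cat_text = [3.0, 4.0]    # toy word-embedding for "cat"
cat_image = [0.0, 2.0]   # toy convnet feature for cat pictures
print(fuse(cat_text, cat_image))  # [0.6, 0.8, 0.0, 1.0]
```

The fused vector can then be used anywhere the plain word embedding was, for example in the word-similarity evaluations mentioned above.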
[00:11:39] Or you can do something slightly fancier. You've seen the skip-gram model; you can also do skip-gram predictions onto image features. So when you see a word like "cat" in some context, like "the cute little cat sat on the mat", then when you see "cat" you also want to predict cat pictures. [00:11:54] Super easy ideas, but it turned out this gives you much richer word representations, which is kind of cool. But obviously words are very limited; what we really care about is not words but sentences. So people then started looking into sentence representations: how do we get compositional understanding into the sentence representations, and how do we align those with images? [00:12:19] The loss here is very similar to what we saw with words and pictures, but now we just have a sentence encoder. There are some really cool early papers here from Andrej Karpathy, and
Richard Socher also had some work here. [00:12:31] The basic idea is that instead of word embeddings we now have an LSTM in these papers, or some other kind of recurrent neural network, or in the case of this one a recursive neural network, and then we try to align the features together. [00:12:53] These three or four papers are actually very important; this one by me is less important, but it's still kind of interesting, because we showed that grounded sentence representations work: if you just use this part here as a sentence encoder for NLP tasks, the ability to predict pictures from it already gives you a really good sentence representation. Just by predicting pictures you can sort of imagine what things look like, and that gives you a really good meaning representation, which you can then transfer to, I don't know, sentiment classification or something else.
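The alignment loss in this line of work was typically a max-margin ranking loss over sentence-image pairs; here is a minimal numpy sketch (the margin value, dimensions, and random embeddings are illustrative, not taken from any particular paper):

```python
import numpy as np

def ranking_loss(sent_emb, img_emb, margin=0.2):
    """Max-margin ranking loss: a sentence should score higher with its own
    image than with any other image in the batch (and vice versa)."""
    # cosine similarity matrix: entry [i, j] = sim(sentence i, image j)
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sim = s @ v.T
    pos = np.diag(sim)  # matched pairs sit on the diagonal
    cost_s = np.maximum(0, margin + sim - pos[:, None])  # wrong image per sentence
    cost_i = np.maximum(0, margin + sim - pos[None, :])  # wrong sentence per image
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_i, 0)
    return cost_s.sum() + cost_i.sum()

# with one-hot embeddings every matched pair wins by the full margin
print(ranking_loss(np.eye(4), np.eye(4)))  # 0.0
```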
[00:13:26] And then of course, once we have sentence encoders, we also have decoders. When the sequence-to-sequence architecture came out, which you've probably also heard about in this course, what you can do, instead of having a text encoder for your source language as in machine translation, is plug in a convnet in place of the LSTM encoder, and now you can generate captions. That's exactly what people did. [00:13:52] We used to have all these fancy diagrams in our papers where we explained the LSTM and how it works; people probably don't learn that anymore these days. They do? Very good. They might make a comeback; I think at some point Transformers are going to go away, we'll see.
[00:14:17] One of the things that people figured out in machine translation very early on is that you can do alignment of words between your source language and your target language, and you can do the same thing with images: if you want to align a word in your generated sequence with something in your picture, you can use the same approach, and that approach of course is called attention. [00:14:37] You've probably learned a lot about attention in this course, and it was one of the building blocks of these systems as well. You can do very interesting things and really see that when the model has to generate "stop" for the stop sign, it is actually looking at the stop sign; there's a really cool alignment going on in these models.
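Schematically, one attention step in such a captioning model scores every image region against the current decoder state and takes a softmax; a toy sketch (dot-product scoring is one common choice, and the shapes are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, region_feats):
    """One attention step: score each image region against the decoder state,
    then return the weights and the weighted sum of region features."""
    scores = region_feats @ decoder_state  # dot-product scoring, one per region
    weights = softmax(scores)              # attention distribution over regions
    context = weights @ region_feats       # blended visual context vector
    return weights, context

rng = np.random.default_rng(0)
regions = rng.normal(size=(9, 16))  # e.g. a 3x3 grid of region features
state = rng.normal(size=16)         # decoder state while generating "stop"
w, ctx = attend(state, regions)
print(round(w.sum(), 6))  # 1.0: the weights form a distribution over regions
```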
[00:15:06] The final kind of early model we should talk about a little is GANs. Who here knows GANs? OK, that's a lot more than bag of visual words; I guess that makes sense. [00:15:14] The basic idea of a GAN is that you have a generator and a discriminator, and you want the generator to generate images that the discriminator cannot distinguish from real ones, so it cannot tell fake and real images apart. [00:15:27] If you do that, you can condition the whole thing on a piece of text, and then you can generate images from a text prompt. That's what the first versions of text-to-image generation were doing, and something like Stable Diffusion is a natural progression from that kind of model.
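A text-conditional GAN of the kind just described can be sketched like this; the tiny random-weight networks and all dimensions are made-up stand-ins, and only the forward losses are computed, not a training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# made-up weights for a toy generator and discriminator
Wg = rng.normal(size=(24, 32), scale=0.1)      # (noise + text) -> fake "image"
Wd = rng.normal(size=(32 + 8, 1), scale=0.1)   # (image + text) -> real/fake score

def generate(noise, text_emb):
    """Generator: map noise concatenated with a text embedding to an image vector."""
    return np.tanh(np.concatenate([noise, text_emb]) @ Wg)

def discriminate(image, text_emb):
    """Discriminator: score how 'real' this image looks for this text."""
    return sigmoid(np.concatenate([image, text_emb]) @ Wd)[0]

text = rng.normal(size=8)   # stand-in embedding of e.g. "a photo of a cat"
fake = generate(rng.normal(size=16), text)
real = rng.normal(size=32)  # stand-in for a real image's features
# discriminator wants real -> 1 and fake -> 0; generator wants fake -> 1
d_loss = -np.log(discriminate(real, text)) - np.log(1 - discriminate(fake, text))
g_loss = -np.log(discriminate(fake, text))
print(d_loss > 0, g_loss > 0)  # True True: both losses are strictly positive here
```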
[00:15:44] So those were the early models. Do people have any burning questions about this, or does this all make sense? All right. [00:15:56] So let's do a bit of a deeper dive, in particular on features and fusion, since those are really the core building blocks for all of this multimodal stuff. But before we go there, very briefly: if all of this multimodal stuff is cool and useful and doesn't look that difficult, why aren't we all doing multimodal things? Why do we focus on specific modalities? I think there are a couple of problems to be aware of. [00:16:26] One is that modalities can dominate: text in particular is much more dominant than vision or audio in many use cases, so you can get a model that picks up on the text signal and basically learns to ignore the image completely. That actually happened, embarrassingly, for visual question answering, which we'll get to: you could do visual question answering without actually looking at the picture.
[00:16:51] Another problem is that the additional modalities can add a lot of noise, which makes your machine learning problem more difficult. You also don't always have full coverage: as I said, if you look at Facebook posts, sometimes you have text, sometimes you have pictures, sometimes you have both, but there's no guarantee you always have both, so how do you deal with that? [00:17:05] And in many cases we just weren't ready; it was too complicated to implement things, and in general, designing your model to combine all the information is actually quite hard. [00:17:24] To drive that point home a little: featurizing text, I guess we all know how to do by now, especially in the age of Transformers, and before that LSTMs. You have batch size by sequence length by embedding size, so it's always a 3D tensor, and that's how you encode your textual information when you pump it through your neural net.
[00:17:46] With images it's trickier. You can just look at patches, but if you do convolutions you're shifting over the image and aggregating, and in many cases you don't want to be that uniform; you want something that actually looks at the things in the picture. [00:18:05] That's what region features are: you use an object detector as a first step for processing the image, and then a convnet backbone encodes the features for each particular sub-image, so this guy's skateboard, say, gets its own vector representation. [00:18:22] And then, in terms of dense features, we now also have Vision Transformers.
[00:18:28] Let's very quickly go over these to make sure we're on the same page. For detection there are all these models; YOLO is a really good one if you haven't heard of it yet, and we're at YOLOv7 now, I think, with a new one coming out every other year or so. The basic idea is that we get bounding boxes for things in the images, or actually segmentations along with the bounding boxes is what people tend to use, and they have labels, so this one is labeled "backpacker" or something. [00:18:55] You can run this as a pre-processing step on your image to get a much richer representation of what is really in that image, which you can then pump into your system, as we'll see later. [00:19:06] As for encoding the information in these little bounding boxes, or in the image itself in general,
we just use a standard convnet for that. [00:19:17] This probably feels super obvious now, but in 2014, when people were starting to discover it, it was really very surprising that you could just use off-the-shelf convnet features to replace the entire computer vision pipeline. People had spent decades refining all of this very fancy, sophisticated machinery, and then it was all thrown away and replaced by a convnet that does all of that for free. [00:19:43] The cool thing you get in return is that you can transfer very easily across tasks: you can take one very generic convnet and use it for all kinds of very specialized things, like spotting buildings in Paris, or flowers, or other stuff.
[00:20:00] And then of course we come to the age of Transformers. We're already quite a while in, and this is only the first Transformer in the slide deck, so we're making good progress. [00:20:12] Vision Transformers are what we would use these days to encode images. You have these flattened patches, and then you run basically the standard BERT architecture, as you know it from this course, and then you do classification. [00:20:26] Everything is a standard Transformer, except that the input is not words or tokens but patches of an image, and then you classify that.
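The "flattened patches" step can be sketched as follows; the patch size, image size, and random projection matrix are illustrative (a trained ViT would learn the projection):

```python
import numpy as np

def patchify(image, patch=8):
    """Cut an image into non-overlapping patches and flatten each one."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)          # group the two patch axes together
            .reshape(rows * cols, patch * patch * c))

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))        # toy 32x32 RGB image
patches = patchify(img)              # 16 patches, each 8*8*3 = 192 values
W = rng.normal(size=(192, 64))       # stand-in for the learned linear projection
tokens = patches @ W                 # the Transformer's input "tokens"
print(patches.shape, tokens.shape)   # (16, 192) (16, 64)
```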
[00:20:37] All right, so then we have a bunch of features; now how do we combine the information? Let's say we have two vectors u and v. It sounds easy, but it turns out there are very many ways to combine them, and I don't think it's really useful to go over every one here. [00:20:55] You can do very simple things: an inner product, or similarity, is what you would use for cross-modal purposes, where you want to embed things in the same vector space. You can put fancier projections on top, or take various linear combinations; you can do multiplicative things, where you multiply the components elementwise, or apply some sort of gating over the features; you can do attention; you can do bilinear things, or very fancy compact bilinear things. [00:21:26] There's really a wealth of literature on all the different ways you can combine two vectors, and this is called multimodal fusion.
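A few of the operators just listed, side by side in numpy (dimensions are arbitrary, and the gate uses a random matrix standing in for a learned one):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=32), rng.normal(size=32)  # one feature vector per modality

# similarity: the cross-modal option, if u and v share a vector space
sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

concat = np.concatenate([u, v])   # simplest linear-style combination
had = u * v                       # multiplicative, elementwise

# gating: let one modality decide how much of the other to pass through
Wg = rng.normal(size=(32, 32), scale=0.1)  # stand-in for a learned gate matrix
gate = 1 / (1 + np.exp(-(Wg @ u)))
gated = gate * v

# bilinear: every pairwise interaction between components of u and v;
# this blows up to 32*32 = 1024 dims, hence the "compact" bilinear variants
bilinear = np.outer(u, v).ravel()

print(concat.shape, had.shape, gated.shape, bilinear.shape)
```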
[00:21:37] Most of the literature on multimodality is essentially about this one question: what is the best way to do fusion? That's it. [00:21:44] Within that discussion it's maybe useful to distinguish between different levels of fusion. You can fuse very early: you get the different features and then, in the modern sense of attention, you attend to everything in all of the features from the beginning. You can first treat the modalities separately and combine them in the middle. Or you can keep them completely separate and only combine the final scores. [00:22:09] The first is what we'd call early fusion; my own term for the middle option is middle fusion; and then there's late fusion, where you really just combine the scores or the logits, with no interaction between the information from the different modalities.
[00:22:26] You can do really fun stuff with multimodal fusion. This is a paper I really like, FiLM, where you have these feature maps, this F here, and each one gets modulated by a multiplicative factor, this gamma, and an additive bias vector, this beta. You have a different pair for every layer of a ResNet, conditioned on some encoding of the thing you're after, in this case "are there more cubes than yellow things?". So we have a vector representation of that question, and we use it to modulate the ResNet blocks at every layer of the convnet. [00:23:08] You can do really fun things this way, modulating one network with the other and trying to have them learn as much as possible from each other.
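The modulation itself is just a per-channel scale and shift, gamma * F + beta; a minimal sketch, with a random linear map standing in for the real network that predicts gamma and beta from the question encoding:

```python
import numpy as np

def film(feature_maps, question_emb, Wg, Wb):
    """FiLM: scale and shift each channel of a conv feature map, conditioned
    on an encoding of the question (one gamma and one beta per channel)."""
    gamma = question_emb @ Wg  # (channels,)
    beta = question_emb @ Wb   # (channels,)
    # broadcast over the spatial dimensions of (channels, H, W) feature maps
    return gamma[:, None, None] * feature_maps + beta[:, None, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 7, 7))  # one ResNet block's output: 16 channels, 7x7
q = rng.normal(size=32)              # encoding of "are there more cubes than yellow things?"
Wg, Wb = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
out = film(feats, q, Wg, Wb)
print(out.shape)  # (16, 7, 7): same shape as the input, but question-modulated
```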
[00:23:20] All right, let's talk about late fusion then. Late fusion is what we would now call contrastive models. The basic idea is that we have a similarity score: we process the modalities completely independently, and only at the very end do we do some combination. The most famous instance of that these days is CLIP. Who's heard of CLIP? [00:23:43] OK. So CLIP, from OpenAI, uses again exactly the same contrastive loss that we've seen in all these early approaches. It does a kind of negative sampling, but in-batch: within a batch, the first piece of text and the first image are aligned, so that's the right answer, and I just want to make sure I rank that image higher than all the alternative images for that text, and rank that text higher than all the alternative texts for that image.
[00:24:16] It's a very, very simple idea; there's really nothing special about the architecture that was invented here. What made it so cool was, first of all, that it was Transformers all the way: the text encoder is a Transformer, and the image encoder is a ViT, so also a Transformer. [00:24:35] And it was trained on lots and lots of web data. Alec Radford is really a genius at creating very high-quality datasets; he created, I think, 300 million image-text pairs for this dataset, trained a bigger model on it than people used to, and we got this amazing model out of it.
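The in-batch contrastive loss described here amounts to a symmetric cross-entropy over a text-image similarity matrix; a numpy sketch (the temperature value and dimensions are illustrative):

```python
import numpy as np

def softmax_xent_diag(logits):
    """Cross-entropy where the correct class for row i is column i."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """In-batch contrastive loss: pair i is the only positive in row/column i,
    and every other item in the batch serves as a negative."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (batch, batch) similarity matrix
    # symmetric: rank the right image per text, and the right text per image
    return (softmax_xent_diag(logits) + softmax_xent_diag(logits.T)) / 2

rng = np.random.default_rng(0)
texts = rng.normal(size=(8, 64))
# perfectly aligned pairs give a near-zero loss; unrelated pairs do not
print(clip_style_loss(texts, texts) < clip_style_loss(texts, rng.normal(size=(8, 64))))
```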
[00:24:59] CLIP also moves away from single-word labels to the sort of text you would actually see on the internet: the caption for an image on the web is not going to say "dog" or "cat", it's going to say "a photo of a cat doing something". [00:25:12] That means you can do zero-shot label prediction, where you take a prompt like "a photo of ..." and figure out the right label for a given image. You probably all know about prompting large language models; you can prompt vision-and-language models in very much the same way and do zero-shot generalization. [00:25:37] If you want a really, really good paper, I recommend you read this one; it's going to teach you how to write really good papers. It's thorough, and it's really worth a very close read if you're interested in this area.
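Zero-shot labeling as described above amounts to embedding one prompt per candidate label and picking the most similar one; a schematic sketch in which random vectors stand in for the trained text and image encoders:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["dog", "cat", "plane"]
# stand-ins for the trained encoders: each label's prompt embedding is just a
# fixed random vector, and the "image" already lives in the same shared space
prompt_emb = {lab: rng.normal(size=64) for lab in labels}

def zero_shot_classify(image_emb, prompt_emb):
    """Pick the label whose prompt (e.g. "a photo of a {label}") embeds
    closest to the image, by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(prompt_emb, key=lambda lab: cos(image_emb, prompt_emb[lab]))

# an image whose embedding happens to sit right next to the "cat" prompt
image = prompt_emb["cat"] + 0.1 * rng.normal(size=64)
print(zero_shot_classify(image, prompt_emb))  # cat
```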
[00:26:01] But what really made it special was that it generalized much better to these other datasets. So this ResNet thing here is pretty terrible at some of these kind of adversarial versions of ImageNet, and CLIP is super robust to that, so it's just a way better image encoder in general. Very, very quickly after CLIP there was this paper from Google called ALIGN, which was basically exactly the same idea. You know, the field is not really that creative at all: it's the same idea, but then you just keep throwing more data and more compute at it, and it often works much better. That's what they found here too, and 1.8 billion image-text pairs instead of 300 million gives you a better model. Surprise. But so it's still very cool, and what is really cool, I think, is that there's this organization called LAION,
[00:26:55] where they've started this open-source collective to create really high-quality datasets. And so the initial LAION dataset was, how many examples in the initial LAION? 400 million, right. He knows; I know that he knows. And so now there's a much bigger version of LAION that's even multilingual, and it has five billion examples. So Stable Diffusion was trained on sort of the English subset of this thing, and that's one of the reasons that it's so awesome: it's just seen a ton of data, and that really makes your system a lot better. So if you're looking for, like, the ultimate dataset to play around with with your own ideas, if you have enough compute obviously, then you should really look at this dataset. All right, any questions up until this point? Nope, all right.
[00:27:54] So then we'll move on from late fusion to kind of middle fusion and early fusion, and this really is kind of the core of what I think a lot of people in the field are doing right now. If you're interested in getting into this field, or if you're going to go into industry and you're going to be using this stuff, this is what you should really understand. And again, the ideas sort of stack onto each other, so I've kind of sequenced the slides to give you an idea of how the scientists came up with the next step, and you can really see the architecture just get slightly more and more advanced, but basically a lot of it is just more data and more compute, again. So, who knows how BERT works? Everybody should raise their hands. So, yeah, BERT is kind of so canonical, I think everybody kind of gets how BERT works, right, so I don't think
[00:28:53] we need a real refresher. But the reason I have this slide is because I want you to think about: if you have a BERT model and you have a bunch of images, how are you going to turn that BERT model into something multimodal? Right, so there are a bunch of, like, obvious things you could do, given the kind of features I told you about and the sort of fusion process. So how are you going to do that? Does anybody want to, like, say something? [Student] Like, if you're doing classification, then just concatenate it to whatever encoder, like maybe an ANN or whatever, you're training on the data. Concatenating, okay, exactly, yeah. So you can take the ConvNet features and the classifier token from BERT, concatenate them, and then classify, for, like, a cat detector or something like that, or whatever the thing is you're interested in. Yeah.
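That suggestion, concatenate pooled ConvNet features with BERT's classifier token and classify on top, can be sketched like this; the shapes and the random "classifier" are illustrative stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real encoder outputs:
cls_token = rng.standard_normal(768)    # BERT [CLS] embedding of the text
conv_feat = rng.standard_normal(2048)   # pooled ConvNet features of the image

# Late fusion: concatenate the two modalities, then classify.
fused = np.concatenate([cls_token, conv_feat])       # shape (2816,)
W = rng.standard_normal((2, fused.shape[0])) * 0.01  # untrained 2-way classifier head
logits = W @ fused
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                          # softmax over the 2 classes
```

In practice `W` would be trained end-to-end on the task's labels; the fusion itself is just this concatenation.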
[00:29:52] So that's one thing. You could also, like, take the ConvNet features and, like, give them to the BERT model in lots of different ways, right? We can use the region features. And I think a lot of people who were working in vision and language processing when BERT came out were thinking exactly about, okay, so do we do middle fusion, late fusion, do we do early fusion, how do we do the fusion? And so there were a lot of papers all coming out basically at around the same time where people were doing versions of this, because BERT was really kind of the innovation, and then everybody sort of just plugged it into their own thing, because of Hugging Face Transformers and things like that. So the first thing is VisualBERT. This was one of the very early ones, where you have this image and people would do object detection on it, so you get, like, a hat and a racket and a shirt
[00:30:43] and things like that. So you can just really take these features and then plug them into your Transformer model, and then you try to, like, recover the features. And so this really is probably, like, the simplest way to do it, right? So this is what we call a single-stream architecture, where you concatenate all of the original input features and then put them through the same Transformer. What you can also do, and that's something that this model called ViLBERT did, is have two different streams. So you essentially have these two parallel Transformers, but at every layer you kind of give them a cross-attention, right, or co-attention as they call it. But it's basically, like, you just make sure you have an attention map that spans both, and then you just do your full normal Transformer layer again.
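The co-attention exchange can be sketched as two plain attention calls, one in each direction; this is a bare single-head version, leaving out the learned projections, residuals, and feed-forward parts of a real ViLBERT layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, image):
    # Each stream queries the other stream's keys/values, so the
    # attention map "spans both" modalities.
    d = text.shape[-1]
    t2i = softmax(text @ image.T / np.sqrt(d)) @ image   # text attends to image
    i2t = softmax(image @ text.T / np.sqrt(d)) @ text    # image attends to text
    return t2i, i2t

rng = np.random.default_rng(0)
text_tokens   = rng.standard_normal((5, 32))   # 5 text-token vectors
image_regions = rng.standard_normal((7, 32))   # 7 region-feature vectors
t_out, i_out = co_attention(text_tokens, image_regions)
```

Each stream keeps its own length but is now a mixture of the other stream's features, which is exactly what lets the two parallel Transformers talk to each other.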
[00:31:33] And then this you can train just like your regular BERT, right? So you have your masked language model here, and here you do sort of some equivalent of that, and then you also have your next-sentence prediction, which you probably remember from your BERT lecture, but instead here we're saying, okay, is this image aligned with this piece of text or not? There's also LXMERT. I mean, I could go on forever; there are like a hundred papers that came out that did this all at the same time. So LXMERT had a different cross-modal output encoder, and a bunch of different ways of encoding the positional information. So you could say, okay, I just have a bunch of bounding boxes that are featurized, but I don't care about where they are in the image, so it's just kind of like a bag of bounding boxes. Or you could say,
[00:32:22] I found it here, like this is the particular top-left and bottom-right coordinate, and that's what you featurize into your network. You can also do something even dumber, and I can say that because this is my paper, where you just take the image itself, you put it through a ResNet, and then you do a little bit of pooling on the final feature maps, and you just give those feature maps to BERT. And so you then need to distinguish between, like, your text segment embeddings and your vision segment embeddings. But so this actually works surprisingly well. You don't have to do any additional training: you can just take BERT out of the box. Initially you freeze it, you learn to project into BERT token space, then you unfreeze your ResNet, and then finally you unfreeze your BERT, and now you have a very good multimodal classifier on the problem you care about.
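A rough sketch of that pool-and-project step: the grid size, token count, and untrained projection below are all illustrative stand-ins (the real model learns the projection and then goes through the staged unfreezing just described):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ResNet output: a 2048-channel feature map on a 7x7 grid.
feature_map = rng.standard_normal((2048, 7, 7))
positions = feature_map.reshape(2048, -1).T   # (49, 2048): one vector per grid cell

# Pool the 49 cells down to 3 "visual tokens" by averaging contiguous chunks
# (a stand-in for the pooling; the exact scheme is a detail of the paper).
chunks = np.array_split(positions, 3)
visual_tokens = np.stack([c.mean(axis=0) for c in chunks])   # (3, 2048)

# Project into BERT's 768-dim token space; here the projection is random,
# in the real setup it is the first (and initially only) thing you train.
W_proj = rng.standard_normal((2048, 768)) * 0.01
bert_input_tokens = visual_tokens @ W_proj                   # (3, 768)
```

These three projected vectors are then fed to BERT alongside the text tokens, distinguished only by a vision segment embedding.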
[00:33:14] So a lot of these other papers, they're doing what they call multimodal pre-training, where first you have a BERT model and a ResNet, so they're kind of unimodally pre-trained, and then you couple them together, and then you have a multimodal sort of intermediate pre-training step before you fine-tune it on the problem you care about. And what we showed here is that you don't really need that, actually, in many cases, so it's a very strong baseline. You can also go to the pixel level completely, so that's what they did in this other paper called PixelBERT, where it's basically exactly MMBT, so the previous supervised one, but here they do do the multimodal pre-training step and show that, I think for VQA, it helps a little bit. So there are many of these BERTs doing sort of visual things.
[00:34:07] People really tried everything. Here's another one called UNITER, where they added a bunch of different losses. We could really talk about this for a very long time; we're not going to do that, I'm just going to kind of talk you through some of the more interesting ones. So this one, ViLT, I think is quite interesting, because here this is really the first instance where we've completely gone away from ConvNet features. So we don't do any pre-processing on the image: no region features, no backbone that featurizes the parts of the image we care about. We just have these patches of the image, so really we flatten those patches and we just pump them into the Transformer straight away. So this really is, like, sort of BERT and ViT together in one model, and this worked really very well. So that's been the trend.
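The patch step ViLT relies on can be sketched directly: cut the image into non-overlapping patches and flatten each one into a vector (a learned linear projection and the text tokens would follow in the real model; the sizes here are the usual ViT defaults, used as an assumption):

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Flatten non-overlapping patches into token vectors (ViT/ViLT-style).
    `image` is (H, W, C); returns (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    return patches

img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img)   # 14*14 = 196 patch tokens of dim 16*16*3 = 768
```

No detector, no ResNet: the Transformer sees the flattened pixels of each patch directly, which is what makes the pipeline so simple.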
[00:34:54] So here's a nice, very long list of all of these different models and what they do. And so really the distinctions are just in: what is the text encoder that you use, so do you use BERT or something fancier or better, like RoBERTa; what is your vision encoder, so in many cases you have these region features, so you would do an R-CNN-style thing, or you could just do a ResNet or a ViT; you have different kinds of fusion, so either single-stream or dual-stream as we talked about, right, so VisualBERT or ViLBERT; different pre-training tasks, so masked language modeling, image-text matching, and a bunch of, like, funkier ones you can do; and then finally you can do multimodal pre-training on all of these different datasets that have aligned data. So you're probably wondering, okay, so what is really the interesting difference between a lot of these?
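The design axes in that list can be summarized as a small config record; the field names and example values here are just a reading aid, an assumption of mine rather than any library's API:

```python
from dataclasses import dataclass

@dataclass
class VLModelConfig:
    text_encoder: str         # e.g. "bert", "roberta"
    vision_encoder: str       # e.g. "region-rcnn", "resnet", "vit-patches"
    fusion: str               # "single-stream" or "dual-stream"
    pretraining_tasks: tuple  # e.g. ("mlm", "itm")

# Rough placements of three of the models discussed above:
visual_bert = VLModelConfig("bert", "region-rcnn", "single-stream", ("mlm", "itm"))
vilbert     = VLModelConfig("bert", "region-rcnn", "dual-stream",   ("mlm", "itm"))
vilt        = VLModelConfig("bert", "vit-patches", "single-stream", ("mlm", "itm"))
```

Seen this way, most of the hundred-odd papers differ only in which cell of this grid they occupy, which is exactly the point made next.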
[00:35:49] And so I have another recommended paper that, if you're interested in this space, you should really take a look at. It's also a really well-done paper, where they unmask multimodal pre-training. So basically they say: if you take all of these little model inventions and you train these different models on exactly the same data in exactly the same way, it turns out that they're all basically the same. So that's a lot of, you know, wasted effort on the part of the field, because everybody is saying, like, oh, my model is better, but it's actually just because you trained it on different data, and there's no real sort of model innovation going on in a lot of these things. So I don't mean to sound discouraging or anything like that, but, you know, I think that's why this paper is really nice and really important: it just shows us what really matters.
[00:36:42] So this is also work that I did myself, called FLAVA, with my team, where we wanted to take these ideas really to the limit. So a lot of the things that you've seen now, the VisualBERTs and the ViLBERTs and things like that, they're all about multimodal questions: how can we do visual question answering, something like that, where we just have these two modalities, and we only care about problems that always involve these two modalities. And where we want to go, and this is kind of the basic premise, I think, of foundation models in general, is that we have one model to rule them all, right? So this one model can consume data from all of these different modalities, and you can synthesize across all of these different modalities and then do useful things with that information. So with FLAVA that's exactly what we tried to
[00:37:30] build. So we wanted to have one foundation model that is good at vision-and-language, and computer vision, and natural language processing, jointly pre-trained on all of these different data sources. So it's also trained on just text, CC News, so CommonCrawl news, and BookCorpus, so it's very good at the sort of things you would expect BERT to be good at. It's trained on ImageNet for image data, so it's good at the things that you would expect a kind of basic image model to be good at. And then you have this PMD dataset that we created out of publicly available image-text pairs that we also train it on. So this PMD dataset is really just: take all the datasets that were ever created that have image-text pairs and that are publicly available. So unfortunately the CLIP data and the Google ALIGN data and all of those datasets, they haven't been open-sourced. This is before LAION, so now
[00:38:16] there's a good alternative to this. But so this PMD dataset, if you combine all of these image-text pairs, you get 70 million of them, so that's still a pretty decent size. And then you can take all of this data basically to solve all of these problems that we know we care about in these different fields: you can do multimodal reasoning, you can do language understanding, you can do visual recognition, all with exactly the same model, and that's a very powerful idea, I think. Like, if you work at a company like Facebook, you don't want to have different models for all kinds of different things; you want to have one model that you can really use for everything, and that's going to really make your life a lot easier. So the exact architecture here is that on the one hand we have this image encoder, where we take the image, we encode it as patches, and we just do what
[00:39:03] we call masked image modeling, but it's basically masked language modeling, just on the image tokens, right? And then on the other side we have the masked language modeling on the language, so your regular sort of BERT thing, and then we have a multimodal part where all of this information gets combined. So we have a masked multimodal modeling loss term, where you can also do image-text matching, so this is like your BERT next-sentence-prediction thing, and then we also have a global contrastive loss, which is exactly like CLIP. So if you do all of this stuff, it's just all Transformers all the way down, and it's sort of a very elegant way, I think, to combine a lot of this information. And when you do that, you get something that can really do a lot of things very well.
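The global contrastive piece, the CLIP-like loss in that mix, can be sketched as a symmetric cross-entropy over a batch of image-text pairs; the embeddings and temperature below are toy values, not FLAVA's actual setup:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """CLIP-style contrastive loss over a batch of aligned pairs:
    matched (image i, text i) pairs should outscore all mismatched pairs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature               # (B, B) similarity matrix
    n = len(logits)
    diag = np.arange(n)
    p_i2t = softmax(logits, axis=1)[diag, diag]      # image -> matching text
    p_t2i = softmax(logits, axis=0)[diag, diag]      # text -> matching image
    return -(np.log(p_i2t).mean() + np.log(p_t2i).mean()) / 2

rng = np.random.default_rng(0)
B, d = 4, 32
txt = rng.standard_normal((B, d))
# Aligned batch (images nearly equal to their texts) vs. a shuffled, unrelated batch:
aligned_loss = contrastive_loss(txt + 0.01 * rng.standard_normal((B, d)), txt)
shuffled_loss = contrastive_loss(rng.standard_normal((B, d)), txt)
```

The loss is near zero when each image's best match in the batch is its own caption, and grows when it is not, which is what pushes the two encoders into a shared space.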
[00:39:54] So we're not going to talk about that table, it's just way too many numbers, but just trust me, we were pretty thorough generating the table here. And so over 35 different tasks, if you compare FLAVA to all kinds of different ablations in terms of CLIP models, then this is just a much better way to get to this information. So I think this is a nice example of where we're probably going to go with the field in the near future. So the other trend that we see very obviously in the field right now is that everybody cares about generative models, right? So, you know, language models and image generative models: there's just a trend where we want to be generative, we want to move away from this contrastive, discriminative stuff to the more interesting, maybe richer representations that you get out of generating sequences or images. So this SimVLM paper
was one of the first ones where they really had this separate decoder that was trying to generate, or complete, captions, which they showed gives you a lot richer representations. I think the current state of the art now is actually called CoCa. A lot of these models all look very similar again, but in this case we're now starting to really see these text decoders. Initially with CLIP I think that's also what they were trying to go for, OpenAI being a company that really likes generative models, but they couldn't really get it to work, and I think it took us a while as a field to really figure out how to do this the right way.
[00:41:20] And so right now we're really in the age of language models, right? One of the interesting things you can do with language models is to just keep them frozen and then learn how to project into them. So the MMBT architecture I talked about, where we had this BERT model: we kept it frozen and we learned to project into the BERT token space. You can do exactly the same thing with a much fancier model, or something like T5 even, where you have an encoder-decoder or some kind of generative part. You keep that thing frozen, and then you learn to project into the token space of that frozen language model, and then you can do lots of fun stuff, it turns out. [00:42:05] What they show in this paper is that you then get few-shot learners: all of the things you see with GPT-3, where you can just give it some in-context examples and it's going to figure out the binding on the fly. So it says something like "this is a dax, this is a blicket, so what is this?", and then it gives you the answer: that it's the dax.
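The recipe described here, keeping the language model frozen and learning only a projection from image features into its token-embedding space, might look roughly like this in NumPy; every name and dimension is a made-up toy stand-in for the real frozen vision encoder and frozen transformer LM:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, n_prefix, d_model = 512, 2, 64   # toy sizes, not from any real model

# The ONLY trainable parameter: a projection into the LM's embedding space.
W_proj = rng.normal(size=(d_img, n_prefix * d_model)) * 0.02

img_feat = rng.normal(size=(d_img,))    # stand-in for a frozen vision encoder's output

# Map the image feature to a short "prefix" of pseudo-token embeddings.
prefix = (img_feat @ W_proj).reshape(n_prefix, d_model)

# Stand-in for the frozen LM's embeddings of the text prompt tokens.
text_tokens = rng.normal(size=(5, d_model))

# The frozen LM consumes the image prefix followed by the text, as if the
# image were just extra tokens; only W_proj would receive gradient updates.
lm_input = np.concatenate([prefix, text_tokens], axis=0)
```

Because only the projection is trained, the LM's pretrained abilities, including in-context learning, are preserved, which is what makes the few-shot behavior possible.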
[00:42:26] So it really learns in context how you decide the feature mappings, which is really solving the grounding problem that a lot of this multimodal stuff started with. So I think that's very cool.
[00:42:38] And then probably one of the coolest models right now, that you might have heard of if you follow the field, is Flamingo, out of DeepMind, where they take a Chinchilla language model, so a compute-optimal language model, and now you have this vision encoder that encodes multiple different images that you can then do reasoning over and then kind of autocomplete. What this gets you is just a much more powerful model, because you can be generative over lots of different images. So it's really stepwise, you can see it, right: we started off with very simple Transformers, and now we're at something that is starting to get pretty complicated, because we have these building blocks like a Perceiver Resampler, where we have a bunch of different images that we featurize, and now we need to compress the information. Sometimes we have three images, sometimes we have five, so we want to make sure we can compress it so that it's always ready for consumption by the next layer of the language model.
[00:43:39] And then this paper, again, is a really good paper to read, because (this is not me, this is not my code, this comes from the actual paper) they have the diagram together with the code, so that you can really understand what it's doing, which I think is really great. [00:43:56] And so once you have your Perceiver Resampler step, what you then do is a gated cross-attention. This is how you implement it.
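The gated cross-attention idea can be sketched as a single-head toy in NumPy (the shapes, the single head, and the scalar gate are simplifying assumptions, not Flamingo's exact implementation): a tanh gate initialized at zero means the block initially passes the language states through unchanged, so training starts from the intact frozen LM.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(x, visual, Wq, Wk, Wv, alpha):
    """Language states x attend to visual tokens; tanh(alpha) gates how much
    visual signal is added. With alpha = 0 the block is the identity."""
    q, k, v = x @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return x + np.tanh(alpha) * attn            # gated residual connection

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))        # 4 text hidden states
visual = rng.normal(size=(6, d))   # 6 resampled visual tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out_init = gated_cross_attention(x, visual, Wq, Wk, Wv, alpha=0.0)
out_open = gated_cross_attention(x, visual, Wq, Wk, Wv, alpha=1.0)
```

As the gate parameter is learned, the model gradually lets visual information modulate the frozen language model's states.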
[00:44:06] And this gated cross-attention, you do it before your frozen language-model layer. So you really just have a frozen Chinchilla language model, and you learn to modulate the information that goes into that language model. You propagate the gradients all the way back; you just don't update the language model. So you're really trying to figure out: how am I going to design my signal so that my language model can do the most with it, right? How am I going to combine the information? You'll notice that now we do it before the layer; in a lot of other work you would do the attention after the layer, but here you do it before.
[00:44:41] So Karpathy, I think more than ten years ago, had this image: it's Barack Obama kind of setting his foot on the scale to make somebody think they're
a lot heavier than they really are. [00:44:54] So this is obviously funny to us, but not to an AI system, I think, unless it really understands the scene. And that's why Karpathy at the time said this would be a really good visual Turing test: if a system can figure this out, then it's actually really smart. And so it's obviously been a bit of a challenge for everybody working in the field since then to get something that actually works on this, and Flamingo, as it turns out, kind of gets the joke. [00:45:21] Though it's a bit unclear whether it really gets the joke, because if you read this conversation, it's sort of getting steered in the right direction, right? But at least we're making progress, let's put it that way.
[00:45:33] And then, so in Flamingo you still have a lot of moving parts, but you can really take this almost to the full extreme, where you
try to freeze almost everything, and you just want to learn this mapping between your image encoder and your language model, or your image encoder and your encoder-decoder architecture, and all you really do is just the projection between the two. So there's this nice model called BLIP-2, where they experiment with OPT for the language model and Flan-T5 for the encoder-decoder architecture, and this just gives you amazing results. It gives you really complex captions and things like that, without any direct supervision on the captions themselves, which is pretty impressive, I think. So that just shows you the power of language models in general.
[00:46:19] So here are some examples: it can really do different things, from captioning to reasoning to visual question answering to location detection. You can have a long conversation with this system.
[00:46:33] This really is kind of the future of where we're going, right: we're going to have a ChatGPT, but it's also going to be able to see the world, in a way. [00:46:43] And I think an interesting thing: you've probably heard of Chain-of-Thought prompting, where you ask the language model "let's think step by step." You can tell a vision-and-language model to generate a rationale for why something might be the case, so you generate a potential explanation for what your answer might be, and after that you ask it to answer the question. It turns out that if you do that sort of multimodal Chain-of-Thought prompting, the system gets much better, and this was the new state of the art on ScienceQA, a benchmark like that, just because it learns to unpack the information. [00:47:20] I think, as a field, we're really just starting to figure out what the potential of this is. I think this is the paper where they also showed that multimodal Chain-of-Thought prompting gets you pretty amazing results: they show very nice results on Raven's matrices, very complicated kinds of IQ tests, the things that humans are supposed to be really good at; you have to be a pretty smart human to be good at this, and this system just nails it.
[00:47:48] So you know, we're making super fast progress. We started off from a very simple BERT model that was able to look at some pictures, and now we're getting to these very sophisticated foundation models. So that was my short history of multimodal foundation models.
[00:48:06] So, how much time do I have left? All right, okay, plenty of time. Yeah, please, questions.
[00:48:24] [Student] ...one of the images, they just looked like they were
[00:48:27] boxes passed through, with kind of no sense of shape in them? Yeah, so I think the history of computer vision has been very similar to the history of natural language processing, where we thought we needed all of this structure and all of these different things, and it turns out you can just throw it all away and just have a big Transformer over the patches.
[00:48:51] Sorry, yes? [Laughter]
[00:48:59] [Student] You mentioned a couple of times that a model is frozen; what does that mean? Yeah, sorry, I should have explained that better. It just means that we are not updating the weights. So, I think this era is a nice example: we have frozen self-attention, and that just means that when we do a forward pass, we go all the way to whatever we want to predict, we get some gradients, and we take them all the way down, but we only update the non-frozen layers.
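That forward/backward description can be made concrete with a scalar toy model (the numbers are entirely hypothetical): the gradient for the frozen weight is still computed, because the chain rule has to pass through it, but only the non-frozen parameter is updated.

```python
# Toy two-weight "network": y = b * (a * x), where `a` plays the role of a
# frozen pretrained layer and `b` a trainable one.
a, b = 2.0, 3.0
x, target, lr = 1.0, 0.0, 0.1

y = b * (a * x)                 # forward pass
loss = (y - target) ** 2
grad_y = 2 * (y - target)
grad_b = grad_y * (a * x)       # gradient reaches b *through* frozen a
grad_a = grad_y * b * x         # computed during backprop, but never applied

b = b - lr * grad_b             # only the non-frozen weight changes
# a stays exactly 2.0: frozen
```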
[00:49:29] Right, so here the gradients actually do get computed, but these layers just never change. The reason you want to do that is that otherwise you're going to drift way too far, and then you're going to destroy all of the cool stuff your language model has learned, because you're just going to focus on the small dataset that you're training it on. So you want to preserve the abilities of the language model, but you want it to become good at the thing you care about.
[00:49:58] Other questions?
[00:50:02] [Student] Is there a benefit to doing earlier or mid fusion, as opposed to only doing late fusion? Yeah, so we're going to talk about evaluation next, but it really depends on the tasks that you care about. I would say earlier is always better if you can afford it. CLIP is very efficient to train; it's very late fusion,
right at the very end, so there's no interaction between the different modalities. That's really good if you want to be very efficient; for training it's much nicer. But if you want a richer understanding of the multimodal signal, then you want to do earlier fusion. So yeah, it's always a trade-off, right?
[00:50:50] [Student] Images are just a lot more data than text, so how much more difficult are these to train, and how much bigger does the image-processing part have to be compared to the language model? [00:51:05] Yeah, so images are more complex in a way, but they're also higher-bandwidth representations, right? There are a lot of pixels that our brains just abstract away; it's really about the scene that you're seeing, and you're not really thinking too much about the
pixels themselves. [00:51:26] So Yann LeCun likes to say that language is just a kind of low-bandwidth proxy for a language of thought, which is much richer and much higher-bandwidth, and he thinks that's probably visual; I'm not so sure. But I don't think there's necessarily a difference between the scaling laws that you see in these systems, or at least we still have to figure that out; we'll talk about that towards the end as well.
[00:52:00] [Student question about bias, partly inaudible] Oh yeah, they have terrible biases, yeah. [00:52:09] Some people are actually working on this who are in this very room, but these models can be very racist in what they generate or in the kind of predictions they make. If you have an Asian basketball player standing sort of like this, with a basketball
very obviously there, then the model will think that he's playing ping pong, because he's Asian. I'm not joking. [00:52:34] So these models, just like all neural networks, right, this is really a big problem, and one of the most interesting problems you should be working on, if you're a student and you want to make a difference, is how we get these systems to be much better at these sorts of things.
[00:52:54] [Student] In the examples you showed, the model interprets the content of the image; if we really want to understand the content of a video, what challenges do you see, and what improvements could be made? [00:53:08] Yeah, so you're asking about the attention mask, sort of, right? You can use the same idea for videos: you just look at the video, and these systems are so good now, the object detectors are so good, that you can really track objects in close to real time as
they go through your video, and so you can try to check how that aligns with the attention mask in your model. [00:53:31] Videos, I think, are sort of interesting, but they're also not really that interesting, because you can very often just sub-sample images and solve the images rather than having to deal with the complex video.
[00:53:49] All right, maybe one more question, and then we'll go do some evaluation. [00:53:52] [Student] With these multimodal models, when you only provide a single source of media, let's say only text or only vision, how do they perform in that case? Because they're obviously more geared toward multimodal cases. [00:54:05] Yeah, so that's one of the giant shortcomings of a lot of these models: they're really just built for multimodal stuff. So what if I don't have an image, right?
[00:54:17] And so, I mean, that's why we did FLAVA, because we want to have one model that can do all of that stuff, and that's why in MMBT, the supervised multimodal bitransformer, we actually have an analysis of how robust the model is to missing images or missing text. But I think a lot of the folks working on these early visual BERT models were kind of myopically focused on VQA, which is actually a great segue to what I want to talk about next. [00:54:48] It really depends on the tasks that you care about, as I said, and I think if I'm going to tell you about multimodality, I also have to tell you how you're going to check that a multimodal system is actually good at multimodal things. That's the topic of evaluation, which is actually a super important topic; a lot of people want to be
cool and build big models, but I think it should be way cooler to do proper evaluation of these models, especially if you're in academia, because you only have limited GPUs anyway, right? So what can you do? [00:55:22] Sorry, I don't want to rub it in. No.
[00:55:26] So how do you check? Well, there's this amazing project. ImageNet really changed the history of deep learning, I think, and this other dataset, COCO, I think also really changed especially vision-and-language, but also vision in general. They cover a bunch of the main sort of multimodal tasks, and these images are very richly annotated with all kinds of different things: the segmentation of the objects, the bounding boxes, the labels of the bounding boxes; they come at different pixel granularities; it's a huge dataset, and it's very fine-grained,
[00:56:09] annotated in terms of the categories that it has, and then you have five captions for each of the images. This really was the first dataset that unlocked vision and language processing at scale, because you had your picture and you had your caption, and now you need to figure out: how do I give the right caption for this image? That's image captioning. Or: given some piece of text, can I retrieve the right image, or given an image, the right piece of text? There's a bunch of very impactful datasets that do this; we already talked about LAION, but COCO really is still the main one, the canonical instance of this dataset category that a lot of people use. And then the other thing that people really care about in vision and language processing is visual question answering.
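To make the retrieval setup concrete: caption-to-image retrieval on COCO is usually scored with Recall@K, i.e. for each caption, does the matching image land in the top K retrieved? A minimal sketch, where the similarity matrix is made up and would in practice come from a trained model:

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of caption i to image j; the matching
    image for caption i is image i. Returns the fraction of captions whose
    matching image appears among the top-k retrieved images."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank images for caption i by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)

# Toy 3x3 similarity matrix: captions 0 and 2 rank their own image first;
# caption 1 ranks its image second, so it only counts from k=2 onward.
sim = [[0.9, 0.2, 0.1],
       [0.8, 0.6, 0.3],
       [0.1, 0.4, 0.7]]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

Image-to-text retrieval is the same computation with the matrix transposed.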
[00:56:59] There really are a bunch of academic groups who are, or have been, so focused on this task that they didn't really care about anything else, and that's why you see a lot of models that are optimized just for VQA and nothing else. You can see that reflected in the citation counts, as of last night 3 a.m., where the VQA dataset just has way more citations than even the image captioning datasets. What you do here is you just have an image, and then people ask very simple questions: annotators ask these simple questions, they give the answers, and now we want to be able to answer the questions with machines. And as I alluded to earlier, one of the kind of embarrassing backstories of this dataset is that in the initial version the images were found to not really matter at all.
[00:57:52] You could just look at the question. It could be something like "how many slices of pizza are there?"; well, not in that particular case, but across almost all of the dataset the right answer to how-much or how-many questions was "two". So if you just predicted "two" for every how-much or how-many question, you got around 70% accuracy on the counting category. So careful dataset design, or evaluation benchmark design, is really a skill, and you really need to think about what you're doing: you can't just set some data aside and evaluate on it. There's GQA, by Chris actually, which I think is a better-designed version of this dataset, so you might want to use that these days. There are also very targeted datasets that really try to measure one particular thing.
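That "always answer two" shortcut is exactly what a blind majority-class baseline exploits. A sketch with invented toy data (not the real VQA evaluation code), keying each question on its first two words as a crude question type:

```python
from collections import Counter

def majority_baseline(train_qas, test_qas):
    """Predict, for each test question, the most frequent training answer
    for its crude question 'type' (here: the first two words)."""
    by_type = {}
    for q, a in train_qas:
        key = " ".join(q.lower().split()[:2])
        by_type.setdefault(key, Counter())[a] += 1
    correct = 0
    for q, a in test_qas:
        key = " ".join(q.lower().split()[:2])
        pred = by_type[key].most_common(1)[0][0] if key in by_type else None
        correct += pred == a
    return correct / len(test_qas)

# Invented toy data: the blind baseline answers "2" to every "how many"
# question without ever looking at an image.
train = [("How many slices of pizza are there?", "2"),
         ("How many dogs are in the photo?", "2"),
         ("How many people are smiling?", "3")]
test = [("How many cats are there?", "2"),
        ("How many chairs are there?", "4")]
print(majority_baseline(train, test))
```

If a baseline like this scores highly, the benchmark is not actually measuring multimodal ability.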
[00:58:45] And I think one of the things we really want to get at with these models is what we would call compositionality: we want to be able to take the parts, reason about the whole, and understand the relationships between the different concepts. So CLEVR was a very clever dataset that was designed to measure compositionality both on the language side and on the vision side: you have to understand the relationships between all of the different objects in the images. That's been a pretty impactful dataset, I think, for really forcing people to think about compositionality. But a lot of these datasets had big problems. One of the problems is that they were too easy; VQA is sort of plateauing out, and we can talk about that a little bit too. It also wasn't really realistic: you could solve VQA
[00:59:32] and that's probably going to make some people's lives better. You're all trying to process the memes, I can see everything. Okay, let's get to the memes first then. So obviously these memes are not actually in the dataset; I could put up some really hateful memes, about Hitler or something, which are in the dataset, but that would be less fun. These are mean meme examples meant to demonstrate how the dataset was constructed. One of the problems, as I said, is that in VQA the V didn't really matter. If we care about multimodality specifically, what we want is a dataset that you can only get right if you are good at multimodal reasoning, and otherwise you're just going to screw it up. And so this is what we came up with: if you have a meme like
[01:00:24] this one, "love the way you smell today", I mean, that's not very nice if you send it to your friend, right? But it turns out that if you just swap out the background, now it's a very nice thing to say. And with this one, you know, you're maybe a bit weird if you like this, but there's nothing wrong with it, right? And it's the same for this one here, "look how many people love you", with the tumbleweed: that's really sad, but if you change just one word, suddenly it's a really nice thing to say. So if you want to classify these correctly for meanness, then you have to really understand multimodal reasoning: you have to understand the relationship between the image and the text in order to get to the right label.
[01:01:13] And so it was really constructed by design to do that. How we did it, exactly: we used some really highly trained annotators. One of the big problems with a lot of these datasets is that nobody really knows who owns a meme; somebody makes a meme, and now they technically own the copyright. When I made this dataset I was working at Facebook, and they were very afraid of copyright issues, so what we actually had to do was pay people to make new memes. And not from scratch: we could show them the actual examples, and then they had to find images that roughly corresponded to the original source image and recreate the meme, but now with an image that we could buy from Getty. And so we gave a lot of money to Getty so that we could then
[01:02:09] release the dataset to the public, so that people could actually do research on it and understand whether their multimodal models are good or not. And we really tried to make it so that we had these benign co-founders... benign confounders, sorry; it's a startup world, with the co-founders. So the confounder here is that you have your original meme, and then you have a confounder where you swap out one of the modalities and keep the other one. We had our annotators do that as well. And this led to a really nice dataset, I think, because it confirmed an intuition that a lot of people in the field had, which is that multimodal pretraining doesn't really work. Is that an alarm? So: multimodal pretraining doesn't really work.
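One way to see why benign confounders bite is to score models in pairs: credit a model only when it classifies both the original meme and its confounder correctly. This is an illustrative sketch, not the official Hateful Memes metric (the benchmark reports accuracy and AUROC), and the filenames and labels are invented:

```python
def paired_accuracy(pairs, predict):
    """pairs: ((image, text, label), (image2, text2, label2)) tuples, where
    the second item swaps one modality and flips the label. A model is
    credited only if it classifies BOTH items in a pair correctly, so a
    unimodal shortcut that ignores the swapped modality never scores."""
    both = 0
    for (x1, t1, y1), (x2, t2, y2) in pairs:
        both += (predict(x1, t1) == y1) and (predict(x2, t2) == y2)
    return both / len(pairs)

# Invented example pair: same text, image swapped, label flipped.
pairs = [(("skunk.jpg", "love the way you smell today", "mean"),
          ("roses.jpg", "love the way you smell today", "benign"))]

# A text-only model assigns both items the same label, so it scores 0 here.
text_only = lambda image, text: "mean"
print(paired_accuracy(pairs, text_only))
```

A model that actually uses the image can separate the pair; the text-only shortcut cannot.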
[01:03:01] And so all of this stuff that people had been doing with their fancy visual BERT models actually turned out to maybe not be that useful: it got you maybe one point extra, from VisualBERT to a different visual BERT variant, less than a point, just from the multimodal pretraining. So that means we still have to figure this stuff out. This dataset is far from solved, and we still have a long way to go, despite all of these fancy models and a new paper coming out every week that does something new. We're not there yet, and I think that's encouraging, especially for you: you can go out and solve it. So what we did with this dataset is we organized a competition, with 100K in prize money, to see what people could come up with. And there was a lot of nice work coming out of that,
[01:03:53] and we really managed to crank the numbers up by quite a lot. But the solutions were slightly disappointing. I don't know if you've ever used Kaggle, but if you want to win on Kaggle, you just have to ensemble the hell out of all of the different models in the current state of the art, and then you're very likely to win. And that's what happened here: there wasn't really the fundamental breakthrough we had maybe been hoping for, so that still needs to be built, I think. So there's this other dataset I want to briefly talk about. The theme of this section is: if you make a dataset, think about it very carefully, because you can be very creative with it and really measure the things you're trying to get at.
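The ensembling trick he describes is typically just soft voting: average the class probabilities of several models and take the argmax. A minimal sketch with made-up model outputs:

```python
def soft_vote(prob_lists):
    """prob_lists: one probability distribution over classes per model.
    Averages the distributions and returns the argmax class index."""
    n_models, n_classes = len(prob_lists), len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

# Made-up outputs from three models on a binary task: model 0 leans toward
# class 0, but the averaged distribution favors class 1.
models = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(soft_vote(models))
```

Averaging washes out the idiosyncratic errors of individual models, which reliably buys leaderboard points without any new idea, which is exactly the disappointment being described.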
[01:04:41] So with this dataset, Winoground, we were trying to figure out: okay, how good is CLIP actually? It looks really amazing, and it's way better than what was there previously, but does it understand compositional relationships in the same way that humans would, or is it just fitting the data distribution, so it can be very good at the head of the distribution but terrible at the tail? And you can probably already guess where this is going. Just to give you an illustration of what is in this dataset: you would have some plants surrounding a lightbulb, or you would have a lightbulb surrounding some plants. Notice that the words here are exactly the same words,
the Visio semantic [01:05:30] good at understanding the Visio semantic or the yeah visual linguistic [01:05:33] or the yeah visual linguistic compositionality uh of these these uh [01:05:36] compositionality uh of these these uh these uh examples then then you can get [01:05:39] these uh examples then then you can get it right but again if it's actually just [01:05:41] it right but again if it's actually just overfitting on the data distribution [01:05:42] overfitting on the data distribution that is seen and it just kind of is [01:05:44] that is seen and it just kind of is biased toward what it sees often then it [01:05:47] biased toward what it sees often then it doesn't really get it right and so one [01:05:49] doesn't really get it right and so one paper uh that we use as a source of [01:05:52] paper uh that we use as a source of inspiration for this work is uh this [01:05:55] inspiration for this work is uh this paper here Order word matters [01:05:57] paper here Order word matters pre-training for little uh so we [01:05:59] pre-training for little uh so we actually found that the order of words [01:06:01] actually found that the order of words doesn't even matter that much for [01:06:03] doesn't even matter that much for General pre-training very often uh which [01:06:06] General pre-training very often uh which is also kind of a scary thing right so [01:06:07] is also kind of a scary thing right so this is deep learning for NLP we think [01:06:09] this is deep learning for NLP we think that you know language is really [01:06:10] that you know language is really important but these models can can [01:06:12] important but these models can can reason about language even if you [01:06:14] reason about language even if you shuffle all the words [01:06:16] shuffle all the words um and so that's that's probably not [01:06:18] um and so that's that's probably not what we want to have and so that that [01:06:20] what we want to have and so that that doesn't tell you 
[01:06:23] something about how great we are as researchers; it tells you something about how terrible our evaluation benchmarks are, and that's what we need to fix. So, what we did with this dataset: there are some other nice examples, like "there's a mug in some grass" or "there's some grass in a mug". These are very different pictures, and for us they're trivial. Like, what's the difference between a truck fire and a fire truck? Pretty important, I think, to get that distinction right. So, guess what: state-of-the-art models often perform below random chance. As I said, we still have a lot of work to do, which is good. And when this paper came out, I think the reaction was really nice.
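Winoground's scoring makes that "below random chance" claim precise. Each example has two images and two captions built from the same words; following the paper's definitions, a model earns the text score if each image prefers its own caption, the image score if each caption prefers its own image, and the group score only if both hold. The similarity numbers below are invented stand-ins for a real model's image-text scores:

```python
def winoground_scores(sim):
    """sim[(i, c)]: similarity between image i and caption c, i, c in {0, 1}.
    Image i and caption i are the matching pair."""
    # Text score: for each image, the matching caption must win.
    text_ok = sim[(0, 0)] > sim[(0, 1)] and sim[(1, 1)] > sim[(1, 0)]
    # Image score: for each caption, the matching image must win.
    image_ok = sim[(0, 0)] > sim[(1, 0)] and sim[(1, 1)] > sim[(0, 1)]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# Made-up similarities for "plants surrounding a lightbulb" (pair 0) vs
# "a lightbulb surrounding plants" (pair 1): image 1 slightly prefers the
# wrong caption, so the model loses the text and group scores.
sim = {(0, 0): 0.31, (0, 1): 0.27, (1, 0): 0.30, (1, 1): 0.29}
print(winoground_scores(sim))
```

Random guessing gets the group score 1/6 of the time, which is the chance level the models fall below.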
[01:07:18] So when DALL-E 2 came out... you've probably heard of DALL-E 2, right? It's sort of like Stable Diffusion, but from before Stable Diffusion. This was really the first model that showed just how impressive these generative models can be at creating images. So this is "there's a mug in some grass". You do have to cheat a little bit, because you have to add "digital art" to the prompt; if you don't add that, it breaks down completely. So it's sort of prompt hacking, I think, or sort of tuning on the test set, but okay. This is pretty good, definitely better than I think a lot of people would have expected even a couple of years ago. But it's not perfect, because people on the internet like to take more pictures of spoons than forks: if you say "there are fewer spoons than forks" or "there are fewer forks than spoons", it
[01:08:16] just really likes spoons more. You know, maybe it's like the Matrix or something, I don't know; spoons are just nicer. So again, what you can see here is that these models really are just reflections of the data they're trained on. And yes, models are getting better, but if you've looked at Stable Diffusion, it still can't count fingers and things like that. So again, there's still a lot of cool work to be done. Any questions on evaluation? No? Okay, so let's talk about other modalities then, because we've really just been focused on images. Images are great: there are lots of images on the internet, which makes them an obvious thing to focus on. Also, if you look at our brain, vision is a very dominant modality; how we understand the world is very vision-driven. But it
[01:09:18] doesn't have to be the case. There are all these other interesting problems that involve different modalities, and the most obvious one is speech, or audio: after seeing comes hearing. We could really do another lecture just like this one purely on speech and audio, and there's lots of interesting stuff to talk about; obviously we don't have time, but I'll give you another nice example of how amazing Alec Radford is at creating datasets. So there's this Whisper model that came out of OpenAI not too long ago, which was trained on 680,000 hours of multilingual, multitask speech data, that is, speech with transcriptions. And they trained this very fancy thing on it, which is actually not very fancy at all: it's just the log-mel spectrogram, which is how you represent the audio signal, and then you feed that into a big Transformer.
[01:10:09] So this is sort of your encoder, with self-attention, and then you have your decoder, with cross-attention, and then you just generate the sequence. It's a basic encoder-decoder Transformer model, but the input is one-dimensional convolutions over the log-mel spectrogram. There are lots of papers that do very similar things: there are models like wav2vec that try to turn the wave signal into vectors, or you can discretize it in lots of different ways, so there's a wealth of literature. Then one of the funny observations, actually, is that you can just reduce audio to vision anyway, which is sort of what you could argue the log-mel spectrogram does. Not to toot my own horn, but in 2017 I did a paper where we showed that you can just take a raw audio sample and turn it into a spectrogram.
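A log-mel spectrogram can be computed in a few lines. This is a simplified numpy sketch (Whisper's front end uses 80 mel bins, 25 ms windows, and a 10 ms hop at 16 kHz, which the defaults below mirror, but its exact filterbank and normalization differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Slice the waveform into overlapping frames and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filters spaced evenly on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return np.log10(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)

# One second of a 440 Hz tone at 16 kHz -> a (frames x mel bins) "image".
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = log_mel_spectrogram(tone)
print(spec.shape)
```

The result is a 2-D array, which is exactly why "reduce audio to vision" works: this is an image you can hand to a convnet.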
Really just a spectrogram [01:11:03] so, what does the spectrum of the audio file look like, feed that to a regular convnet, like an AlexNet even, and that gives you amazing auditory features [01:11:12] so now you can use this to distinguish between violins or guitars and things like that [01:11:18] so maybe you can just reduce all of this to vision [01:11:20] and one question you could ask is, can we also reduce language to vision, or vision to language [01:11:27] that's sort of what people are thinking about today [01:11:32] so there was a question about video, and a lot of these ideas also extend pretty directly to video, but now you just have more data [01:11:40] right, so Flamingo already had a bunch of different images in it, you can do Flamingo over videos, but probably a lot of the images are pretty useless for what you're trying to do with this video 
[01:11:51] model, right, they're too similar, it doesn't really add all that much information, so you want to subsample the frames so that you get the most useful information out of your video [01:12:00] and there's a bunch of approaches that take the keyframes, and then you just do a standard joint vision-and-language Transformer encoder on top of that [01:12:11] so this is hopefully by now a very familiar recipe, right [01:12:16] and Merlot is a nice architecture that does this, and then they came up with Merlot Reserve, kind of a silly name, where they also added audio to this model, so this is now a tri-modal model [01:12:28] and so we're going towards this foundation model that can consume all of these different modalities all in one go, and that's really a clear trend in the field.
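The keyframe subsampling described here can be sketched greedily: keep a frame only when it differs enough from the last kept frame. The mean-pixel-difference criterion, threshold, and toy video are illustrative assumptions, not taken from Merlot:

```python
import numpy as np

def select_keyframes(frames, threshold=10.0):
    """Greedy keyframe selection over an array of shape (num_frames, H, W).

    A frame is kept only if its mean absolute pixel difference from the
    last kept frame exceeds the threshold (an illustrative choice).
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float)
                      - frames[kept[-1]].astype(float)).mean()
        if diff > threshold:
            kept.append(i)
    return kept

# Toy video: five near-identical frames, then a scene change at frame 5.
video = np.zeros((10, 4, 4))
video[5:] = 100.0
print(select_keyframes(video))  # -> [0, 5]
```

Only the selected frames would then be fed into the joint vision-and-language encoder.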
Another very interesting direction, I think, is one where the field was very excited for a while [01:12:48] but I think it's sort of gone now, because it's too difficult to create lots of high-quality data in this setting [01:12:56] what you can do is have simulated environments [01:12:59] so this is a paper from DeepMind from 2017 where they had this agent walk around in a maze, and it could follow natural language instructions [01:13:06] they could also generalize to things like daxes and blickets and different sorts of groundings and assignments that you could do in that environment [01:13:12] so this is a super interesting direction in the long term, because this is how humans learn language, right [01:13:17] we walk around in the world, we interact with our environments, we have all of these different perceptual observations, we synthesize them in our brain, we manipulate objects, we change our own viewpoint, and 
that's how we learn everything we know about the world [01:13:32] and so our language is very intricately connected to that world and how we observe it [01:13:38] so I think that might make a comeback at some point in the future [01:13:43] you can also do other stuff, especially with this kind of conditioning on text that we're seeing a lot of [01:13:51] so you know, DALL-E 2 and Stable Diffusion and all of these different things, and the original we talked about at the beginning [01:13:56] you can do the same thing, but now you're generating 3D point clouds [01:14:01] right, so this is a 3D corgi [01:14:05] and this prompt can probably become much more complex over time, you can do sort of AutoCAD design and just say, give me a house, and it's just going to design the whole house for you [01:14:17] so you can just tweak the prompt and things like that, that's 
all coming [01:14:21] or even already here in many cases [01:14:24] so the final modality I briefly wanted to talk about is olfactory embeddings [01:14:33] and olfaction means smell, if you didn't know [01:14:37] it turns out my PhD thesis was about grounding semantics in different perceptual modalities [01:14:47] a lot of my work started in vision, and then audio is sort of the obvious next one, right [01:14:52] so you can learn the meaning of violin, what a violin looks like and what it is and what it sounds like, and that's going to give you a richer representation [01:15:02] but for a lot of these words, what's actually very primitive to their meaning is what they smell like [01:15:06] because in our brains that's really one of the core areas, one of the oldest areas in your brain.
So what you can try to do, if you want to complete all of your perceptual modalities, is try to build olfactory embeddings [01:15:20] so it was kind of a joke paper I did, but the funny thing is it actually worked [01:15:28] so there's this Sigma-Aldrich Fine Flavors and Fragrances catalog where you can look up words like melon and pineapple, and it's going to give you all of the chemical compounds that produce this smell or taste [01:15:44] so if you do that, then you can count the occurrences, and then you can do SVD or something like that on it, to get it to be a bit more of a real embedding model [01:15:54] so now you get smell embeddings, smell vectors, and then you can compute similarity judgments between these smells [01:16:04] so it turns out apple smells like pear, and you know, chocolate and cocoa and sweet and coffee are sort of related [01:16:12] right, so you get these 
clusters of different smells just based off of their chemical compounds [01:16:15] so this bag-of-chemical-compounds model gives you a very rich representation [01:16:20] and so you look at all of the words that are concrete enough to have a smell [01:16:28] right, so if you have a word like democracy in there, that doesn't really smell like anything [01:16:31] so you ignore democracy and you just focus on the things that smell, or that smell good, I guess [01:16:43] and then the really interesting thing to me is that this is much more correlated with human similarity judgments than the linguistic vectors we had at the time [01:16:54] right, so for a word like apple you can just get a word vector like you learned in your first lecture [01:17:01] and you can do skip-gram and things like that, but that thing is not going to be as correlated with human similarity judgments as this bag-of-chemical-compounds model.
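The count-then-SVD recipe can be sketched as follows; the word list, compound names, and counts are invented for illustration (the real counts would come from the catalog entries):

```python
import numpy as np

# Toy word-by-compound count matrix. The compound names are made up;
# in the paper these would be compounds listed in the catalog.
words = ["apple", "pear", "coffee", "chocolate"]
compounds = ["ester_A", "ester_B", "pyrazine_A", "pyrazine_B"]
counts = np.array([
    [3.0, 2.0, 0.0, 0.0],   # apple: fruity esters
    [2.0, 3.0, 0.0, 0.0],   # pear: fruity esters
    [0.0, 0.0, 3.0, 2.0],   # coffee: roasty pyrazines
    [0.0, 1.0, 2.0, 3.0],   # chocolate: mostly roasty, a bit fruity
])

# Truncated SVD turns the raw counts into dense "smell vectors".
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
embeddings = U[:, :2] * s[:2]   # keep the top two components

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple = embeddings[words.index("apple")]
pear = embeddings[words.index("pear")]
coffee = embeddings[words.index("coffee")]
print(cosine(apple, pear) > cosine(apple, coffee))  # apple smells more like pear
```

The cosine similarities between these vectors are what get compared against human similarity judgments.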
So that's pretty interesting [01:17:14] right, so even something like smell, where maybe we think this doesn't really matter: if you really want to understand how humans understand language, then maybe you want to include this in your foundation model too [01:17:27] but I would start with other modalities [01:17:31] all right [01:17:33] okay, yeah, sorry, so where to next [01:17:36] I think I've already said most of this actually [01:17:38] so, one foundation model is going to rule them all [01:17:43] I mean, there will be many of these, but a lot of them are going to have very similar traits, I think [01:17:49] we're going to be looking at scaling laws and trying to understand really what is the relationship between the different modalities, which one do we want more of, that sort of stuff [01:17:58] we're going to have retrieval 
augmentation [01:18:01] this thing is going to be really huge, if you've heard of RAG, or if you haven't, you should look it up [01:18:06] so all of these parts of these models can also be multimodal [01:18:08] we need way better evaluation and better measurements, we already talked about that too [01:18:14] and that's all I have, thank you
[Applause]

================================================================================
LECTURE 020
================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
Source: https://www.youtube.com/watch?v=cd3pRpEtjLs
---
Transcript

[00:00:05] today I'm delighted to introduce our final guest speaker [00:00:10] Been Kim [00:00:13] Been Kim is a staff research scientist at Google Brain [00:00:16] if you're really into googleology you know those funny words at the beginning, like staff, which sort of says how senior you are [00:00:22] and that means that Been's a good research scientist [00:00:27] so I discovered at lunch today that Been started out studying mechanical engineering at Seoul National University, but she 
moved [00:00:37] on to, I don't know if it's better things or not, but she moved on to computer science and did a PhD at MIT [00:00:44] and there she started working on the interpretability and explainability of machine learning models [00:00:52] I think she'll be talking about some different parts of her work, but a theme that she's had in some of her recent work, which I find especially appealing as an NLP person, is the idea that we should be using higher-level human-interpretable languages for communication between people and machines [00:01:14] so welcome, Been, looking forward to your talk, go for it [00:01:18] thank you, thank you [00:01:25] thanks for having me, it's an honor to be here [00:01:27] it's the rainiest Stanford I've ever seen [00:01:31] I got here last night, but I live in Seattle, so this is pretty common [00:01:37] I still was able to see the blue sky today, I was like, this works, I 
really like it here [00:01:42] so today I'm going to share some of my dreams, chasing my dreams to communicate with machines [00:01:48] so if you're in this class you probably agree, you don't have to, that large language models and generative models are pretty cool, they're impressive [00:01:59] but you may also agree that they're a little bit frightening [00:02:02] not just because they're impressive, they're doing a really good job, but also we're not quite sure where we're going with this technology [00:02:10] in 10 years, will we look back and say that technology was net positive, or will we say, ah, that was catastrophic, we didn't know that would happen [00:02:22] and ultimately what I would like to do, or maybe hopefully what we all want to do, is to have this technology benefit us humans [00:02:31] I know in 10 years' time, or maybe 20 years or earlier, he's gonna ask me, he's gonna be like, Mom, did you work on this AI stuff, I 
watched some of your talks [00:02:43] and did you know how this would profoundly change our lives, and what did you do about that [00:02:51] and I have to answer that question, and I really hope that I have some good things to say to him [00:02:59] so my initial thought, or current thought, is that if we want our ultimate goal to be to benefit humanity, why not directly optimize for it, why wait [00:03:13] so how can we benefit, there's lots of different ways we can benefit, but one way is to treat this like a colleague, you know, a colleague who is really good at something [00:03:26] it's not perfect, but it's good at something, enough that you want to learn something from them [00:03:32] one difference though, in this case, is that this colleague is kind of weird [00:03:36] this colleague might have very different values, it might have very different experiences in the world, it may not care about surviving as much as we do, maybe 
[00:03:48] mortality isn't really a thing for this colleague [00:03:52] so you have to navigate that in our conversation [00:03:57] so what do you do when you first meet somebody, someone so different, what do you do [00:04:01] you try to have a conversation [00:04:03] to figure out, how do you do what you do, how are you solving the decades-old protein-folding problem [00:04:11] how are you beating the world Go champion so easily, it seems [00:04:17] are you using the same language, the science knowledge, the language that we use, atoms, molecules, or do you think about the world in a very different way [00:04:27] and more importantly, how can we work together [00:04:32] there's one area I really want to talk about, and it's AlphaGo [00:04:36] so AlphaGo beat the world Go champion Lee Sedol in 2016.
Lee Sedol is from South Korea, I'm from South Korea [00:04:43] I watched every single match, it was such a big deal in South Korea, and worldwide I hope [00:04:48] and in one of the matches AlphaGo played this move called move 37 [00:04:54] how many people watched the AlphaGo matches, and how many people remember move 37 [00:05:00] yeah, a few people, right [00:05:03] and I remember the nine-dan commentator, who had been talking a lot throughout the matches, suddenly got really quiet [00:05:09] and he said, hmm, that's a very strange move [00:05:14] and I knew then that something really interesting had just happened, in my eyes [00:05:20] that this was gonna change something, that AlphaGo had made something we're gonna remember forever [00:05:25] and sure enough, this move turned around the game for AlphaGo, leading AlphaGo to win one of the matches [00:05:33] so Go players today continue to analyze this move and still discuss it, people talk 
about how this is not a move a human would fathom [00:05:42] so the question is, how did AlphaGo know this was a good move [00:05:49] my dream is to learn something new by communicating with machines, having a conversation [00:05:56] such that humanity will gain some new angle on our important problems, like medicine and science and many others [00:06:04] and this is not just about discovering new things [00:06:07] if you think about reward hacking, you have to have a meaningful conversation with somebody to truly figure out what their true goal is [00:06:18] so in a way, solving this problem is a superset of solving AI safety too [00:06:26] so how do we have this conversation [00:06:28] conversation assumes that we share some common vocabulary, that we exchange meaning and ultimately knowledge, and naturally representation plays a key role in this conversation [00:06:42] we can visualize this: on the left is the representational space of 
what humans know, and on the right what machines know [00:06:50] here in the left circle there will be something like "this dog is fluffy", and you know what that means because we all share a somewhat similar vocabulary [00:06:59] but on the right we have something like move 37, which we humans don't yet have a representation for [00:07:10] so how do we have this conversation: our representation spaces need to overlap, and the more overlap we have, the better conversation we're going to have [00:07:17] humans are all good at learning new things, like here, everyone is learning something new [00:07:24] so we can expand what we know by learning new concepts and vocabularies [00:07:30] and doing so, I believe, will help us build machines that can better align with our values and our goals [00:07:39] so this is a talk that I gave, if you're curious about some of the work we're doing towards this direction I highly recommend it, it's a YouTube video, an ICLR keynote, half an hour 
you can fast-forward it [00:07:50] but today I'm going to talk more about my hopes and dreams [00:07:54] and hopefully, at the end of the day, your hopes and dreams too [00:07:59] so first of all, I'm just gonna set the expectation: at the end of this talk we still won't know how move 37 was made, okay, sorry [00:08:09] that's going to take a while [00:08:12] in fact, the first part of this talk is going to be about how we have moved backwards in making this progress [00:08:23] and we are still at a very, very small portion of our entire journey towards understanding move 37 [00:08:31] and of course this journey won't be a singular path, there will be lots of different branches coming in, core ideas like the Transformer helped many domains, and it will be similar here [00:08:43] so in part two I'm going to talk about some of our work on understanding emerging behaviors in reinforcement learning 
[00:08:51] and all the techniques that I'm going to talk about are, in principle, applicable to NLP.
[00:08:59] So coming back to our hopes and dreams, move 37. Let's first think about how we might realize this dream. Taking a step back, we have to ask: do we even have tools to estimate what machines know? There have been many developments in machine learning over the last decade to build tools to understand and estimate this purple circle. So, is that estimate accurate? Unfortunately, a lot of recent research has shown that there's a huge gap between what machines actually know and what we think the machines know. And identifying and bridging this gap is important, because these tools will form the basis for understanding that move 37.
[00:09:50] So what are these tools? How many people are familiar with saliency maps? A lot, but for those who aren't, I'll explain what it is. A saliency map is one of the popular interpretability methods. For simplicity, let's say we're on ImageNet. You have an image like this, of a bird, and the explanation is going to take the form of the same image, but where each pixel is associated with a number that is supposed to imply some importance of that pixel for the prediction on this image. And one definition of that importance is that the number indicates what the function looks like around this pixel. So for example, if I have a pixel x_j, maybe around x_j the function moves up like the yellow curve, or the function is flat, or the function goes down like the green curve. And so if it's flat, like the blue curve or the red curve, maybe that feature is irrelevant to predicting "bird."
If it's going up, then maybe it's more important, because as the value of x increases, the function value, the prediction value here, goes up.
[00:11:01] So let's think about a few reasons why this gap might exist. These are a few; this isn't exhaustive, and they overlap a little bit, but they're helpful for us to think about. Maybe our assumptions are wrong: this alien, again, these machines that we train, works in a perhaps completely different representational space, with very different experiences of the world. So assuming that it sees the world just like we do, like with the gestalt phenomenon, where there are a few dots and humans have a tendency to connect them, maybe machines have that too, maybe not. So maybe our assumptions about these machines were wrong. Maybe our expectations are mismatched: we thought it was doing X, but it was actually doing Y. Or maybe it's beyond us: maybe it's showing something superhuman that humans just can't understand.
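The "importance as local function shape" idea from a moment ago can be sketched in a few lines. This is a toy illustration of that definition, not code from the talk: the model f, the inputs, and the finite-difference approximation are all made up here.

```python
# Toy sketch of gradient-style saliency, assuming importance = local slope
# of the model's output with respect to each input feature.

def saliency(f, x, eps=1e-6):
    """Finite-difference estimate of df/dx_i at the point x: one number per feature."""
    base = f(x)
    scores = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps  # nudge one feature, hold the rest fixed
        scores.append((f(bumped) - base) / eps)
    return scores

# Made-up "bird score": feature 0 matters a lot, feature 2 is a flat direction.
def f(x):
    return 3.0 * x[0] + 0.5 * x[1]

print(saliency(f, [1.0, 2.0, 3.0]))  # ≈ [3.0, 0.5, 0.0]
```

A large score means the function rises steeply around that feature (the "yellow curve" case); a score near zero is the flat, "irrelevant" case.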
[00:11:55] I'm going to take a deeper dive into some of this, our own work; this is more recent work. So, coming back to the earlier story about saliency maps, we're going to play with some of these methods. Now, in 2018 we stumbled upon a phenomenon that was quite shocking. We were actually trying to do something different, but we were testing something, and we realized that a trained network and an untrained network have very similar saliency maps. In other words, a random prediction and a meaningful prediction were giving me the same explanation. So that was puzzling; we thought we had a bug, but it turned out we didn't. They actually are indistinguishable, qualitatively and quantitatively. So that was shocking.
[00:12:47] But then we wondered: maybe this is a one-off case; maybe it still works somehow in practice. So we tested that in a follow-up paper. Okay, what if the model had an error, one of these errors: maybe it has a labeling error, maybe it has a spurious correlation, maybe it hits out-of-distribution data at test time. If we intentionally insert these bugs, can the explanation tell us that there's something wrong with the model? It turns out that that's also not quite true. You might think, oh, maybe at least for spurious correlation; another follow-up work showed that this is also not the case. So we were disappointed. But then we still said, you know, there's no theoretical proof of this; maybe this is again a lab-setting test, where we had grad students test the system; maybe there's still some hope.
[00:13:48] So this is more recent work where we theoretically prove that some of these very popular methods
cannot do better than random. So I'm going to talk a little bit about that. I'm missing one person, oh, I'm missing Pang Wei in the author list, I just realized; this is also work with Pang Wei.
[00:14:07] So let's first talk about our expectation. What is our expectation about this tool? Now, the original papers that developed these methods, IG and SHAP, talk about how IG can be used for accounting for the contributions of each feature. So what that means is that when the tool assigns zero attribution to a pixel, we're going to say, okay, that pixel is unused by the function, and that means that f will be insensitive if I perturb this x.
[00:14:40] And in fact this is how it's been used in practice. This is a paper published in Nature; they used SHAP to figure out the eligibility criteria in a medical trial.
[00:14:53] What we show in this work is that none of these inferences, which seemed pretty
natural, were true. In fact, just because a popular attribution method tells you the attribution is x, you cannot conclude anything about the actual model behavior.
[00:15:12] So how does that work? How many people here do theory proofs? A few, great. I'll tell you, I learned about theory proving from this project as well. The way that we pursued this particular work was to first think about the problem, and then formulate it as some other problem that we know how to solve. In this case we formulated it as hypothesis testing, because once you formulate it as hypothesis testing, yes or no, there are lots of tools in statistics you can use to prove things.
[00:15:48] So what is the hypothesis? The hypothesis is: I'm a user, I got an attribution value from one of these tools, and I have a mental model of, ah, this feature is important, or maybe not important.
Then the hypothesis is whether that's true or not. And what we showed is that, given whatever hypothesis you may have, you cannot do better than random guessing at validating or invalidating it. And that means, yes, sometimes it's right, but you don't do hypothesis testing if you cannot validate yes or no. You just don't, because what's the point of doing it if it's no better than random guessing?
[00:16:32] And the result is: yes. This graph is just a visualization of our results. If you plot true negative rate against true positive rate, the diagonal line is random guessing; one corner is the worst method, the other corner is the best method. The methods that we know, SHAP and IG, all fall on or under this line of random guessing. That's bad news.
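To make the random-guessing line concrete, here is a small sketch (my construction for illustration, not the paper's experiment): a guesser that flags a feature as "important" with some fixed probability p always lands near the line TPR + TNR = 1, whatever p is. The claim is that popular attribution methods do no better than points on that line.

```python
import random

def rates(guess_p, labels, seed=0):
    """True-positive and true-negative rates of a coin-flip 'attribution method'
    that calls a feature important with probability guess_p."""
    rng = random.Random(seed)
    tp = fn = tn = fp = 0
    for important in labels:
        guess = rng.random() < guess_p
        if important and guess:
            tp += 1
        elif important:
            fn += 1
        elif guess:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Made-up ground truth: every third feature is actually important.
labels = [i % 3 == 0 for i in range(3000)]
tpr, tnr = rates(0.7, labels)
print(tpr + tnr)  # close to 1.0, for any choice of guess_p
```

Changing `guess_p` just slides the guesser along the line, trading TPR for TNR; it never gets above it.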
[00:16:59] But maybe this still works in practice for some reason; maybe there were some assumptions that we had that weren't quite met in practice. So does this phenomenon hold in practice? The answer is yes. We now have more experiments and bigger models, but here we tested two concrete end tasks that people care about in interpretability and use these methods for: recourse and spurious correlation. Recourse, for those who are not familiar: you're getting a loan, and you wonder whether, if I were older, I would have a higher chance of getting the loan. So you tweak this one feature and see if your value goes up or down. A very reasonable task that people do all the time, with pretty significant implications socially.
[00:17:45] For these two concrete end tasks, both of them boil down to the hypothesis testing framework that I talked about, and they're all around the random guessing line, or worse than random guessing.
[00:18:01] So you might say, oh no, this is not good, a lot of people are using these tools, what do we do? We have a very simple idea about this.
[00:18:10] So, people like developing complex tools, and I really hope you're not one of those people, because a lot of times simple methods work: Occam's razor. But also, simple methods are elegant. There's a reason, perhaps, why a lot of times they work: they're simple enough that you can understand them, they make sense. So let's try that idea here. Again, your goal is to estimate a function's shape. What do you do? Well, the simplest thing you can do is: you have a point of interest, you sample around that point, and you evaluate the function around that point. If it goes up, maybe the function is going up; if it goes down, maybe the function is coming down. That's the simplest way; you can kind of brute-force it.
[00:18:58] But then the question is, how many samples do we need? So here, this is the equation
that shows how you lift yourself above that random-guessing line, by adding that additional term. It's proportional to the number of samples: the more samples you have, the better estimation you have, which makes sense. And the difference in output: how much resolution do you care about? Do you care about a slope of 0.1 versus 0.2, or do you only care about zero slope versus slope one? That's the resolution you care about. And the number of features, of course. So if you worry about making some conclusion based on function shape: sample. Easy.
[00:19:42] So, can we infer model behavior using these popular methods? The answer is no, and this holds in both theory and practice. We're currently working on even bigger models to show, again, empirical evidence that yes, it just really doesn't work. Please, you know, think twice and three times before using these methods.
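The sample-around-the-point recipe above can be sketched like this. This is a toy version, not the paper's estimator; the wiggly test function and the sample count are made up. Draw random perturbations near the point of interest, evaluate f, and fit a local slope; more samples (and a coarser resolution requirement) make the estimate easier.

```python
import math
import random

def local_slope(f, x, radius=0.1, n_samples=500, seed=0):
    """Estimate the slope of f near x: sample offsets d, then fit a
    least-squares line through the points (d, f(x + d) - f(x))."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        d = rng.uniform(-radius, radius)
        num += (f(x + d) - f(x)) * d
        den += d * d
    return num / den

# Made-up test function: global slope 2 with small local wiggles.
f = lambda x: 2.0 * x + 0.01 * math.sin(40.0 * x)
print(local_slope(f, 1.0))  # close to 2.0; more samples -> steadier estimate
```

The brute-force flavor is the point: no gradients or attribution machinery, just evaluate the function where you actually care about it.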
[00:20:06] There's also a model-dependent sample complexity: if your function is kind of crazy, of course you're going to need more samples. So what is the definition, how do we characterize those functions?
[00:20:17] And finally, we haven't quite given up yet, because these methods have pretty good roots in economics, Shapley values and all that. So maybe there's a much narrower condition under which these methods work, and we believe such a condition does exist; we just have to figure out when. Once we figure out what that condition is, then given a function I can test it and say: yes, I can use SHAP here, yes, I can use IG here, or no, I can't. That would still be very useful. So, ongoing work.
[00:20:51] Before I go to the next one, any questions?
[00:20:54] Yes? Do the findings you have about these models only apply to computer vision models, or do they apply to any model that has a function?
[00:21:08] Yeah, it's actually a very simple proof; the simplest proof shows that this holds for any function. Any other questions?
[00:21:20] This relates a lot to you: it seems like for the last couple of years there have been at least dozens, maybe hundreds, of papers written using Shapley values. Would you guess that most of that work is invalid, or that a lot of it might be okay because the condition where it's all right happened to hold?
[00:21:51] So, two answers to that question. My hypothesis testing result shows that it's random, right? So maybe in the optimistic case, the optimistic case, 50% of those papers got it right by chance.
[00:22:06] And on the other side, on the second note: even if maybe SHAP wasn't perfect, maybe it was kind of wrong, but even so, if it helped the human at the end task, whatever that
might be, helped doctors be more efficient, identified bugs and whatnot, and if they did the validation correctly, with the right controlled testing setup, then I think it's good: you figured out somehow how to make these noisy tools work together with a human in the loop, maybe, and that's also good. And I personally really like the SHAP paper, and I'm a good friend of Scott's and I love all his work. It's just that I think we need to narrow down our expectations so that our expectations are better aligned.
[00:22:49] All right, I'm going to talk about another work of a kind of similar flavor, now in NLP. So this is one of those papers, just like many other papers that we ended up writing, one of those serendipity papers. So initially Peter came as an intern, and we thought we were going to locate ethical knowledge in these large language models, and then
maybe we were going to edit them to make them a little more ethical. So that was the goal. And then we thought, oh, the ROME paper from David Bau, and I also love David's work, let's use that. So that's the start of this work. But then we started digging in and implementing ROME, and things didn't quite line up. So we did sanity-check experiment after sanity check, and we ended up writing a completely different paper, which I'm about to talk to you about.
[00:23:39] So, this paper, ROME, for those who are not familiar, and I'm going into a little more detail in a bit, is about editing a model. You first locate a piece of knowledge in a model, like "the Space Needle is in Seattle," that's a piece of factual knowledge; you locate it, and you edit it. Because you can locate it, you can mess with it to edit that fact. That's the whole promise of it; in fact, that's a lot of
times how localization and editing methods were motivated in the literature. But what we show is that this assumption is actually not true. And to be quite honest with you, I still don't quite get why these are not related, and I'll talk more about this, because this is a big question to us; this is pretty active work.
[00:24:29] So, a substantial fraction of factual knowledge is stored outside of the layers that are identified as having the knowledge. And you will see this in a little more detail in a bit. In fact, the correlation between the location where the facts are located and how well you edit if we edit that location is completely, uncorrelated; they have nothing to do with each other.
[00:25:00] So we thought, well, maybe it's a problem with the definition of editing. What we mean by editing
can mean a lot of different things, so let's think about different ways to edit a thing. So we tried a bunch of things, with little success: we couldn't find an editing definition that actually relates really well with localization methods, in particular with ROME.
[00:25:26] So let's talk a little bit about ROME, how ROME works, super briefly; there are a lot of details missing on this slide, but you'll roughly get the idea. So ROME is Meng et al., 2022. They have what's called the causal tracing algorithm, and the way it works is that you're going to run a model on a particular dataset, the CounterFact dataset, which has tuples of subject, relation, and object: "The Space Needle is located in Seattle." So you're going to have a clean run of "The Space Needle is in Seattle" one time, and you store every single module's activations. And then in the second run, which they
then in the second run which they call corrupted run you're going to add [00:26:10] call corrupted run you're going to add noise in those Space Needle is or or the [00:26:14] noise in those Space Needle is or or the space [00:26:15] space then then you're going to intervene at [00:26:19] then then you're going to intervene at every single one of those modules [00:26:21] every single one of those modules as if from by copying this module to the [00:26:25] as if from by copying this module to the corrupted run so as if that particular [00:26:27] corrupted run so as if that particular model was never [00:26:29] model was never interrupted never a noise was never [00:26:32] interrupted never a noise was never added to that module [00:26:34] added to that module so it's a typical like intervention case [00:26:36] so it's a typical like intervention case where you pretend everything else being [00:26:39] where you pretend everything else being equal if I change just this one module [00:26:43] equal if I change just this one module what is the probability of having the [00:26:45] what is the probability of having the right answer so in this case probability [00:26:47] right answer so in this case probability of the right answer Seattle given that I [00:26:50] of the right answer Seattle given that I know it's the model and I intervened on [00:26:53] know it's the model and I intervened on it [00:26:54] it so at the end of the day you'll find [00:26:57] so at the end of the day you'll find graph like that where each layer and [00:27:00] graph like that where each layer and each token has a score How likely it is [00:27:03] each token has a score How likely it is if I intervene on that token in that [00:27:06] if I intervene on that token in that layer how How likely is it that I will [00:27:09] layer how How likely is it that I will recover the right answer because if I [00:27:11] recover the right answer because if I recover right answer that's the model [00:27:13] 
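To make the loop concrete, here is a toy sketch of causal tracing in the spirit just described: a stand-in "model" (a stack of random tanh layers, nothing like GPT-J), a clean run whose activations are cached, a corrupted run, and then restoring one clean activation at a time. All names, dimensions, and the model itself are invented for illustration; the real method patches per (token, layer) and reads off the model's actual answer probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the transformer: a stack of tanh layers plus a linear
# readout whose output we treat as the score of the right answer ("Seattle").
W = [rng.normal(size=(8, 8)) for _ in range(4)]   # four "layers"
readout = rng.normal(size=8)

def forward(x, patch_layer=None, patch_value=None):
    """Run the stack; optionally overwrite one layer's output (the patch)."""
    h, states = x, []
    for i, w in enumerate(W):
        h = np.tanh(w @ h)
        if i == patch_layer:
            h = patch_value            # restore the cached clean activation
        states.append(h)
    return states, float(readout @ h)

clean_x = rng.normal(size=8)                       # "The Space Needle is in ..."
clean_states, clean_score = forward(clean_x)       # clean run: cache everything

noisy_x = clean_x + rng.normal(scale=3.0, size=8)  # corrupted run: noised subject
_, corrupt_score = forward(noisy_x)

# Causal tracing: rerun the corrupted input, restoring one clean activation
# at a time; the layer whose restoration recovers the answer "stores" the fact.
effects = [forward(noisy_x, patch_layer=i, patch_value=clean_states[i])[1]
           - corrupt_score
           for i in range(len(W))]
best_layer = int(np.argmax(effects))
```

With a real model you would register forward hooks to cache and overwrite activations rather than threading a patch argument through a toy stack like this.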
recover right answer that's the model that's the module that's stored on [00:27:14] that's the module that's stored on knowledge [00:27:16] knowledge really reasonable algorithm I couldn't [00:27:18] really reasonable algorithm I couldn't find technical flow in this algorithm I [00:27:20] find technical flow in this algorithm I quite like it actually [00:27:24] so but but when we start looking at this [00:27:27] so but but when we start looking at this using the same model that they use GPT [00:27:29] using the same model that they use GPT gptj we realized that a lot of these [00:27:34] gptj we realized that a lot of these facts so so Rome uses just layer 6 to [00:27:37] facts so so Rome uses just layer 6 to edit because that was the supposedly the [00:27:39] edit because that was the supposedly the best layer across this data set to add [00:27:42] best layer across this data set to add in most of the factual knowledge is [00:27:44] in most of the factual knowledge is stored in layer 6 and they showed uh [00:27:46] stored in layer 6 and they showed uh editing success and whatnot [00:27:49] editing success and whatnot but we realized the truth looks like the [00:27:51] but we realized the truth looks like the graph on the right so the red line is [00:27:54] graph on the right so the red line is the layer 6 their extension paper called [00:27:56] the layer 6 their extension paper called memet and it's multiple layers that's [00:27:59] memet and it's multiple layers that's the Blue Line blue region [00:28:01] the Blue Line blue region the black bars are histogram of where [00:28:04] the black bars are histogram of where the knowledge was actually peaked if you [00:28:06] the knowledge was actually peaked if you test every single layer and as you can [00:28:09] test every single layer and as you can see not a lot of facts fall into that [00:28:11] see not a lot of facts fall into that region so in fact every single fact has [00:28:13] region so in fact every single 
fact has like different regions that where it [00:28:15] like different regions that where it peaked so layer six for a lot of facts [00:28:18] peaked so layer six for a lot of facts weren't the best layer [00:28:20] weren't the best layer what the editing really worked it really [00:28:22] what the editing really worked it really works and we did we were able to [00:28:24] works and we did we were able to duplicate that results so we thought [00:28:26] duplicate that results so we thought what do we do to find this ethics [00:28:29] what do we do to find this ethics ethical knowledge how do we find the [00:28:31] ethical knowledge how do we find the best layer to edit so that's where we [00:28:33] best layer to edit so that's where we started but then we thought you know [00:28:36] started but then we thought you know what take a step back we're going to [00:28:38] what take a step back we're going to actually do alternative check first to [00:28:40] actually do alternative check first to make sure that tracing effect the the [00:28:43] make sure that tracing effect the the tracing effect is the localization [00:28:46] tracing effect is the localization rip implies better editing results and [00:28:49] rip implies better editing results and that's when everything started to [00:28:51] that's when everything started to falling apart [00:28:53] falling apart so let's define some metrics first the [00:28:56] so let's define some metrics first the edit success this is the rewrite score [00:28:59] edit success this is the rewrite score same score as roam paper used that's [00:29:01] same score as roam paper used that's what we use and the tracing effect this [00:29:04] what we use and the tracing effect this is localization [00:29:05] is localization is probably you can beat the due to the [00:29:08] is probably you can beat the due to the slide [00:29:09] slide so when we plotted the relation between [00:29:12] so when we plotted the relation between tracing effect 
[00:29:15] The red line is what the editing method implies: perfect correlation. That was our assumption, that they would be perfectly correlated, which is why we do localization to begin with. The actual line was the yellow one. It's close to zero; it's actually negative in this particular dataset. That's not even uncorrelated, it's anti-correlated.
[00:29:39] And we didn't stop there; we were so puzzled. We did this for every single layer, and we computed the R-squared value: how much does the choice of layer, versus the localization, the tracing effect, explain the variance of successful edits? If you're not familiar with R-squared, think about it as the importance of a factor. And it turns out that layer takes 0.94, and the tracing effect is 0.016. So we were really puzzled; we were scratching our heads: why is this true?
[00:30:15] But it was true across layers. We tried all sorts of different things: we tried a different model, we tried a different dataset, and it was all roughly the same. At this point we contacted David, and we started talking about it, and they acknowledged that this is a phenomenon that exists.
[00:30:35] There was a question: apart from the layer, the other way in which localization can happen is the token; are you looking at the correct token, the other axis in this graph? Could localization at least help you find the correct subject token? Yeah, yeah; looking at any of the subject tokens sort of works. But layer, layer is the biggest thing; that's the only thing you should care about if you care about editing.
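The variance decomposition just described (choice of layer versus tracing effect) can be sketched on synthetic data. The numbers below are fabricated so that layer drives edit success and the tracing effect is independent noise; only the R-squared bookkeeping mirrors the analysis, not the actual measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated edit records: success depends only on which layer was edited,
# while the causal-tracing effect is unrelated noise.
n = 2000
layer = rng.integers(0, 28, size=n)                  # layer chosen for the edit
tracing = rng.uniform(0.0, 1.0, size=n)              # tracing effect at that layer
success = -0.01 * (layer - 6) ** 2 + 0.02 * rng.normal(size=n)  # rewrite score

def r_squared(features, y):
    """OLS R^2 with an intercept: fraction of variance explained."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_layer = r_squared(np.eye(28)[layer], success)     # one-hot: full layer effect
r2_tracing = r_squared(tracing.reshape(-1, 1), success)
```

On this synthetic data the layer R-squared comes out near 1 and the tracing R-squared near 0, the same qualitative picture as the 0.94 versus 0.016 split in the talk.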
[00:31:06] In fact, don't worry about localization at all; it's extra wasted carbon, a climate effect, yeah. So that was our conclusion.
[00:31:16] But then we thought, you know, maybe the particular definition of edit that they used in ROME was the issue. Maybe there exists a definition of editing that correlates a lot better with localization, because there must be; I'm still puzzled why this isn't correlated. So we tried a bunch of different definitions of edits: you might inject an error, you might reverse the tracing, you might want to erase the fact, you might want to amplify the fact. All these things; maybe one of these would work. It didn't.
[00:31:53] So the graph you're seeing down here is the R-squared value for four different methods, and this wasn't just the case for ROME and MEMIT; it was also the case for fine-tuning methods. You want to look at the difference between the blue and orange bars; that represents how much the tracing effect influenced the R-squared value. As you can see, it's negligible; they're all the same. You might feel that fact forcing, the last one, has a little bit of hope, but still, compared to the impact of the choice of layer, it's negligible.
[00:32:28] So at this point we said, okay, well, we can't locate the factual knowledge in this project; we're going to have to switch directions, and we ended up doing a lot more in-depth analysis on this.
[00:32:44] So in summary: does localization help editing? No. The relationship is actually zero for this particular editing method, which from what I know is pretty state of the art, on the CounterFact dataset. It's not true. Are there any other editing methods that correlate better? No.
But if somebody can answer this question for me, that will be very satisfying, because I feel like there should still be something there that we're missing.
[00:33:12] But causal tracing, I think what it does is reveal the factual information while the transformer is passing it forward; I think it represents where the fact is during that pass. What we found here is that this has nothing to do with editing success. Those two things are different, and we have to resolve that somehow.
[00:33:35] But a lot of the insights they found in their paper are still useful, like that the early-to-mid-layer MLP representations at the last subject token represent the factual information; that's something we didn't know before. But it is important not to validate localization methods using editing methods, now we know, and maybe not to motivate editing methods via localization. Those are the two things we now know we shouldn't do, because we couldn't find a relationship.
[00:34:04] Any questions on this one before I move on to the next one?
[00:34:15] You're not shocked by this? I am shocked by this. I'm still so puzzled; there should be something, I don't know.
[00:34:26] All right. So in summary of this first part, we talked about why the gap might exist between what machines know and what we think machines know. There are three ideas: maybe our assumptions are wrong, maybe our expectations are wrong, maybe it's beyond us. There's a good quote that says good artists steal; I think good researchers doubt. We have to be really suspicious of everything that we do, and that's maybe the biggest lesson I've learned over many years: once you like your results too much, that's a bad sign. Go home, have a beer, go to sleep, and the next day you come back and put your
paper on your desk and think, okay, now I'm going to review this paper. How do I criticize it? What do I not like about this paper? That's one way to look at it: criticize your own research, and that will improve your thinking a lot.
[00:35:26] So let's bring our attention back to our hopes and dreams; it keeps coming back. Here I came to realize that maybe, instead of just building tools to understand, perhaps we need to do some groundwork. What do I mean? Well, this alien that we've been dealing with, trying to generate explanations for, seems to be a different kind. So maybe we should study them as if they're a new species in the wild.
[00:35:54] So what do you do when you observe a new species in the wild? You have a couple of ways, but one of them is an observational study: you saw some species in the wild, far away, and first you just kind of watch them, and see what they're like, what their habitat is, what their values are and whatnot. The second way, you can actually intervene and do a controlled study. We did something like this with a reinforcement learning setup.
[00:36:25] I'm going to talk about these two papers. First paper: emergent behaviors in multi-agent systems have been so cool. Who saw this hide-and-seek video by OpenAI? Yeah, it's so cool. If you haven't seen it, just Google it and watch it; it's so fascinating. I'm only covering the tip of the iceberg here, but at the end of this hide-and-seek episode, at some point the agents discover a bug in the physics system and start anti-gravity flying in the air, shooting past hiders everywhere. A super interesting video, you must watch it. So lots of that, and also humanoid football and capture the flag from DeepMind; lots of interesting behaviors emerging that we observed.
[00:37:11] Here's my favorite one. These labels here are labels provided by OpenAI: running and chasing, fort building, ramp use. And these were produced by humans who painstakingly, one by one, watched all these videos and labeled them manually.
[00:37:31] So our question is: is there a better way to discover these emergent behaviors? Perhaps some nice visualization can help us explore this complex domain a little better. That's our goal. So in this work we're going to again treat the agents like a new species and do an observational study, and what that means is that we only get to observe state and action pairs: where they are and what they're doing. And we're going to discover agent behavior by basically clustering that data; that's all we're going to do.
[00:38:12] And how do we do it? Pretty simple: a generative model. Have you covered Bayesian generative graphical models in this class? No? Gotcha, okay. So this is a graphical model; think about it as a fake or hypothetical data-generation process. How does it work? Say I'm generating the data, I created this system. I'm going to first generate a joint latent embedding space, numbers that represent all the behaviors in the system. Then, for each agent, I'm going to generate another embedding, and each embedding, when it's conditioned on state, is going to generate a policy: it decides what action to take given the state and embedding pair. And what that whole thing generates is what you see: the state and action pairs.
[00:39:09] Given this, you build a model and do inference to learn all these parameters; it's kind of the same business as a neural network, it just has a little more structure. This is completely made up, right? This is my idea of how these new species might work, and our goal is to try it and see if anything useful comes up. The way you do this, or one of the ways, is to optimize a variational lower bound. You don't need to know that; it's very cool actually, if one gets into this exponential-family business. CS 228.
[00:39:49] Okay, so here's one of the results we had. It's a domain called MuJoCo. Here we pretend that we have two agents, one controlling the back leg and one controlling the front leg, and on the right we're showing the joint embedding space, z-omega and z-alpha, while the video is running.
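A minimal sketch of the hypothetical data-generating process just described: a joint latent z_omega for the whole system, per-agent latents z_alpha drawn around it, and a policy mapping (state, z_alpha) to action probabilities. The dimensions and the policy form are invented; the actual work fits this model in the reverse direction with a variational lower bound, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

D, A, N_AGENTS, T = 4, 3, 2, 5   # latent dim, actions, agents, timesteps

def generate_episode():
    z_omega = rng.normal(size=D)                        # joint behaviour latent
    data = []
    for _ in range(N_AGENTS):
        z_alpha = z_omega + 0.1 * rng.normal(size=D)    # agent-specific latent
        W = rng.normal(size=(A, D))                     # this agent's policy head
        state = rng.normal(size=D)
        for _ in range(T):
            logits = W @ (state * z_alpha)              # policy(state, z_alpha)
            p = np.exp(logits - logits.max())
            action = int(rng.choice(A, p=p / p.sum()))  # sample an action
            data.append((state.copy(), action))         # the observed pair
            state = state + 0.1 * rng.normal(size=D)    # toy dynamics
    return data

episode = generate_episode()   # what the analyst sees: (state, action) pairs
```

Inference would recover z_omega and the z_alpha values from many such episodes; clustering and plotting those inferred embeddings is what surfaces the behavior groups in the visualization.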
[00:40:05] I'm going to try to put the video back. Okay, so now I'm going to select; this is a visualization that we built, it's online, you can go check it out. You can select a little region in agent one's space, and you see it maps to a pretty tight region in agent zero's space, and it shows pretty decent running ability, so that's cool. And now I'm going to select somewhere else in agent one that maps to kind of a dispersed area in agent zero; it looks like it's not doing as well. This is just an insight we gained for this data only, but I was quickly able to identify that this tight-mapping business kind of represents the good running behaviors versus the bad running behaviors. That's something you can do pretty efficiently.
[00:40:55] And now I'm going to show you something more interesting. Of course we had to do this, because the data is there and it's so cool: we applied this framework to OpenAI's hide and seek. This has four agents; it looks like a simple game, but it has pretty complex structure, 100-dimensional observations and a five-dimensional action space.
[00:41:15] In this work, remember, we pretend that we don't know the labels given by OpenAI; we just shuffle them into the mix. But we can color our results with respect to their labels afterward. So again, this is the result, z-omega and z-alpha for the individual agents, but the coloring is something we didn't know beforehand; we just did it after the fact.
[00:41:39] You can see in the z-omega there's a nice pattern where we can roughly separate what makes sense to humans. But remember, the green and gray are kind of everywhere; they're mixed. So in this particular run of OpenAI's hide and seek, it seemed that those two representations were kind of entangled. The running and chasing, the blue dots, seem to be pretty separate and distinguishable from all the other colors, and that kind of makes sense, because that's the basis of playing this game; if you don't have that representation, you're in big trouble. But in the case of orange, which is fort building, it's a lot more distinguishable in hiders, and that makes sense, because hiders are the ones building the fort; seekers don't build the fort, so it's a little more entangled in seekers. Perhaps if seekers had built a more separate fort-building representation, maybe they would have won this game.
[00:42:44] So with this work, can we learn something interesting about emergent behaviors by simply observing the system? The answer seems to be yes, at least for the domains that we tested; a lot more complex domains should be tested, but these are the ones we had.
the ones we had but remember that these methods don't [00:43:03] but remember that these methods don't give you names of these clusters so you [00:43:05] give you names of these clusters so you would have to go and investigate and [00:43:07] would have to go and investigate and click through and explore [00:43:10] click through and explore and if the cluster represents super [00:43:12] and if the cluster represents super superhuman concept this is not going to [00:43:15] superhuman concept this is not going to help you and I'll talk a little more [00:43:17] help you and I'll talk a little more about the work that that we do try to [00:43:19] about the work that that we do try to help them but this is not for you this [00:43:21] help them but this is not for you this is not going to help you there [00:43:22] is not going to help you there and also if you have access to the model [00:43:26] and also if you have access to the model and the reward signal you should use it [00:43:28] and the reward signal you should use it why why dump it [00:43:31] why why dump it so next part we do use it I'm going to [00:43:33] so next part we do use it I'm going to talk about let's work with Nico and [00:43:36] talk about let's work with Nico and Natasha and Shay again [00:43:39] Natasha and Shay again so here this time we're going to [00:43:41] so here this time we're going to intervene we're going to be a little [00:43:43] intervene we're going to be a little intrusive but hopefully we'll learn a [00:43:45] intrusive but hopefully we'll learn a little more [00:43:46] little more so problem is that we're going to build [00:43:48] so problem is that we're going to build a new multi-agent system we're going to [00:43:50] a new multi-agent system we're going to build it from scratch such that we can [00:43:52] build it from scratch such that we can do control testing but at the same time [00:43:54] do control testing but at the same time we shouldn't sacrifice the performance 
[00:43:56] So we're going to try to match the performance of the original system, and we do succeed. I had a paper in collaboration with folks at Stanford, actually here, in 2020, where we proposed a pretty simple idea: you have a neural network, so why don't we embed concepts in a bottleneck in the middle, where one neuron represents "tree," another represents "stripes," and just train the model end to end? Why are we doing this? Because then, at inference time, you can actually intervene. You can say: we're predicting "zebra," and I don't think "tree" should matter, so I'm going to zero out that neuron, feed forward, and see what happens. It's particularly useful in medical settings, where there are some features that doctors don't want: we can cancel them and test. So this is the work extending that idea to the RL setting.
[00:44:50] It turned out not to be as simple an extension as we thought, and it came out pretty complex, but essentially we're doing that: we build a concept bottleneck for each agent, and at the end of the day what you optimize is what you usually do, typical PPO. Think of it as "make the multi-agent system work," plus minimizing the difference between the true concepts and the estimated concepts. That's all you do. Why are we doing this? Because you can intervene: now, agent 2, pretend that you can't see agent 1.
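The recipe just described (the usual objective plus a concept-matching penalty, with intervention at inference time) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the network shapes, the weight `lam`, and the NaN-masking convention for interventions are all made up for the example, and the PPO part is reduced to an opaque `task_loss` term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy concept-bottleneck head: obs -> predicted concepts -> output.
# In the talk this sits inside each agent's policy, trained with PPO;
# the shapes and weights here are invented for illustration.
W1 = rng.normal(size=(4, 3))  # obs (4 dims) -> concept predictions (3)
W2 = rng.normal(size=(3, 1))  # concepts (3) -> scalar head output

def forward(obs, concept_override=None):
    """Feed forward; optionally overwrite bottleneck units (intervention).

    concept_override uses NaN to mean "leave this concept alone"."""
    concepts = obs @ W1
    if concept_override is not None:
        keep = np.isnan(concept_override)
        concepts = np.where(keep, concepts, concept_override)
    return concepts, concepts @ W2

def total_loss(obs, true_concepts, task_loss, lam=1.0):
    """Talk's recipe: the usual objective ("make the system work") plus
    the squared difference between true and estimated concepts."""
    concepts, _ = forward(obs)
    return task_loss + lam * np.mean((concepts - true_concepts) ** 2)

obs = rng.normal(size=(1, 4))
c, y = forward(obs)
# Intervene: zero concept 0 ("pretend you can't see agent 1"), keep the rest.
c_int, y_int = forward(obs, concept_override=np.array([[0.0, np.nan, np.nan]]))
```

Zeroing one bottleneck unit and feeding forward is the whole intervention; comparing `y` and `y_int` (or, in the RL setting, the resulting rewards) tells you how much the agent relied on that concept.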
What happens now? That's what we're doing here. [00:45:29] We're going to do this in two domains. First domain: how many people have seen this cooking game before? Yeah, it's a pretty commonly used cooking domain in reinforcement learning, very simple. We have two agents, yellow and blue, and they're going to make soup. They bring three tomatoes to the pot, wait for the soup to cook, bring a dish to the cooking pot, and they get a reward. Their goal is to deliver as many soups as possible in a given amount of time.

[00:46:01] The concepts we use here are agent position, orientation, "agent has tomato," "agent has dish," etc.: things that are immediately available to you already. And you can of course tweak the environment to make it more fun. You can make it so that they have to collaborate, for example by building a wall between them so that they have to work together in order to serve any tomato soup, or you can leave them free to work independently or together, whatever your choice.

[00:46:31] First, just as a sanity check: you can detect the emergent behavior of coordination versus non-coordination. In the impassable environment, and supposing the RL system we trained worked, they were able to deliver some soups. Then you see what happens when we intervene. Let me explain this graph: on the left is the reward of agent 1 when there's no intervention, so that's the perfectly good world; on the right is when there was an intervention, the average over intervening on all concepts (I'll also show you each concept soon). If you compare left and right, you can tell that on the right, when we intervene, reward deteriorated quite a lot for both of them, and that's one way
to see that they are coordinating: [00:47:21] intervening on these concepts impacted a lot of their performance. But here's what was really interesting to me, and I'm curious whether anyone can guess. This is the same graph as the one you saw before, except that I'm plotting the intervention for each concept separately: intervening on teammate position, teammate orientation, "teammate has tomato," etc. It turns out that when we intervene on teammate orientation, the degradation of performance is the biggest, to the extent that we believe orientation had to do with their coordination. Can anyone guess why this might be?

[00:48:12] [Audience: the position... the orientation...]

[00:48:19] [Audience: Just a clarification question on orientation: is that the direction the teammate is facing?]

Yes.

[Audience: Then it seems like orientation would let you...]

Yes, yes, that's exactly right. Where were you when I was pulling my hair out over this question? Initially I was really puzzled: why not position? I expected it to be position. But that's exactly right: orientation is the first signal an agent can get about the other agent's next move. If they're facing the pot, they're going to the pot; if they're facing the tomato, they're going to get the tomato. A really interesting intuition, maybe too obvious to some, but I needed this graph to work it out.

[00:49:05] And of course you can use this to identify lazy agents. Look at the rightmost yellow agent, our friend just chilling in the background: he's lazy. If you train RL systems, there are always some agents just hanging out, not doing anything, and you can easily identify them using this graph: if I intervene, it just doesn't impact any of their rewards.
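The lazy-agent check comes down to a few lines: compare each agent's reward with and without intervening on its concepts, and flag agents whose reward barely moves. The agent names, numbers, and tolerance below are all invented for illustration.

```python
# Hypothetical per-agent mean rewards, without and with intervention on
# that agent's concepts (all numbers invented for illustration).
baseline   = {"agent1": 10.2, "agent2": 9.8, "agent3": 10.0, "yellow": 0.4}
intervened = {"agent1": 3.1,  "agent2": 2.9, "agent3": 4.0,  "yellow": 0.38}

def lazy_agents(baseline, intervened, tol=0.5):
    """An agent whose reward barely changes under intervention is doing
    so little that its concepts don't matter: it's just hanging out."""
    return [a for a in baseline
            if abs(baseline[a] - intervened[a]) < tol]

print(lazy_agents(baseline, intervened))  # ['yellow']
```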
[00:49:34] For the second domain, we're going to look at a little more complex domain, studying inter-agent social dynamics. In this domain there's a little bit of tension. It's called Cleanup. We have four agents, and they only get rewards if they eat apples, the green things. But if nobody cleans the river, the apples stop growing, so somebody has to clean the river. And you can see that with four agents trying to collect apples, you can just wait until someone else cleans the river and then collect the apples, and in fact that's sometimes what happens.

[00:50:15] The concepts here are again pretty common things: position, orientation, pollution positions, etc. We first plotted the same graph as in the previous domain, and it tells a story.
[00:50:36] The story here is that when I intervene on agent 1, it seems to influence agent 2 quite a lot. If you look at these three different graphs, showing how reward, idle time, and inter-agent distance were impacted when I intervened on agent 1: agents 3 and 4 are fine, but agent 2 is clearly influenced. So we thought, maybe that's true, but we kept wondering: there's a lot going on in this domain, so how do we know this is the case?

[00:51:12] So we decided to take another step. We're going to do a little more work here, but not a lot: we're going to build a graph to discover inter-agent relationships. This is the simplest, dumbest way to build a graph, but again, I like simple things. So how do you build a graph? Suppose you're building a graph between movies (this is not what we actually do, it's just to describe the idea). To build a matrix, each row is a movie, and the columns are features of these movies: length, genre of the movie, and so on. The simplest way to build a graph is to do a regression: exclude row i, then regress it on everyone else. That gives you betas, coefficients for each of the other rows, and each beta represents the strength of an edge: this movie is more related to that movie and not the other one, and ta-da, you have a graph. It's a toy story, and there are a lot of caveats (often you shouldn't do this), but it's the simplest way. So we did the same thing here: instead of a movie, we use an intervention on concept c on agent n as our node.
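The leave-one-out regression just described can be sketched directly. This is the "simplest, dumbest" version, plain least squares with none of the caveats handled, on synthetic data where one row is deliberately made a near-copy of another so the recovered edge is easy to check.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows = items (movies here; interventions in the actual work),
# columns = features. Synthetic data: row 1 is nearly a copy of row 0.
X = rng.normal(size=(5, 8))
X[1] = X[0] + 0.01 * rng.normal(size=8)

def loo_graph(X):
    """beta[i, j] = coefficient of row j when regressing row i on all
    the other rows; used as the strength of the edge i -> j."""
    n = X.shape[0]
    beta = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0)           # exclude row i
        coef, *_ = np.linalg.lstsq(others.T, X[i], rcond=None)
        beta[i, np.arange(n) != i] = coef
    return beta

B = loo_graph(X)
# The strongest edge out of item 1 points back at its near-copy, item 0.
```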
[00:52:34] To build this matrix, we use intervention outcomes, which wouldn't happen to be available without our framework: reward, resources collected, and many other things. When you build this graph, at the end of the day you get betas that represent the relationships between these interventions. Okay. So I had a figure of that matrix; apparently I removed it before I came over, but imagine there was a matrix, with a nicely highlighted edge between agent 1 and agent 4, and only that, contradicting our original hypothesis. This is the video of it. When we stared at that matrix, it turned out there were no strong edges between agents 1 and 2, so we thought, that's weird; but there were strong edges between agents 1 and 4. So we dug deeper and watched a lot of sessions to validate what was happening, and it turned out the story was a lot more complicated.
[00:53:34] Agent 1's orientation was important for agent 4, but when that fails, agents 1 and 2 kind of get cornered. You can see it in the video: agents 1 and 2, the blue and yellow agents, get into the corner together and get stuck. And this was simply accidental, a consequence of the way we built this environment; it just happened. But the raw statistics wouldn't have told us this story, that it was completely accidental; in fact there was no correlation, no coordination, between agents 1 and 2, and only after building the graph did we realize this was the case. Now, this might be a one-off case, but you know what, a lot of the emergent behaviors we want to detect will be one-off cases, and we really want to get to the truth of those rather than settling for surface-level statistics.

[00:54:31] So: can we build a multi-agent system that enables intervention and performs as well? The answer is yes. There's a graph showing the red line and the blue line roughly aligned, and that's good news: we perform as well. But remember, you need to label these concepts, or have some way of getting them (positions, orientation, and so on); removing that requirement is something we would love to do in the future. Before I go on, any questions? You shy? Cool, all right.

[00:55:13] So, I did tell you that we're not going to get the solution to move 37, and I still don't have it, okay? But I'll tell you a little bit about work I'm currently doing that I'm really excited about. We started thinking: will this understanding of move 37 happen within my lifetime? And I thought, oh, maybe not, but I kind of want it to happen. This is what research is all about, right? You start carving out a
you started carving out a [00:55:41] research right you started carving out a space where things are a little [00:55:43] space where things are a little resolvable and you try to attack that [00:55:45] resolvable and you try to attack that problem so this is our attempt to do [00:55:47] problem so this is our attempt to do exactly that to get a little closer to [00:55:50] exactly that to get a little closer to our ultimate goal or my ultimate goal of [00:55:53] our ultimate goal or my ultimate goal of understanding that move 37. [00:55:56] understanding that move 37. so before that how many people here know [00:55:57] so before that how many people here know Alpha Zero from T my yes Alpha zero is a [00:56:02] Alpha Zero from T my yes Alpha zero is a self-trained uh self-trained chess [00:56:05] self-trained uh self-trained chess playing machine that beats that has [00:56:07] playing machine that beats that has higher yellow rating than any other [00:56:09] higher yellow rating than any other humans and beats stockfish which is [00:56:11] humans and beats stockfish which is arguably no existing human can beat [00:56:13] arguably no existing human can beat stock fish so in the previous paper we [00:56:17] stock fish so in the previous paper we try to discover human chess Concepts in [00:56:21] try to discover human chess Concepts in this network so when does concept like [00:56:24] this network so when does concept like material imbalance appear in its Network [00:56:27] material imbalance appear in its Network which layer and when in the training [00:56:30] which layer and when in the training time [00:56:31] time and which we call what when and where [00:56:33] and which we call what when and where plots [00:56:34] plots and we also compare the evolution of [00:56:37] and we also compare the evolution of opening moves between humans and Alpha [00:56:39] opening moves between humans and Alpha zero these are the first couple moves [00:56:42] zero these are the first 
[00:56:42] These are the first couple of moves you make when you play chess, and as you can see there's a pretty huge difference: the left is humans, the right is AlphaZero. It turns out that AlphaZero masters, or supposedly masters, a wide variety of types of openings. Openings can be very aggressive, openings can be very boring; they can target a long-range strategy or a short-range one, very different. So that begs the question: what does AlphaZero know that humans don't? Don't you want to learn what that might be?

[00:57:16] So that's what we're doing right now; we're actually just about to evaluate. The goal of this work is: please teach the world chess champion a new, superhuman chess strategy. And we just got a yes from Magnus Carlsen, who is the world chess champion. He just lost a match, I know, but he's still champion in my mind; actually, he's still champion in two
[00:57:42] categories, actually. So the way we're doing this is that we're going to discover new chess strategies by explicitly forgetting existing chess strategies, which we have a lot of data for. And then we're going to learn a graph, a little more complicated this time, using the relationships between existing concepts, so that we can get a little more of an idea of what a new concept might look like. And my favorite part of this work (I talked about carving out) is that the evaluation is going to be pretty clear. It's not just Magnus coming in and saying, oh, your work is kind of nice, and saying nice things about our work; no, Magnus actually has to solve some puzzles, and we will be able to evaluate whether he solved them or not, so it's a kind of success-or-fail setup.

[00:58:35] I'm extremely excited. This kind of work I can only do because of Lisa, who is a chess champion herself and also a PhD student at Oxford; she has played against Magnus in the past, and against many other chess players in the world, and she's going to be the ultimate pre-superhuman filter for these concepts before they eventually get to Magnus. So I'm super excited about this. I have no results yet, but it's coming up. Yes?

[00:59:07] [Audience question, partially audible: ...generator, because there are already so many puzzles out there, so I'm assuming that there's probably something new... what are the puzzles?]

The puzzles are actually pretty simple. The way we generate concepts is within the embedding space of AlphaZero, and AlphaZero has a really weird architecture: every single latent layer in AlphaZero has the exact same spatial layout as a chessboard.
do it. So because of that, we can actually [00:59:37] identify or generate the board positions [00:59:40] that correspond to that concept, and [00:59:43] because we have MCTS, we can predict what [00:59:47] move it's going to make given that board [00:59:49] position, because at inference time it's [00:59:51] actually deterministic, the whole [00:59:53] AlphaZero thing. So we have a lot [00:59:55] of board positions, and that's all you [00:59:57] need for puzzles: you give a board [00:59:59] position and then ask Magnus to make a [01:00:01] move, we explain the concept, and then [01:00:03] give Magnus more board positions and see [01:00:05] if he can apply that concept that he [01:00:08] just learned, [01:00:12] for example. [01:00:14] Right, but it seems like you're kind of [01:00:17] underneath... [01:00:21] Yeah, so if I were to ask Stockfish [01:00:25] to [01:00:26] solve those puzzles, that would be a [01:00:27] different question, because we're [01:00:29] interested in whether we can teach a human, [01:00:31] not Stockfish. Stockfish might be able to [01:00:33] do it, that's actually an interesting, uh, thing [01:00:36] that we could do
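The puzzle-generation pipeline described in this answer (find board positions whose embeddings align with a concept direction, then read off the move the deterministic policy would play) could be sketched roughly as below. This is a toy illustration, not the actual AlphaZero code: the position table, embeddings, moves, and the function names are all invented stand-ins.

```python
# Toy sketch of concept-based puzzle generation (illustrative assumptions only).
# Each position gets a made-up latent embedding; a "concept" is a direction in
# that space. The stored "best move" stands in for AlphaZero+MCTS, which the
# speaker notes is deterministic at inference time.

def dot(u, v):
    # Plain dot product over two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

# Hypothetical table: position id -> (latent embedding, policy's move).
POSITIONS = {
    "pos_a": ([0.9, 0.1, 0.0], "Qh5"),
    "pos_b": ([0.1, 0.8, 0.2], "Nf3"),
    "pos_c": ([0.8, 0.2, 0.1], "Bxf7"),
    "pos_d": ([0.0, 0.1, 0.9], "O-O"),
}

def concept_puzzles(concept_vec, k=2):
    """Return the k position ids best aligned with the concept direction,
    each paired with the deterministic policy's move (the puzzle answer)."""
    ranked = sorted(POSITIONS,
                    key=lambda p: dot(POSITIONS[p][0], concept_vec),
                    reverse=True)
    return [(p, POSITIONS[p][1]) for p in ranked[:k]]

print(concept_puzzles([1.0, 0.0, 0.0]))  # [('pos_a', 'Qh5'), ('pos_c', 'Bxf7')]
```

The student would then be shown the top positions, told the concept, and asked to reproduce the policy's move on held-out positions, as described above.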
now that I think about it. But [01:00:38] our goal is to just teach one superhuman concept. Like, if I have, for example, 10,000 [01:00:43] superhuman concepts and only three of [01:00:46] them are digestible by Magnus, that's a [01:00:49] win. That would be a big win for this [01:00:52] type of research. [01:00:56] Questions? [01:00:59] All right, yeah, so, to wrap up: small steps towards our hopes and dreams. We [01:01:07] talked about the gap between what [01:01:09] machines know versus what we think [01:01:10] machines know, three ideas for why that might [01:01:14] be true, and the three different, maybe, angles [01:01:16] we can try to attack and answer those [01:01:18] questions from, and bridge that gap. We [01:01:21] talked about studying aliens, these [01:01:24] machines, in an observational study or a controlled [01:01:26] study; there are many other ways to study [01:01:28] a new species, uh, and I'm not an expert, [01:01:31] but anthropology and other humanities [01:01:33] studies would know a lot more [01:01:35] about this. [01:01:36] And maybe, just maybe, we can try to [01:01:40] understand move 37 at some point, [01:01:42] hopefully within my
lifetime, through [01:01:44] this chess, uh, project that I'm very [01:01:47] excited about. Thank you. [01:01:49] [Applause] [01:02:01] You talked about interpretability research [01:02:04] that crosses NLP, vision, and RL. [01:02:07] Um, do you think there's much value in [01:02:09] taking certain interpretability [01:02:10] techniques from one modality into other [01:02:12] modalities? [01:02:17] All right. [01:02:19] So it depends on your goal. I think, like, [01:02:22] think about fairness research, which, uh, [01:02:25] builds on a strong mathematical foundation, [01:02:27] and that's applicable for any [01:02:30] questions around fairness, or hopefully [01:02:32] applicable. But then, if your [01:02:36] goal is to actually solve a fairness [01:02:38] issue at hand for somebody, a real [01:02:41] person in the world, that's a completely [01:02:43] different question; you would have to [01:02:45] customize it for a particular [01:02:46] application. So there are two avenues, and [01:02:48] I think something similar is true for interpretability: [01:02:50] like, the theory work that I talked about, [01:02:52] SHAP and IG, are used across [01:02:56]
domains like vision and text, so that theory paper would [01:02:58] be applicable across domains. Things [01:03:01] like RL and the way that we built that [01:03:03] generative model, you would need to test [01:03:05] a little bit more to make sure that it [01:03:07] works in NLP. Uh, I don't even know how to [01:03:10] think about agents in NLP yet, so it will [01:03:13] need a little bit of tweaking, but both [01:03:14] directions are fruitful. [01:03:20] John has a question. [01:03:23] I saw the recent work in which [01:03:26] some amateur Go players found a very [01:03:29] tricky strategy to trip up, I think it [01:03:31] was, AlphaGo, and that seemed like a [01:03:34] concept that humans know that machines [01:03:36] don't, in that Venn [01:03:38] diagram. Any thoughts about that? Yeah, actually, it's [01:03:40] funny you mention that: Lisa can beat [01:03:44] AlphaZero pretty easily, and it's a [01:03:47] similar idea, because, uh, if you kind [01:03:50] of know what the most unseen, out-of-distribution [01:03:52] moves are, uh, she [01:03:55] can break AlphaZero pretty easily. At [01:03:56] least, I guess that if Lee Sedol had [01:03:59]
known something more about AI, then maybe he [01:04:01] would have tried to confuse AlphaGo. But [01:04:03] the truth is, you know, it takes a lot; [01:04:05] it's a high-stakes game. Like, he said, oh, [01:04:07] he's, like, a famous star worldwide, so [01:04:10] he wouldn't want to make a move that [01:04:12] would be seen as a complete mistake, like [01:04:15] the one that Magnus made a couple of days [01:04:17] ago that got on news feeds everywhere, [01:04:19] that he made this, like, century-wide [01:04:21] mistake, and that probably hurts. [01:04:26] Any other questions? [01:04:33] ...zero, for example. I just like building [01:04:36] machine learning that [01:04:38] plays these games really well. [01:04:40] Um, [01:04:52] well, these works that I've presented are [01:04:54] pretty new, [01:04:56] um, but there has been a bit of [01:04:57] discussion in robotics about applying [01:05:00] these to robotics, and of [01:05:02] course I can't talk about details, but, [01:05:05] um, [01:05:06] uh, with things like [01:05:08] reinforcement learning in the wild, [01:05:10] people worry about, uh, some of the [01:05:11] surprises, right? If you have a test for [01:05:14] it, like if you have a unit test for it,
[01:05:16] you're never going to fail, because [01:05:18] you're going to test before you deploy. I [01:05:21] think the biggest risk for any of these [01:05:22] deployed systems is the surprises that [01:05:26] you didn't expect. [01:05:27] So my work around visualization and [01:05:30] others aims to help you with that. So we [01:05:34] may not know the names of these surprises, [01:05:36] but here's a tool that helps you better [01:05:38] discover those surprises before someone [01:05:41] else does, or someone else gets harmed. [01:05:51] Um, this is kind of an open-ended [01:05:52] question, but I was wondering: we're [01:05:54] talking about a lot of ways in which we [01:05:56] try to kind of visualize or understand [01:05:59] what's going on in the representations [01:06:00] inside the machine, but I was wondering [01:06:02] whether we could turn it around and try [01:06:05] to teach machines to tell us, like, [01:06:07] using our language, what they're doing, [01:06:10] and align their representations [01:06:12] with ours, and then get the [01:06:14] machine to do the translation for us [01:06:16] instead of us going into the
English. [01:06:18] Yeah, great question. It's a really [01:06:21] interesting question, because, um, that's [01:06:22] something that I kind of [01:06:25] tried in my previous work called [01:06:27] Testing with Concept Activation Vectors. [01:06:30] So that was to map human language into [01:06:33] machine space, so that they can only [01:06:35] speak our language, because I understand [01:06:36] my language, so just talk to me in my [01:06:38] language. The challenge is, how would [01:06:41] you do that for something like AlphaZero? [01:06:43] Like, we don't have a vocabulary for [01:06:46] it, like move 37. Then there's going to be [01:06:49] a lot of missing valuable knowledge that [01:06:52] we might not get from the [01:06:54] machine. So I think the approach has to [01:06:56] be both ways: we should leverage as much [01:06:58] as we can, while acknowledging that even [01:07:01] that mapping, trying to map our [01:07:04] language to machines, is not [01:07:06] going to be perfect, because it's a kind [01:07:09] of proxy [01:07:11]
for what we think, like, a penguin is. [01:07:13] There's psychology research [01:07:15] that says everyone thinks very [01:07:16] differently about what a penguin is. Like, [01:07:19] if I show a picture of a penguin, everyone [01:07:21] is thinking of a different penguin right now, [01:07:24] right? Australia has the cutest penguin, [01:07:26] the fairy penguin; I'm thinking of that, [01:07:27] right? I don't know how many people are [01:07:30] thinking that. So given that we're [01:07:31] so different, the machine's going to think [01:07:33] something else, so how do you bridge that [01:07:36] gap? Extend that to 100 concepts, and [01:07:38] composing those concepts, it's going to get [01:07:40] out of hand very soon. So there are pros [01:07:43] and cons. I'm into both of them. I think for [01:07:45] some applications, [01:07:48] exclusively just using human [01:07:50] concepts is still very helpful; it gets [01:07:55] you, uh, halfway. But my ambition is that [01:07:56] we shouldn't stop there. We should [01:07:59] benefit from them by having [01:08:01] them teach us new things that we didn't know before. [01:08:19]
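The concept-activation-vector idea mentioned in this answer (from the Testing with Concept Activation Vectors work) can be illustrated with a toy sketch. Note the simplifications: a difference-of-means vector stands in for the linear classifier the real method trains, and all activations and gradients below are made-up numbers rather than a real network's.

```python
# Toy TCAV-style sketch (illustrative assumptions only): a human concept is
# represented as a direction in a model's activation space, learned from
# examples of the concept versus random examples.

def mean(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_activation_vector(concept_acts, random_acts):
    """Difference-of-means stand-in for the linear-classifier normal that
    separates concept activations from random activations."""
    mc, mr = mean(concept_acts), mean(random_acts)
    return [c - r for c, r in zip(mc, mr)]

def tcav_score(cav, gradients):
    """Fraction of inputs whose gradient aligns positively with the concept
    direction, i.e. how sensitive the model's output is to the concept."""
    dots = [sum(c * g for c, g in zip(cav, grad)) for grad in gradients]
    return sum(d > 0 for d in dots) / len(dots)

concept = [[1.0, 0.2], [0.9, 0.1]]   # activations on concept examples (toy)
random_ = [[0.1, 0.3], [0.2, 0.2]]   # activations on random examples (toy)
cav = concept_activation_vector(concept, random_)
print(tcav_score(cav, [[1.0, 0.0], [0.5, 0.1], [-1.0, 0.0], [0.2, -0.3]]))  # 0.75
```

The challenge raised above is exactly that this direction only exists for concepts we can name and supply examples for, which is why it cannot capture something like move 37.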
[01:08:23] But, like, um, I don't know, but, like, trying to locate specific strategies in [01:08:24] the embedding space, [01:08:27] what are the alternatives? I guess I [01:08:30] don't know the alternatives, just because [01:08:32] I feel like it's the wrong thing. [01:08:35] That's possible. So, like, maybe it's some [01:08:37] transformed space of the embedding space [01:08:40] in AlphaZero, maybe it's a function, [01:08:42] uh, applied to that embedding space, so [01:08:44] thinking about it as a raw vector [01:08:47] is a dead end. Could be, uh, we'll see [01:08:50] how this chess project goes in a couple of [01:08:52] months; I might rethink my [01:08:55] strategy. But interesting thought. [01:08:57] Yeah, so I'm a psychology major, and I do [01:09:00] realize that a lot of the stuff we're [01:09:02] trying to do here is, like, at least, figuring out how [01:09:04] our brains work, [01:09:07] and so I was wondering: would there be [01:09:11] um, stuff from this that's applicable [01:09:14] to neural networks, and, on the contrary, [01:09:16] could interpretability [01:09:18] and the study of neural networks [01:09:19] help us
understand, uh, stuff about [01:09:21] our own brain. Yeah, I talked to Geoffrey [01:09:24] Hinton; you know, he would really like [01:09:26] this. So I believe, and you probably [01:09:28] know about this history, I think that's [01:09:29] how it all started, right? The whole [01:09:32] neural network idea was to understand the human [01:09:34] brain. [01:09:35] Um, [01:09:37] so that's the answer [01:09:39] to your question: interesting. However, in [01:09:40] my view, there are some biases that we [01:09:44] have in neuroscience because of [01:09:47] the limitations of tools, like physical [01:09:48] tools, and the availability of humans that [01:09:51] you can poke into. I think that influences [01:09:53] interpretability research, and I'll try [01:09:55] to give you an example of what I mean. So, [01:09:57] you know the cat experiment, the [01:09:59] horizontal-line and vertical-line neurons [01:10:00] in the cat brain: they put the probe in and [01:10:03] figured out this one neuron detects [01:10:04] vertical lines, and you can, like, validate it. [01:10:06] It's really cool; if you look at the [01:10:08] video, the video is
still online. [01:10:10] Yeah, what is it? [01:10:12] Yes, yes, yes. Uh, so why did they do [01:10:15] that? Well, because you had one cat, a [01:10:18] poor cat, and, uh, we can only [01:10:22] probe a few neurons at a time, right? [01:10:25] So that meant that a lot of [01:10:27] interpretability research actually [01:10:28] looked at, or was very focused on, neuron-wise [01:10:31] representations, like, this one neuron [01:10:34] must be very special. I actually think [01:10:35] that's not true; that was limited by our [01:10:38] physical ability to [01:10:40] probe organisms. But in a neural network you [01:10:42] don't have to do that: like, you can apply [01:10:43] functions to embeddings, you can change [01:10:45] the whole embedding to something else, [01:10:47] override it. So that kind of, uh, thinking is actually [01:10:50] an, uh, obstacle for us rather [01:10:54] than helping. [01:10:58] Yeah. [01:11:00] Okay, maybe we should call it there. [01:11:03] Um, so for Thursday, you're not [01:11:05] having, uh, lecture on Thursday, [01:11:08] um, there'll be TAs and me here, so if you [01:11:11] have any, you know, last
minute panics on [01:11:14] your project. So I think we might have [01:11:16] some great insight to help you; we [01:11:19] probably won't, actually. [01:11:21] Um, [01:11:30] final lecture of CS224N today. [01:11:34] [Applause]
================================================================================ LECTURE 021 ================================================================================
Stanford CS224N NLP with Deep Learning | 2023 | Python Tutorial, Manasi Sharma
Source: https://www.youtube.com/watch?v=8j4wpU98Q74
---
Transcript
[00:00:05] All right, hi everyone. [00:00:07] Um, welcome to the 224N Python review [00:00:09] session. [00:00:11] Um, the goal of the session really will [00:00:12] be to sort of give you the basics [00:00:15] of Python, and NumPy in particular, [00:00:18] that you'll be using a lot in your [00:00:19] second homework, [00:00:20] um, and the homework will come after that [00:00:22] as well. [00:00:23] Um, we're sort of pitching this tutorial [00:00:25] at the background of anyone who hasn't [00:00:28] touched programming languages to some [00:00:30] extent, [00:00:31] um, but also for people who have, we'll be [00:00:33] sort of going through a lot of that [00:00:34] material very quickly, and we'll be [00:00:35] progressing to NumPy as [00:00:37]
well. Um, and as I mentioned, first and foremost, [00:00:38] the session is really meant for the [00:00:40] people who are here in person, so if [00:00:41] you'd like me to slow down or speed up at [00:00:43] any point, or need time for clarifications, [00:00:46] feel free to ask us; it's really [00:00:47] meant for you first, um, here, and [00:00:50] I really would like it to be sort of an [00:00:51] interactive session as well. [00:00:53] All right, so these are the topics [00:00:55] we'll be covering today: [00:00:57] um, going through, first of all, why Python [00:00:58] as a language, why we have chosen it for [00:01:00] sort of this course, and in general why [00:01:02] people prefer it, to some extent, [00:01:04] for machine learning and natural [00:01:05] language processing; [00:01:06] um, some basics of the language itself; [00:01:08] common data structures; and then getting [00:01:10] to sort of the meat of it through NumPy, [00:01:13] which, as I mentioned, you'll be [00:01:14] using extensively in your homeworks [00:01:15] going forward; and then some practical [00:01:17] tips about how to use [00:01:18] um, things in Python
[00:01:21] All right, so first thing: why Python? Um, [00:01:23] so a lot of you who might have, um, been [00:01:26] first introduced to programming might [00:01:28] have done Java before; a lot of people [00:01:30] use MATLAB in [00:01:31] other fields as well. [00:01:34] Um, so why Python? Python is generally [00:01:36] used, um, for one, because it's a very high-level [00:01:38] language; um, it can look very [00:01:40] English-like, and so it's really easy to [00:01:42] work with, especially for people when [00:01:43] they're getting started out. It has a lot of [00:01:45] scientific computational functionality [00:01:47] as well, similar to MATLAB, so when we [00:01:49] talk about NumPy you'll see that it has [00:01:50] a lot of frameworks for very quick [00:01:52] and efficient operations involving math [00:01:54] or matrices, and that's very useful [00:01:56] in applications such as deep learning. [00:01:59] And for deep learning in particular, a [00:02:01] lot of frameworks that people use, [00:02:02] for example PyTorch and [00:02:04] TensorFlow, interface directly with [00:02:06] Python,
and so, for those main [00:02:07] reasons, people generally tend to use [00:02:09] Python within deep learning. [00:02:12] Okay, so the setup information is in the [00:02:15] slides if you'd like to look at them [00:02:16] offline. [00:02:17] Um, I will be sort of jumping over that [00:02:18] for now, because I want to sort of get to [00:02:20] the introduction to the language itself, [00:02:22] and if we have time, come back to sort [00:02:23] of the setup information. A lot of it's [00:02:25] pretty direct; you can walk through it, um, [00:02:27] it gives you steps for sort of how to [00:02:29] install packages, [00:02:31] um, what a conda environment is, for [00:02:33] example, and gets you set up with your [00:02:35] first working Python environment, so you [00:02:36] can sort of run simple and basic [00:02:38] commands to get used to the language. But [00:02:40] for now I'm going to be skipping over [00:02:41] this and coming back to it if we have [00:02:42] time. [00:02:44] All right, language basics. So, [00:02:47] um, in Python you have variables, and [00:02:50] these variables can take on multiple [00:02:51] values. The assignment operation, there's
[00:02:53] an equal sign, will allow you to assign [00:02:56] a particular value to a variable. A [00:02:57] nice thing with Python is you don't have [00:02:59] to declare the type of the variable [00:03:01] to begin with and then only assign [00:03:03] values of that type. So, [00:03:05] for example, in certain languages we [00:03:07] first say that this variable x is only [00:03:10] going to be of type int, and any value [00:03:12] aside from that assigned to it will [00:03:13] throw an error. Python's pretty flexible, [00:03:15] so if I want to, I can reassign it: I can [00:03:17] start with x equal to 10, and then [00:03:19] later on, like five lines later, I can say [00:03:21] x is equal to "hi" as a string, and there [00:03:24] would be no issue. [00:03:25] Um, you can do simple mathematical [00:03:27] operations, such as with the plus and division [00:03:29] signs; you can do exponentiation, which is [00:03:33] raising one value to another value, so x [00:03:36] to the power of y, for example, using the [00:03:37] double asterisk. [00:03:39] Um, you can do type casting for [00:03:41]
float division: so if you want to ensure [00:03:43] your values are being divided resulting [00:03:45] in a float value, and not just dividing [00:03:46] two integers, you can cast to different [00:03:48] types, like float. If you want something [00:03:50] to be explicitly an int, you can also [00:03:51] just put int instead of float, [00:03:53] with brackets around the result, and [00:03:56] that'll give you an integer value. And [00:03:58] then you can also do type casting to, for [00:04:01] example, convert from integers to strings. [00:04:03] So in this case, if I wanted to, instead [00:04:05] of doing 10 plus 3 as a mathematical [00:04:08] operation, just write out "10 [00:04:10] plus 3", then I can convert the x and y [00:04:13] values, for example, to strings and then [00:04:15] add the plus sign as a character as [00:04:19] well, to create a string. And a lot of [00:04:20] these common operations you can look up [00:04:22] online as well; people have lists of [00:04:23] them, and you can just see how they're sort of [00:04:25] done in Python. [00:04:27] All right. [00:04:28] Um, some other quick [00:04:30]
[00:04:28] Some other quick things. So Boolean values, True and False: they're always used with capital letters, while in some other languages they might be lowercase, so that's just one thing to know. Python also doesn't have a null value; the equivalent of a null value is None. So sometimes, when you want to say that something doesn't have a value, or you want to return nothing, saying "I'm not really doing anything here", or you want to do checks, for example in if statements, you can assign it to None. So None sort of functions as a null equivalent: you're not really returning anything, it doesn't have a value, and it's not the same as zero.
[00:05:05] Another nice thing about Python is lists, which are mutable lists of objects; we'll come to that a little bit later. That means that you can change them, and they can be of any type, so you can have a mixture of integers, None values, strings, etc. And yeah, functions can return the None value as well.
[00:05:26] Another quick thing: instead of using the double ampersand, &&, as you might in some other languages, with Python, as I mentioned earlier, it's very English-like, so you can actually just write out: if x is equal to three "and", in English, y is equal to four, then return True, or something. It's quite nice that way: you can use and, or, and not. And then the comparison operators, equals-equals and not-equals, will check for equality and inequality. These are pretty standard, I feel, across many languages, and you can use them in Python as well. And remember, the equals-equals sign is different from the assignment operator: this one checks for equality, that one is just assigning a value. So, a single equal sign versus two of them.
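A minimal sketch of the Boolean, None, and comparison points above; the values are just for illustration:

```python
# Booleans are capitalized in Python.
flag = True

# None is Python's null equivalent; it is not the same as zero.
result = None
print(result is None)   # True
print(result == 0)      # False

# Logical operators are written out in English: and, or, not.
x, y = 3, 4
if x == 3 and y == 4:   # == checks equality; a single = would be assignment
    print("both match")
print(x != y)           # True: the not-equal comparison
```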
[00:06:10] All right. And then also, in Python you don't use braces around blocks; you basically use spaces or tabs, so indents of either two or four, to break up what is contained in a function, or contained within an if statement, a for statement, or any loops, for example. The main thing is that you can choose whether to do two or four; you just have to be consistent throughout your entire code base, otherwise it will throw an error.
[00:06:37] Now let's go to some common data structures, and for this we'll transition to the Colab. This will sort of show you in real time; this is, by the way, a Colab. A Colab is basically a Jupyter notebook, for those of you who are familiar with those, that you can use, hosted on Google's servers. The really nice thing about Jupyter notebooks is that you don't have to run an entire file all together; you can run it step by step, in what are called cells. So if you want to see an intermediate output, you can see that pretty easily that way, and you can also write, for example, a lot of descriptions pertaining to the cells, which is really nice to have as well. So a lot of people tend to use these when they're starting off a project and want to debug things, and Colab allows you to use these Jupyter-notebook-type applications, hosted on their servers, for free basically, so anyone can create one of these and run their code.
[00:07:32] All right, so lists are mutable arrays. Mutable means that you can change them: once you declare them, you can add to them, you can delete from them, and they're optimized for that purpose, so they expect to be changed very often. We'll come to what are called numpy arrays later, and those tend to be pretty much fixed when you create a
new one; when you change one, you'd basically have to create a new array, which will have the additional information. Lists, by contrast, are highly optimized for changing things. So if you know, for example, that you're in a loop and you're adding different elements to, let's say, some bigger entity, you'd want to use something like a list, because you're going to be changing it very often.
[00:08:06] So let's see how they work. We start off with a names list with Zach and Jay. You can index into the list, which means that you can pull out elements of the list depending on what's called the index, which is what place that value is at within the list. So zero refers to the first element: Python is what's called zero-indexed, which means it starts with zero and then goes to one. So here, index zero will be Zach.
[00:08:34] And then let's say I want to append something to the end: to add something to the end of the list, the term is append, not add. And so if I append, I now have the original list itself with the added last element. And what would currently be the length of this? It would be three, because you have three elements, and you can quickly get that by using the len function: not "length", just the three letters, len.
[00:09:01] It's also really nice because Python has overloaded the plus operation to be able to concatenate lists. So here I have a separate list, and all you need for a list definition is just brackets, so this is a separate list altogether, even though I haven't saved it in a variable: just Abby and Kevin. And I can just do a plus-equals, which means that names is equal to names plus the Abby-and-Kevin list.
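The list operations walked through above can be sketched like this; the names mirror the lecture's examples, though the exact Colab cells may differ:

```python
# Start with a list of two names.
names = ["Zach", "Jay"]
print(names[0])   # "Zach": Python is zero-indexed

# append adds a single element to the end, in place.
names.append("Richard")
print(len(names))   # 3: the function is len, not length

# + is overloaded to concatenate lists, so += extends names with another list.
names += ["Abby", "Kevin"]
print(names)   # ['Zach', 'Jay', 'Richard', 'Abby', 'Kevin']
```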
[00:09:24] And this should output the full list. You can create lists by just putting the plain brackets, or from an existing list, and then, as I mentioned earlier, your list can have a variety of types within it. So here, this list contains an integer value; a list value, so you can have a list of lists, with as many sub-lists as you like; a float value; and a None value. And this is completely valid within Python.
[00:09:48] Slicing refers to how you can access only parts of the list. So if, for example, in this numbers list I only want 0, 1, 2, slicing is a way that you can extract only those parts. The way slicing works is that the first index is included and the last index is excluded. So here I start with 0, 1, 2, 3: 3 is not included, and so 0, 1, 2 will be printed out.
[00:10:15] There are also shorthands. If you know that you're going to be starting with the first element of the list, if you know I want 0, 1, 2 and it starts with zero, then you don't even need to include the first index: you can just leave that out and include only the last index, which will be excluded. So that would be blank, colon, 3. And same deal with the end: if you know that you want to take everything from, let's say, five and six till the end of the list, you can put in whatever start you'd like, so 0, 1, 2, 3, 4, 5 till the end, and leave the end blank.
[00:10:49] Fun fact: when you take just the colon on its own, it'll take everything in the list, but it'll also create a duplicate in memory. That's a very useful thing to know.
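The slicing rules just described, sketched on a small numbers list (assumed here to be 0 through 6, matching the spoken examples):

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

# The start index is included and the end index is excluded.
print(numbers[0:3])   # [0, 1, 2]: index 3 itself is left out

# Shorthands: omit the start when it is 0, omit the end to run to the end.
print(numbers[:3])    # [0, 1, 2]
print(numbers[5:])    # [5, 6]

# A bare colon takes everything, but as a new copy in memory.
duplicate = numbers[:]
duplicate[0] = 99
print(numbers[0])     # still 0: the original list is unaffected
```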
[00:11:02] Because sometimes, when you pass lists in Python (which is out of scope of this tutorial), you only pass the reference to the list, so if you change the list, the original gets changed. The bare colon creates an entirely separate copy in memory of the exact same list, so if you make any changes to the copy, it won't affect your original list. So this is a pretty neat way to do that.
[00:11:20] And then another fun thing that Python has, which is pretty unique, is that you can index negatively. Negative indexing means you index from the back of the list: -1 refers to the last element of the list, and -3 will refer to the third-last element. So what [-1] will give you here is six, and what [-3:] will give you is everything from the third-last element: minus three, minus two, minus one, till the end. And then this one seems kind of confusing, right: 3 to -2. What this will do is start at index three (count 0, 1, 2, 3) and then leave off the last two elements, minus one and minus two, because the end is excluded with slicing, so you'd only get three and four. That's what this is.
[00:12:04] Okay, that's about lists. Tuples are immutable arrays: once you declare their values, they cannot be changed. So, remember we started with the list of Zach and Jay; with tuples you also start with Zach and Jay, and you can still access them: I can still print out names[0], the same as I did with lists. But if I try to change it, in this case it'll throw an error: tuples, once you've instantiated them, cannot be changed. And to create an empty tuple, you can either use the tuple constructor, or oftentimes you can just use the parentheses by themselves, so you can just write, for example, as was done here, empty parentheses to instantiate one.
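Negative indexing and tuple immutability, sketched with the same assumed numbers list and names:

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

# Negative indices count from the back of the list.
print(numbers[-1])    # 6: the last element
print(numbers[-3:])   # [4, 5, 6]: from the third-last element to the end
print(numbers[3:-2])  # [3, 4]: start at index 3, drop the last two

# Tuples look like lists but are immutable.
names = ("Zach", "Jay")
print(names[0])       # indexing works the same as with lists
try:
    names[0] = "Richard"   # any attempt to change a tuple raises an error
except TypeError:
    print("tuples cannot be changed")

empty = ()   # an empty tuple, from the bare parentheses
```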
[00:12:45] All right. And yeah, this one we'll come to a little bit later with shapes, but you can also have a tuple of a single value, and all you have to do there is just put the value and put a comma after it. That just shows that you have a tuple, which is like an immutable array, so you can't change it: it's a list, but of only one item. And that's here.
[00:13:04] Okay, I'll quickly move to dictionaries. For those of you who might be familiar with other languages, this is the equivalent of a hash map or a hash table. What this is useful for, essentially, is mapping one value to another in a really, really quick way. So if I want to map, for example, a string to an index, which you will happen to do a lot in your homeworks, this is a really useful way to do that. What it does is let you instantiate this dictionary, and it says that the key is going to correspond to the string value, whatever it is. And so any time I want to retrieve the string value, I just use this dictionary and index by the key, which is what I do here, and then it outputs the corresponding value, and it does that really, really quickly.
[00:13:47] And yeah, so it's really useful and very commonly used, especially when, for example, you have a list of strings or a list of items and you want to have a corresponding index for them. Because, as you'll see in NLP, oftentimes you're working with indices, and numbers in particular, so it's a really great way to move from string formats to just numerical index values. There are some other things you can do with dictionaries: you can check whether certain elements are in there.
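A dictionary sketch along the lines described; the phone-number strings are made up for illustration, and `in` and `del` are Python's standard membership check and deletion operations:

```python
# A dictionary maps keys to values, like a hash map or hash table.
# The phone-number strings here are made up for illustration.
phonebook = {"Zach": "12-37", "Jay": "34-23"}

# Retrieval by key is very fast.
print(phonebook["Jay"])   # "34-23"

# Membership checks test the keys; indexing a missing key raises a KeyError.
print("Monty" in phonebook)   # False
print("Zach" in phonebook)    # True

# del removes an entry.
del phonebook["Zach"]
print("Zach" in phonebook)    # False
```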
[00:14:16] So if you, for example, try to index the phone book with Monty, it'll throw an error, because there's no key that says Monty in that phone book dictionary. And so sometimes you might want to do checks before you extract a value. So this will just check membership: for example, if I print whether Monty is in the phone book, it should say False; or, for example, here, Kevin in the phone book should say False; while something that's actually in that dictionary, Zach, will be True.
[00:14:39] Okay, and then if you'd like to delete an entry from the dictionary, you can just do that using the del command.
[00:14:47] All right, let's move to loops, quickly. So loops are a really great way to optimize doing the same kind of operation repeatedly. They're also a great way to sequentially go over those list-type or array-type objects we were talking about earlier: you know, you have a list of names, right, so how do you access all
of them? Loops are a really great way to do that.
[00:15:12] In Python they've abstracted away a lot of the confusing parts that there might be in other languages. You can, for example, first index on numbers: what you do is you have a range function that you call. So here you say range, and the range of the last number you'd want; what this range function will return is 0, 1, 2, 3, 4, and that's what will be stored in this i value, and here it's just printing out that i value. So if I wanted, for example, to loop over a list of size 10, I just have to do "for i in range(10)" and then index the corresponding part of the list.
[00:15:46] You technically don't even have to do that, because in Python you can just directly get the elements of the list. So here I have a list of names, where I have Zach, Jay, and Richard; instead of first taking the length of the list and then doing this range operation, I can just directly say "for name in names" and then print out the name, and it will just directly get each element in the list.
[00:16:07] But sometimes you might want both: you might want both this element, Zach, as well as its position in the list, and for that you can actually use this really helpful function called enumerate. Enumerate will basically pair those two values, and it'll give you both the value, which is name here, for example, and its corresponding index within the list, both together. So that's really convenient, versus, for example, having to do the slightly more complicated range operation, where you first take the range and then index into the list.
[00:16:38] How do you iterate over a dictionary? If you want to iterate over what are called the keys, all of those first items that you put into the dictionary, you can just iterate the same way you would over a list: you just say "for name in phonebook", for example, and you can output the keys. If you want to iterate over what is stored under the keys, which are called the values, you'd have to do the dictionary's dot-values; and if you want both, you use the dot-items function, and that will print out both of these.
[00:17:11] All right. So this has covered the overarching, most commonly used structures, lists, dictionaries, and then loops, and how to efficiently use them within your code. We'll quickly be moving to the meat of what is really, really strong about Python, and what you'll be using a lot for your coming homework, essentially Homework 2, which is numpy.
[00:17:34] Okay, so for numpy also I'm going to be going to the Colab; we just quickly wanted to mention what numpy is.
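The dictionary-iteration patterns just described can be sketched as follows; the phone-book contents are made up for illustration:

```python
phonebook = {"Zach": "12-37", "Jay": "34-23"}   # made-up numbers

# Iterating over a dictionary directly walks its keys.
keys = [name for name in phonebook]

# .values() walks the stored values; .items() gives (key, value) pairs.
values = list(phonebook.values())
items = list(phonebook.items())
print(keys)     # ['Zach', 'Jay']
print(values)   # ['12-37', '34-23']
print(items)    # [('Zach', '12-37'), ('Jay', '34-23')]
```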
so numpy is basically an optimized Library [00:17:44] an optimized Library um for mathematical operations you know [00:17:46] um for mathematical operations you know people tend to like math lab because [00:17:47] people tend to like math lab because it's very very useful for these [00:17:48] it's very very useful for these mathematical operations which people use [00:17:50] mathematical operations which people use in their research [00:17:51] in their research um pythons sort of solution to that is [00:17:54] um pythons sort of solution to that is to have a separate Library entirely [00:17:55] to have a separate Library entirely where they make use of subroutines which [00:17:59] where they make use of subroutines which are sort of like sub languages sorry sub [00:18:01] are sort of like sub languages sorry sub scripts that are written in a different [00:18:03] scripts that are written in a different language called C or C plus plus that [00:18:05] language called C or C plus plus that are highly optimized for efficiency so [00:18:08] are highly optimized for efficiency so the reason C and C plus plus are much [00:18:10] the reason C and C plus plus are much faster than python is because they're [00:18:12] faster than python is because they're closer to what's called machine language [00:18:13] closer to what's called machine language which is what the computer will read I [00:18:15] which is what the computer will read I mentioned earlier one of the nice things [00:18:16] mentioned earlier one of the nice things about python is it's kind of high level [00:18:18] about python is it's kind of high level it looks like English right to some [00:18:19] it looks like English right to some extent you know we say literally like is [00:18:21] extent you know we say literally like is you know if x is equal to one or X is [00:18:23] you know if x is equal to one or X is equal to two right but um that also [00:18:26] equal to two right but um that also means that there's a 
[00:18:27] But that also means that there's a lot more translation required on the computer's part before it understands what you mean. That's useful when we're writing code that we want to understand, but it's a little less useful when you're running a lot of operations on a lot of data. So the real benefit of something like NumPy is that, if you have your memory and your data in a particular format, it'll call these subroutines in a different language and make them very, very fast. [00:18:55] Almost everyone in NLP is very familiar with this, because you'll be running a lot of operations on, for example, co-occurrence matrices, which are really big, and it's very useful to have them optimized for time. So that's really the benefit of using NumPy.
[00:19:08] And NumPy is basically built for all these math, matrix, and vector calculations. A NumPy array is different from a list, although you can easily translate between a list and a NumPy array. NumPy arrays are, as I mentioned, specifically designed to be used in these subroutines, so they have a specific format and are instantiated differently. You can translate between them and your standard lists easily, but note that you can only do NumPy operations on NumPy arrays; you can't do NumPy operations on lists directly. You'd first have to convert them, which is really simple, you just use the `numpy.array` function. [00:19:42] Okay, so for NumPy we're going to go back to the Colab. As I mentioned earlier, the real strength of NumPy is that it supports these large multi-dimensional arrays and matrices, with very optimized high-level mathematical functions.
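A minimal sketch of moving between lists and arrays as just described (`tolist` is the standard way back to a plain list):

```python
import numpy as np

nums = [1, 2, 3]          # a plain Python list
arr = np.array(nums)      # convert to a NumPy array
back = arr.tolist()       # and back to a list

# NumPy operations work on the array, not on the original list:
doubled = arr * 2         # element-wise: array([2, 4, 6])
# nums * 2 would instead repeat the list: [1, 2, 3, 1, 2, 3]
```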
[00:19:58] And just to step back for a quick second, what is a matrix? Matrices are basically rectangular structures of numbers that you can treat with specific rules for operations between different kinds of objects. So if you have a lot of data, instead of potentially multiplying things individually, you can store them in this rectangular format, and you have specific rules about how this matrix interacts with a different one. By doing that, which is matrix multiplication, or matrix math, you can do a wide variety of mathematical operations. [00:20:32] A vector, conventionally (none of these are hard and fast rules), is a matrix in one dimension, so it's usually a row vector or a column vector, which
usually just means that it's a list of values in only one dimension. [00:20:48] So, for example, here, when I come down to `x = numpy.array([1, 2, 3])`, that's a list in only one dimension, versus, for example, `z` down here, which is what's called a two-dimensional array, because you have rows, for example 6, 7 and then 8, 9, whereas in the first one you only have three values in one dimension. That's the conventional difference between the two. [00:21:16] Another convention is that matrices generally refer to two-dimensional objects; so `z`, as I mentioned, is two-dimensional. You might have heard the word tensor as well; tensors, by convention, are usually higher-dimensional objects, so instead of having two dimensions, you know, (2, 2), you can
have n dimensions: you can have (2, 2, 2, 2, 2) for five or six dimensions, and those are perfectly valid to do mathematical operations on; they're often colloquially called tensors. [00:21:46] In addition, and this will be covered in the next tutorial, on PyTorch, those larger tensors are also optimized for efficiency on GPUs, so there they're called tensors in a more concrete way, because you're using these tensors with PyTorch and other packages to directly do those quicker GPU operations for deep learning. So that's a quick terminology difference between the three. [00:22:12] Okay, so now let's start off with some quick representations of how these matrices and vectors are represented in NumPy. This goes back to your question about the difference between a shape of (3,) versus a shape of (1, 3).
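The vector/matrix/tensor terminology above can be illustrated like this; the specific values are arbitrary:

```python
import numpy as np

vector = np.array([1, 2, 3])             # 1-D: shape (3,)
matrix = np.array([[6, 7], [8, 9]])      # 2-D: shape (2, 2), like z in the lecture
tensor = np.zeros((2, 2, 2, 2, 2))       # 5-D: colloquially, a "tensor"

# Each is just an ndarray; only the number of dimensions differs.
print(vector.ndim, matrix.ndim, tensor.ndim)   # 1 2 5
```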
[00:22:25] So (3,) in NumPy arrays usually just means that you have one list, like [1, 2, 3], three values, versus, if you add another list on top of that, (1, 3) essentially refers to the fact that there's a list of lists. Any time you have two dimensions, it always means that there's a list of lists, with each inner list being, for example, a row. So here, (1, 3) means that there's one row and then three columns: it's saying there's one row of [3, 4, 5], essentially, and then each of those values is a separate column. [00:23:01] You can easily reshape between them, so these are basically the same values, but from NumPy's perspective, you'll see a little bit later that for operations such as broadcasting you need to have it, for
example, sometimes in this (1, 3) format or the (3, 1) format. [00:23:17] And, as I said, (3,) just represents three numbers; (1, 3) means one row of three elements; (3, 1) means you'll essentially have a separate array in each column, so you'll see boxes around each of them. There's an example that comes a little bit later in this Colab which will make it a little clearer. [00:23:36] So here, if you look at the difference between `x` and `y`: one of them has only one bracket, which just says it's one list, only one list of 1, 2, 3; the second one has two brackets, which says it's a list with only one list in it. Whether it's a list of a list is really the main difference between these two representations. [00:23:55] So I could have, let's say, a separate one; I'm going to call this `a`, and I just do this.
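As a sketch of the three shapes being contrasted (the values are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])        # one bracket: shape (3,), just three numbers
y = np.array([[1, 2, 3]])      # a list of a list: shape (1, 3), one row
z = np.array([[1], [2], [3]])  # shape (3, 1): one value per row, the "boxes"

print(x.shape, y.shape, z.shape)   # (3,) (1, 3) (3, 1)
```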
[00:24:03] So it's the same sort of elements, but this will be (1, 3), because it's showing that there's one outer list, which gives the rows, and then one inner list holding each of those values. [00:24:16] The benefit will come with what's coming a little bit later, which is broadcasting: it essentially helps you determine which dimensions you want to match against. Sometimes you'd want to have a (1, 3), like [1, 2, 3], applied only to rows in some other matrix (we'll come to that a little bit later), but sometimes you might want it applied only to columns. [00:24:38] So if I have a separate matrix of, for example, all zeros, and I want the resulting matrix to be, for example, 1, 2, 3 repeated along the rows, let me actually draw this out; it might be
easier. [00:24:50] So let's say I have a matrix of zeros, and I want to produce either a matrix where [1, 2, 3] repeats along each row, versus one where [1, 2, 3] repeats down each column. The difference in how to generate these two will be a difference in shape, in how you represent their shape: it's the same 1, 2, 3, but the resulting array you're generating by repeating the 1, 2, 3 values requires a different shape. We'll come to that a little bit later, because this process of generating these arrays is called broadcasting, but that's the real benefit of understanding the shapes: the 1, 2, 3 values are the same; it's just how they're used with regard to other arrays.
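The two repeated-1-2-3 matrices described above come out of broadcasting like this; the `reshape` calls are one way (among several) to get the two shapes:

```python
import numpy as np

zeros = np.zeros((3, 3))
row = np.array([1, 2, 3]).reshape(1, 3)   # shape (1, 3)
col = np.array([1, 2, 3]).reshape(3, 1)   # shape (3, 1)

# Same 1, 2, 3 values; the shape decides how they repeat:
by_rows = zeros + row   # each row becomes [1, 2, 3]
by_cols = zeros + col   # each column becomes [1, 2, 3]
```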
[00:25:41] All right, so, yeah, vectors, as I talked about earlier, can usually be represented with n-by-1 or 1-by-n dimensions, and they can result in this different behavior, kind of like what I talked about. Matrices are usually in two dimensions, represented as m by n. These are just two examples: if, for example, I generate, let's say, an `arange` and also reshape it... so I start with, for example, this array, which is a list of ten (oh, sorry, it's important to run the import quickly). [00:26:04] So I start off with this array `a`, which is basically a one-dimensional list of ten values. I can reshape it into a five-by-two matrix; you just have to make sure that your dimensions match, which means that you can multiply them together and get the original size. So if I start off with the array of ten, I can make a two-by-five matrix, I can make a five-by-two matrix, I can make a ten-by-one or a one-by-ten; I can't make it, for example, three by five, because
it wouldn't fit into the original size. For that, this operation called `reshape` is really useful. [00:26:34] You might be wondering why there are two parentheses: the way that `reshape` works is that it takes in a tuple. Remember what I was talking about earlier with tuples: they're immutable objects, and they're defined by parentheses. So the outer parentheses represent what you're inputting to the function, and what you're inputting is a tuple, so it uses a second set of parentheses.
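A sketch of the reshape behavior just described, using an `arange` of ten values:

```python
import numpy as np

a = np.arange(10)        # ten values: 0 through 9, shape (10,)
b = a.reshape((5, 2))    # valid: 5 * 2 == 10; note the tuple argument
c = a.reshape((2, 5))    # valid: 2 * 5 == 10
# a.reshape((3, 5)) would raise ValueError: 3 * 5 != 10,
# so it wouldn't fit into the original size.
```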
[00:26:52] So now let's go to some array operations. I started off with this array `x`. When you apply simple operations, for example a max operation, sometimes you might want the max of the entire array. So if I take the max of the entire array, what's the max value of the entire thing? Six, right. So if I just do `np.max` of `x`, it'll return one value, six. [00:27:16] But let's say I want the max of every row: in each of these rows, say I want the max of each, so I want 2, then 4, then 6. How do you do that? Well, NumPy usually has, in most of its functions, an `axis` variable, and what the axis variable does is tell it which of these dimensions you want to take the max over. [00:27:37] The way to think about it, and this is going to be a little bit tricky, but the way people describe it, is that the axis is what you want to apply your function over, what you want to reduce over. What that means is: if I print out the shape of the original array, it's three by two. I want to apply axis 1, or, remembering that NumPy is zero-indexed, the axes are 0 and 1; so I want to apply the max
over the second dimension. [00:28:04] The second dimension means that, for each of these... you know that the row dimension is the first dimension, so it's not along the rows; I'm going to be comparing columns: compare this entire column to this entire column. [00:28:19] So just remember, for axes, axis 0 usually refers to the row axis and axis 1 refers to the column axis. If you don't even want to remember that, you can just remember, from the original shape, which dimension it's referring to, and that's the dimension you want to compare over, or reduce over. [00:28:35] It can be a little bit hard to wrap your head around; usually the best way to get comfortable is just to play with a bunch of operations like min and max. But just remember: the axis is what you want to compare over, not the resulting thing. So axis 1 here means the column axis.
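The axis behavior described above, on the same three-by-two example:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

print(np.max(x))           # 6: max of the whole array
print(np.max(x, axis=1))   # [2 4 6]: reduce over columns, one max per row
print(np.max(x, axis=0))   # [5 6]: reduce over rows, one max per column
```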
column I [00:28:51] thing so axis one means here column I want to compare between the columns I [00:28:52] want to compare between the columns I want to get for example comparing one to [00:28:54] want to get for example comparing one to two three to four five to six [00:28:57] two three to four five to six does that make sense [00:29:00] does that make sense okay [00:29:01] okay and what this will do is if I just do [00:29:04] and what this will do is if I just do numpy.axis it'll just return basically [00:29:06] numpy.axis it'll just return basically since I'm comparing these columns it'll [00:29:08] since I'm comparing these columns it'll just return a resultant column and so as [00:29:10] just return a resultant column and so as I mentioned you know um for over the [00:29:12] I mentioned you know um for over the axis one you get three values because [00:29:14] axis one you get three values because you're comparing over these columns and [00:29:16] you're comparing over these columns and each column has three values I'm [00:29:18] each column has three values I'm comparing over rows as you mentioned I [00:29:19] comparing over rows as you mentioned I get two values right [00:29:21] get two values right um and so this will just be the Tuple [00:29:22] um and so this will just be the Tuple comma which is just indicating that it's [00:29:24] comma which is just indicating that it's just a list it's not a list of lists [00:29:26] just a list it's not a list of lists it's just a list but let's say I want a [00:29:28] it's just a list but let's say I want a list of lists you know maybe I want to [00:29:29] list of lists you know maybe I want to do those operations I talked about [00:29:30] do those operations I talked about earlier [00:29:31] earlier um instead of reshaping which is always [00:29:33] um instead of reshaping which is always there it's always an option you can also [00:29:35] there it's always an option you can also use this um feature called keep dimms 
[00:29:38] And what that'll do is take the original number of dimensions, which is two (because the shape is (3, 2), just two dimensions), and keep it consistent, so the result will be (3, 1). It just means that instead of returning just the extracted column, which is just a list, it'll keep the column in the context of the original `x`; it'll keep it as a two-dimensional value. [00:30:06] All right, now here are just some operations. In NumPy you can use an asterisk as an element-wise multiplication: an asterisk means that I'm going to be multiplying every single value by every single corresponding value in another matrix, and you need your matrices to be the same size for this one. It's element-wise, not a matrix multiplication, so you need them to be the exact same size.
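A sketch of `keepdims` and element-wise `*` on the same array; the all-threes matrix mirrors the lecture's example:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

flat = np.max(x, axis=1)                   # shape (3,): just a list
kept = np.max(x, axis=1, keepdims=True)    # shape (3, 1): still two-dimensional

threes = np.full((3, 2), 3)
elementwise = x * threes                   # * is element-wise; shapes must match
```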
[00:30:29] So this will multiply, for example, 1 by 3, 2 by 3, 3 by 3, and 4 by 3. All right. [00:30:37] You can also do matrix multiplication, which is a different operation entirely. For those of you unfamiliar with matrix multiplication, you would basically be multiplying a row of one matrix with a column of another matrix, and for that to work you need the second dimension of the first array to be equal to the first dimension of the second array. So for matrix multiplication, if I have matrices shaped (a, b) and (c, d), then b and c have to be equal; just something to keep in mind, because oftentimes when you're doing matrix multiplication you have to make sure these dimensions match, which means that, for example, this is a valid operation, but this can sometimes throw
an error sometimes. [00:31:34] So it's just important to make sure that these are exactly equal; you can actually just print out the shapes and check that they're equal before doing matrix multiplication. [00:31:43] And then for matrix multiplication there are a couple of functions you can use. The first one is just `np.matmul`, which is NumPy's matrix multiplication; you can also just use the `@` operator. Both of those are overloaded, so you can choose whichever one; they'll result in the exact same operation. [00:32:00] And just a quick example to show what this will do: it'll take the row 1, 2 of the first matrix against a column of threes in the second, so it'll do 1 times 3 plus 2 times 3 and add those two values; that's what matrix multiplication will do.
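A sketch of the two equivalent matrix-multiplication spellings; the all-threes second matrix follows the example above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.full((2, 2), 3)        # all threes

# np.matmul and the @ operator are the same overloaded operation:
m1 = np.matmul(a, b)
m2 = a @ b
# Top-left entry: row [1, 2] times column [3, 3], summed: 1*3 + 2*3 = 9.
```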
it operates on [00:32:21] vectors, and a vector, as I mentioned, is just like a one-dimensional matrix, so it's basically three by one or four by one, for example. [00:32:29] It'll element-wise multiply between two different vectors and sum up those values. So here, what a dot product would do would be one times one plus two times ten plus three times a hundred, and in NumPy you can just do np.dot with both of those vectors. [00:32:45] This one is just an aside on how you would want the structure of the dot product to be for arrays with more dimensions. [00:32:57] For single-dimensional vectors this operation works directly; any time it's a multi-dimensional matrix, then the np.dot function treats it as a matrix multiplication. So for a two by two matrix dotted with a two by two matrix, it's not going to return the sum; it's going to
return the matrix multiplication. [00:33:17] So that's just something to keep in mind. If you want to make sure that your dot product is happening in the correct way, you would want to check, similar to what I was talking about earlier. [00:33:32] Here, this is I think the best way to show it. [00:33:36] You would want, like I mentioned, the last dimension of the first one to match with the first dimension of the next one, because it's treating it as a matrix multiplication. [00:33:48] Here the error it's throwing is that it's three comma two combined with three, and the way to fix that would be, for example, [00:33:58] to switch the two, so you have two comma three and then three comma something. [00:34:04] It's really a dimension-matching thing at this point, so it can be a little bit confusing, but the main thing
to keep in mind is: [00:34:11] for single-dimensional vectors you can just do np.dot directly and it'll give you the dot product value; for higher-dimensional matrices it treats it as a matrix multiplication. [00:34:20] And so for those higher-dimensional values, to ensure that you're getting a dot product you'd have to make sure that the dimensions are aligned, similar to these. [00:34:30] So anything that's two by two and up on both sides, any matrix that doesn't have a singleton dimension, yes, it would treat it as a matrix multiplication, the same thing as matmul. [00:34:42] Okay. [00:34:45] All right, I'm going to move on to indexing. So similar to what I was talking about earlier, remember with lists I was saying if you just do the colon it'll create the same array? Same deal here: the colon just means that you take everything from the original array. In fact it
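The np.dot behavior just summarized can be sketched like this; the vectors 1, 2, 3 and 1, 10, 100 are the ones from the example, while the 2-by-2 matrices are illustrative:

```python
import numpy as np

# For 1-D vectors, np.dot returns the scalar dot product:
# 1*1 + 2*10 + 3*100 = 321
v = np.array([1, 2, 3])
w = np.array([1, 10, 100])
s = np.dot(v, w)
print(s)  # 321

# For 2-D arrays, np.dot does NOT sum down to a scalar;
# it performs matrix multiplication instead, same as @.
A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 0], [0, 1]])  # identity, so the product equals A
M = np.dot(A, B)
print(M.shape)  # (2, 2)
```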
returns a [00:34:59] copy for a Python list. One note here: for a NumPy array, a basic slice like x[:] is actually a view rather than a deep copy, so if you want a completely separate copy in memory you'd call .copy(). [00:35:04] Okay, now I'm going into more detail about how you want to index quickly. So if I, for example, have, let's say, this three by four matrix and I only want to select the zeroth and the second rows, how would I do that? [00:35:18] What's useful is that in NumPy you can treat different dimensions differently for indexing. A colon means you select everything in that dimension; for example, here there's a colon in the second dimension, which means I'm taking all of the column values. [00:35:33] Versus what's in the first dimension here: it's a NumPy array of zero and two, so it's saying only the zero index and only the two index, which means only the zeroth row and only the second row. So what this would look like would be
something like this: [00:35:50] I have a matrix, [00:35:55] and I only want to select the zeroth row and the second row, and everything in the columns. [00:36:09] And then similarly, for example, if I want to select in the column dimension, say only the first and second columns, I can do that the same way. [00:36:17] So you can basically treat the dimensions separately: you can think how many rows do I want, how many columns do I want, and then index them separately, and that goes for as many dimensions as you have in your entire tensor. [00:36:28] Some nice things also: let me print that X here, I'll just generate the X. [00:36:37] Okay, so this is X. If I want to take all the values of X that are above 0.5, for example, I can do that by using what's called Boolean indexing. So I just basically
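The row selection just described can be sketched as follows; the 3-by-4 matrix of values 0 through 11 is illustrative:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # a 3x4 matrix

# First dimension: an array of the row indices we want (0 and 2).
# Second dimension: a colon, meaning "all columns".
rows = X[np.array([0, 2]), :]
print(rows)
# [[ 0  1  2  3]
#  [ 8  9 10 11]]

# The same idea works per dimension: all rows, columns 1 and 2 only.
cols = X[:, np.array([1, 2])]
print(cols.shape)  # (3, 2)
```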
would say x indexed by [00:36:50] everything in X that's bigger than 0.5. So it's pretty direct, and it'll just output all the values in this entire array that are bigger than 0.5. [00:37:00] All right, this one is another way to do reshaping. I kind of mentioned earlier, sometimes you'll have this list of three elements and you want to reshape it to a three by one array, for example. You can also use what's called np.newaxis; this will essentially add another axis [00:37:18] in whatever dimension you want. So if I want to go from this three by four array to a three by four by one, then I can just add an np.newaxis there. [00:37:31] An even simpler way to think about it would be going from a shape of two to a shape of two comma one. So it's just another way to do what would essentially be the reshape operation. [00:37:45] Does that make sense? Also, what this would look like, for example, let me just
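The Boolean indexing described at the start of this segment can be sketched like this; the random 3-by-4 matrix and the fixed seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3, 4))  # random values in [0, 1)

# Boolean indexing: X > 0.5 is a 3x4 array of True/False,
# and indexing with it pulls out the matching values,
# flattened into a 1-D array.
big = X[X > 0.5]
print(big)
```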
[00:37:48] make it a little bit more concrete. [00:37:58] So as we see, I have this list, a singular list, and then inside that list I have a list of lists: a list with element one and a list with element two. So this is what that reshape operation will do, [00:38:10] and what np.newaxis will enable you to do as well. [00:38:15] All right. [00:38:17] I think we're at a good point in time. So the last main topic we'll be covering is broadcasting. [00:38:24] What's really great about broadcasting is that it allows you to operate with NumPy arrays that are of different shapes, where for many operations one of the arrays can be repeated; it allows for that in a very efficient manner. This is actually one of the most useful things about NumPy, and one of its defining features. [00:38:44] And what that means is: if, for example, in this case, we go back to
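The np.newaxis trick described above can be sketched as follows; the array shapes are illustrative:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # shape (3, 4)

# np.newaxis inserts a fresh axis of length 1 wherever it appears,
# here turning (3, 4) into (3, 4, 1).
X3 = X[:, :, np.newaxis]
print(X3.shape)  # (3, 4, 1)

# Equivalent to a reshape: a flat (3,) vector becomes a (3, 1) column.
v = np.array([1, 2, 3])
col = v[:, np.newaxis]
same = v.reshape(3, 1)
print(np.array_equal(col, same))  # True
```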
this example that I had, [00:38:51] where I start off with the zero zero zero array: how do I generate this array versus how do I generate this array? [00:38:56] Instead of me saying, okay, element zero zero plus one, element zero one plus two, and all that stuff, instead of doing it one by one, what broadcasting allows me to do is have only one vector of size one two three, [00:39:14] and depending on how I do the broadcasting, which I'll come to in a second, I can duplicate it along the row dimension or I can duplicate it along the column dimension, and NumPy allows for that; it'll do that on its own in the back end. [00:39:27] And so that's really what broadcasting means: I don't need to, for example, create a new array to begin with which is already duplicated like this and then add the two together; I can just duplicate this and get this. [00:39:41] All right, so now some rules for
broadcasting, and let me just quickly [00:39:44] visually show what broadcasting will do. Oh, sorry. [00:39:50] So broadcasting: this is a pretty good visual analogy. [00:39:54] I had this one comma two comma three vector, and I want to basically add it to, let's say, only the columns of this array. What broadcasting allows you to do is that you only pass these two values in, and on the back end it'll duplicate this along the column dimension, so let's say I have one two three, one two three, one two three, and then it'll do the addition. [00:40:18] Similarly, if I pass it a vector one comma two comma three comma four and I want it to be added to each of the rows instead of each of the columns, it'll be able to do that by duplicating it on the back end. So this is visually what's happening with broadcasting. [00:40:32] All right. [00:40:34] Now some rules: so
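The two duplication directions just described can be sketched like this; the 3-by-3 array of zeros is illustrative:

```python
import numpy as np

Z = np.zeros((3, 3))

row = np.array([1, 2, 3])        # shape (3,): broadcast down the rows
col = np.array([[1], [2], [3]])  # shape (3, 1): broadcast across the columns

A = Z + row  # every row becomes [1, 2, 3]
B = Z + col  # every column becomes [1, 2, 3] going down

print(A)
print(B)
```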
how does NumPy know [00:40:37] when and how to do broadcasting? The main rule to keep in mind for broadcasting is: [00:40:44] it can only happen if all of the dimensions, every single dimension between the two arrays, are compatible. And what does compatible mean? Either the dimension values are equal, or one of them is equal to one. That is the only rule required. [00:40:57] So for example, I start off with this x array, a three by four x array. [00:41:05] Will y of shape three comma one be compatible? [00:41:09] Yes, it will be. Why? Because you have three in the first dimension of both, which is the same, and in the second dimension you have four and you have one, so those are compatible values. [00:41:18] And so if I'm doing, for example, an addition operation x plus y, what this tells NumPy on the back end is: it knows that three and three are the same, but four and one
are not the same; one of [00:41:29] them has dimension one, so it needs to duplicate this y along the second dimension, which means duplicating it along the column dimension. [00:41:37] And once it duplicates it, it'll get a three comma four array, and then it can do the addition, and it does that really fast, so it's better to use broadcasting this way than for you to create a separate, already-duplicated array and then add them. [00:41:51] Similarly, I have this z array, which is one comma four. [00:41:55] What x times z will do is first check: okay, three versus one, is that compatible? Yes, because you have three in one dimension and one in the other; and four and four are compatible. So it knows that these two are compatible, it isn't going to change anything in the second dimension, and in the first dimension it'll know to duplicate z, basically, [00:42:13] in order to duplicate z and
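The compatibility check described here can be sketched directly; the shapes (3, 4), (3, 1), and (1, 4) are the ones from the example, and the values are illustrative:

```python
import numpy as np

x = np.ones((3, 4))
y = np.arange(3).reshape(3, 1)  # (3, 1): duplicated along the columns
z = np.arange(4).reshape(1, 4)  # (1, 4): duplicated along the rows

# Each dimension is either equal or 1, so both broadcasts succeed
# and the result has the full (3, 4) shape.
print((x + y).shape)  # (3, 4)
print((x * z).shape)  # (3, 4)

# Incompatible shapes raise an error instead of broadcasting:
w = np.ones((3, 2))
try:
    x + w
except ValueError:
    print("shapes (3, 4) and (3, 2) are not broadcastable")
```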
so add it [00:42:16] three times in the row dimension, create that array on the back end, and then multiply the two. [00:42:22] So this is giving you an example: I started off with x, I have y, and the final shape will be three comma four. [00:42:31] A lot of times in deep learning you will have the same situation, because you'll have batches of different images coming in, but you want to apply, let's say, the same weight matrix to all of them. Instead of duplicating that weight matrix a hundred times, or depending on your batch size potentially a thousand times, and then adding everything together, you use the same matrix, and NumPy will know: okay, I'm going to be duplicating over the batch dimension, and it'll do that for you on the back end. So broadcasting is used a lot in deep learning because of this. [00:43:00] And basically, in your second homework that's what you'll be doing: implementing a feed
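The batch pattern described above can be sketched like this; the batch size of 100 and the feature sizes 8 and 4 are made-up numbers, and this is not the actual homework code:

```python
import numpy as np

rng = np.random.default_rng(1)

batch = rng.random((100, 8))  # 100 examples, 8 features each
W = rng.random((8, 4))        # one weight matrix shared by the whole batch
b = np.zeros(4)               # one bias vector, shape (4,)

# batch @ W has shape (100, 4); adding b broadcasts it to every
# example without ever materializing 100 copies of the bias.
out = batch @ W + b
print(out.shape)  # (100, 4)
```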
[00:43:03] forward network in NumPy. It'll say you have this W matrix and this b matrix, which is a bias; we'll come to those in class, and it'll ask you to implement them in NumPy, because that's basically what you're doing: you have this input image, you have a weight matrix which will somehow scale it to an output, and that weight matrix will be applied to multiple images in your batch. Those images can be different, but their sizes will be the same, and broadcasting is optimized for that. [00:43:29] Okay. [00:43:30] So these are just more examples of the same thing: the final shape you'll be coming to is a size of three comma four. [00:43:37] Let's see, this one's the example that I showed right here, which is that I have this array of, let's say, zeros, and I have this b array. What size would this be? Yes, good, because you have one
outer list, and inside this you have [00:43:52] one inner list, so it's basically one row and then three values inside. [00:43:57] And so, would this be compatible? Yes. And so it'll know to duplicate over the row dimension, so you're going to get duplicates in the row dimension: you're going to get one two three, one two three, one two three, and that's what's happening here. [00:44:10] These next ones are what it sometimes calls more complex [00:44:14] behavior. What this basically means is that if I have this b vector, which is three comma one, [00:44:20] I can do b plus b dot transpose. The transpose is just switching the dimensions: if I have a two by three matrix, its transpose will be a three by two matrix. [00:44:30] What that means visually is that your row and column dimensions get switched: [00:44:38] one through six goes to, I believe,
something like one two [00:44:42] three four five six, so three rows versus three columns. [00:44:49] And what this is just saying is that a three by one and a one by three, [00:44:54] both of those vectors, will be compatible, because remember, in each dimension it's either the same or one, so it knows to duplicate over both of those dimensions, and that's what's happening here. [00:45:06] Okay, so I think we are right at time. [00:45:10] What I would recommend is basically playing with variations of this for broadcasting, and remember the two rules for broadcasting: it's compatible if each dimension is either the same value or one, and whichever one has the dimension of one is what's going to be duplicated over on the back end. [00:45:24] So yeah, it's not going to be compatible if the dimensions are merely divisible, for example; if you have, let's say, six and three, that's not compatible, [00:45:31] but
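The b plus b-transpose case just described can be sketched as follows; the column vector of 1, 2, 3 is illustrative:

```python
import numpy as np

b = np.array([[1], [2], [3]])  # shape (3, 1)

# (3, 1) + (1, 3): both dimensions are either equal or 1,
# so each side is duplicated and the result is (3, 3),
# with entry (i, j) equal to b[i] + b[j].
S = b + b.T
print(S)
# [[2 3 4]
#  [3 4 5]
#  [4 5 6]]
```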
you can reshape it. [00:45:35] If you'd like to have a one in there, there are tricks you can use, [00:45:37] where you're thinking about how you want this data to be multiplied on the back end: you can maybe reshape everything into, say, a one by eighteen matrix, multiply everything, and then reshape it back. That's what you can do, but you can never just directly make, for example, six by three compatible. [00:45:51] Okay. [00:45:52] So I think let's wrap up; this one's just a quick example of another use of efficient NumPy code. [00:45:58] A quick note: preferably don't use loops whenever you're dealing with large data matrices, mostly because loops are almost always about a hundred times slower; NumPy is usually very, very efficient. This is just an example of what you can accomplish with NumPy versus the same thing using loops. What this is saying is that I have an X matrix of size thousand by thousand,
and I want to [00:46:22] apply, let's say, an addition of five to everything from row 100 onwards. [00:46:27] So visually, what that will look like is: I have this full matrix, and I want everything here to be added with plus five. [00:46:40] In the loop format, I can basically loop over the first dimension from 100 onward and do that. Or in NumPy, I can do what's called np.arange, which will generate integers, like we see: one, two, three, four, five, six, and so on. In this case it's between a hundred and a thousand, so it'll be a hundred, a hundred and one, a hundred and two, all the way to a thousand in the first dimension, and then I just add that with five. [00:47:03] So this is just an example of how you would switch from using loops to using NumPy, and it's a lot, lot faster. ================================================================================ LECTURE 022
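A sketch of the loop-versus-NumPy comparison described here, using the same 1000 by 1000 matrix and the np.arange indexing from the transcript:

```python
import numpy as np

X = np.random.default_rng(0).random((1000, 1000))

# Loop version: add 5 to every element from row 100 onward, row by row.
X_loop = X.copy()
for i in range(100, 1000):
    X_loop[i] += 5

# NumPy version: index rows 100..999 in one shot and add 5.
# (A plain slice X_fast[100:] += 5 would work just as well.)
X_fast = X.copy()
X_fast[np.arange(100, 1000)] += 5

print(np.allclose(X_loop, X_fast))  # True: same result, far fewer Python steps
```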
================================================================================ Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul Source: https://www.youtube.com/watch?v=Uv0AIRr3ptg --- Transcript [00:00:05] And so today I kind of just want to cover the fundamentals of PyTorch. Really, just kind of see what the similarities are between PyTorch and numpy and Python, which you guys are used to at this point, and see how we can build up a lot of the building blocks that we'll need in order to define more complex models. So specifically, we're going to talk today about tensors: what are tensor objects, and how do we manipulate them; what is autograd, and how PyTorch helps us compute different gradients; and finally, how we actually do optimization and how we write the training loop for our neural networks. And if we have time at the end, then we'll try and go through a bit of a demo to kind of put everything together and see how everything comes together when you want to solve an actual NLP task.
[00:00:56] All right, so let's get started. So if you go to the course website, there's a notebook, and you can just make a copy of this Colab notebook and then just run the cells as we go. And so to start: today we're talking about PyTorch. Like I said, it's a deep learning framework that really does two main things. One is it makes it very easy to author and manipulate tensors and make use of your GPU, so that you can actually leverage a lot of that capability. And two is it makes the process of authoring neural networks much simpler: you can now use different building blocks, like linear layers and different loss functions, and compose them in different ways in order to author the types of models that you need for your specific use cases. And so PyTorch is one of the two main frameworks, along with TensorFlow. In this class we'll focus on PyTorch, but they're quite similar.
[00:01:52] And so we'll start by importing torch, and we'll import the neural network module, which is torch.nn. And for this first part of the tutorial, I want to talk a bit about tensors. One thing that you guys are all familiar with now is numpy arrays, and pretty much you can think about tensors as the equivalent in PyTorch to numpy arrays. They're essentially multi-dimensional arrays that you can manipulate in different ways, and you'll essentially use them to represent your data, to be able to actually manipulate it and perform all the different matrix operations that underlie your neural network. And so in this case, for example, if we're thinking of an image, one way you can think about it in terms of a tensor is that it's a 256 by 256 tensor, where it has a width of 256 pixels and a height of 256 pixels. And for instance, if we have a batch of images and those images contain three channels, like red, green, and blue, then we might have a four-dimensional tensor, which is the batch size by the number of channels by the width and the height.
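A small sketch of the shapes being described (the batch size of 32 here is made up; the 256 by 256 and three-channel shapes follow the lecture's example):

```python
import torch

img = torch.zeros(256, 256)           # a single image: width x height
batch = torch.zeros(32, 3, 256, 256)  # a batch: batch size x channels x width x height

print(img.shape)
print(batch.shape)
```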
[00:03:06] And so everything we're going to see today is all going to be represented as tensors, which you can just think of as multi-dimensional arrays. And so to kind of get some intuition about this, we're going to spend a little bit of time going through, essentially, lists of lists, and how we can convert them into tensors and how we can manipulate them with different operations. So to start off with, we just have a simple list of lists that you're all familiar with; in this case it's a two by three list. And now we want to create a tensor, and so here the way we'll create this tensor is by doing torch.tensor and then essentially writing the same syntax that we had before: just write out the list of lists that represents that particular tensor.
[00:03:56] And so in this case we get back a tensor object, which is the same shape and contains the same data. And so now, the second thing with the tensor is that it contains a data type. So there are different data types: for instance, there are floating point numbers at varying levels of precision that you can use, you can have integers, you can have different data types that actually populate your tensor. And so by default I believe this will be float32, but you can explicitly specify which data type your tensor is by passing in the dtype argument. And so we see here now, even though we, you know, wrote in a bunch of integers, they have a decimal point, which indicates that they're floating point numbers. And so, same thing here, we could create another tensor, in this case with data type float32. And in this third example, you see that we create another tensor; we don't actually specify the data type, but PyTorch essentially implicitly takes the data type to be floating point, since we actually passed a floating point number into this tensor.
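A sketch of the dtype behavior being described. One small correction to the hedge above: with all-integer input, torch.tensor actually infers an integer dtype (int64), not float32; you get floats either by passing dtype explicitly or by including a floating point number in the data:

```python
import torch

data = [[1, 2, 3], [4, 5, 6]]  # the 2x3 list of lists from the example

t = torch.tensor(data)                             # dtype inferred from the data: torch.int64
t_float = torch.tensor(data, dtype=torch.float32)  # dtype specified explicitly
t_implied = torch.tensor([[1, 2, 3.5]])            # a float in the data implies torch.float32

print(t.dtype, t_float.dtype, t_implied.dtype)
```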
[00:05:06] So pretty much, at a high level, tensors are like multi-dimensional arrays: we can specify the data type for them, and we can populate them just like numpy arrays. Okay, so now, great: we know how to create tensors, and we know that ultimately everything that we work with, all the data we have, is going to be expressed as tensors. Now the question is, what are the functions that we have to manipulate them? And so we have some basic utilities that can help us instantiate tensors easily, specifically torch.zeros and torch.ones. These are two ways to create tensors of a particular shape, in this case tensors of all zeros or tensors of all ones, and you'll see that this will be very helpful when you do your homeworks.
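These helpers can be sketched as follows (torch.arange, which is covered just below, is included for completeness; the shapes are made up):

```python
import torch

z = torch.zeros(2, 3)  # 2x3 tensor of all zeros
o = torch.ones(2, 3)   # 2x3 tensor of all ones

r = torch.arange(1, 11)  # tensor([1, 2, ..., 10]), like Python's range
r2 = r.reshape(2, 5)     # rows [1..5] and [6..10], as described below

print(z)
print(r2)
```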
[00:05:51] Typically, you'll just need to create a bunch of zero matrices, and it'll be very easy to just specify the shape here without having to write everything out super explicitly, and then you can update that tensor as needed. Another thing you can do: just like we have ranges in Python, where if you want to loop over a bunch of numbers you can specify a range, you can also use torch.arange to actually instantiate a tensor with a particular range. In this case we just looped over the numbers one through ten; you could reshape this and make it one through five and then six through ten. That's another way to instantiate tensors. And finally, something to note is that when we apply particular operations, such as simple Python operations like addition or multiplication, by default they're going to be element-wise, so they'll apply to all the elements in our tensor.
[00:06:52] So in this case we took our tensor, I think this one was probably from earlier above, and we added two everywhere; here we've multiplied everything by two. But pretty much the PyTorch semantics for broadcasting work the same as the numpy semantics. So if you have different matrix operations where you need to batch across a particular dimension, PyTorch will be smart about it, and it will actually make sure that you broadcast over the appropriate dimensions, although of course you have to make sure that the shapes are compatible based on the actual broadcasting rules. So we'll get to that in a little bit when we look at reshaping and how different operations have those semantics.
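A sketch of element-wise operations and numpy-style broadcasting (the tensor values here are made up):

```python
import torch

t = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

print(t + 2)  # element-wise: adds 2 everywhere
print(t * 2)  # element-wise: multiplies everything by 2

# Broadcasting: a (3,) tensor is stretched across each row of the (2, 3) tensor
row = torch.tensor([10., 20., 30.])
print(t + row)
```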
[00:07:45] I guess I'm not personally aware of how you would define kind of a jagged tensor that has unequal dimensions, but typically we don't want to do that, because it makes our computation a lot more complex. And so in cases where, for instance, we have different sentences that we turn into tokens, we might have different length sentences in our training set, and we'll actually pad all the dimensions to be the same, because ultimately we want to do everything with matrix operations, and in order to do that we need to have a matrix of a fixed shape. But yeah, that's a good point. I'm not sure if there is a way to do that, but typically we just get around this by padding. Okay, so now we know how to define tensors; we can do some interesting things with them. So here we've created two tensors: one of them is a three by two tensor, and the other one is a two by four tensor.
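The pad-to-fixed-shape workaround described above can be sketched like this (the token ids and the pad id 0 are made up; PyTorch also provides torch.nn.utils.rnn.pad_sequence for the same job):

```python
import torch

# Hypothetical tokenized sentences of unequal length
seqs = [[5, 3, 9], [7, 1], [2, 8, 4, 6]]

# Pad every sequence with 0 up to the longest length,
# so the whole batch fits in one fixed-shape matrix
max_len = max(len(s) for s in seqs)
padded = torch.tensor([s + [0] * (max_len - len(s)) for s in seqs])

print(padded.shape)
print(padded)
```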
[00:08:40] And I think the answer is written up here, but what do we expect the shape to be when we multiply these two tensors? So we have a three by two tensor and a two by four tensor... Yeah, three by four. And so, more generally, we can use matmul in order to do matrix multiplication. It also implements batched matrix multiplication, and I won't go over the entire review of broadcasting semantics, but the main gist is that the dimensions of two tensors are compatible if you can left-pad the tensors with ones so that the dimensions that line up either (a) have the same number in that dimension, or (b) one of them is a dummy dimension: one of them has a one. And in that case, in those dummy dimensions, PyTorch will actually make sure to copy over the tensor as many times as needed so that you can then actually perform the operation.
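The shape question above can be checked directly (using all-ones tensors as stand-ins for the lecture's values):

```python
import torch

a = torch.ones(3, 2)
b = torch.ones(2, 4)

c = torch.matmul(a, b)  # (3, 2) x (2, 4) -> (3, 4)
d = a @ b               # @ is shorthand for matmul

print(c.shape)
```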
[00:09:41] And that's useful when you want to do things like batched dot products or batched matrix multiplications. And I guess the final point here is that there's also a shorthand notation that you can use: instead of having to type out matmul every time, you can just use the @ operator, similar to numpy. Effectively, that's where we get into how batching works. So for example, say you had two tensors that have some batch dimension, and then one of them is m by one and the other one is one by n. If you do a batched matrix multiply of those two tensors, what you effectively do is preserve the batch dimension, and then you're doing a matrix multiplication between an m by one tensor and a one by n tensor, so you get something that's the batch dimension by m by n. I think the full semantics are written out on the PyTorch website for how the matrix multiplication works.
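The batched case in the answer above, sketched with a made-up batch size:

```python
import torch

B, m, n = 8, 5, 7  # made-up sizes
x = torch.randn(B, m, 1)
y = torch.randn(B, 1, n)

# The batch dimension is preserved; each (m, 1) @ (1, n) pair
# gives an (m, n) result, so the output is (B, m, n)
out = x @ y
print(out.shape)
```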
[00:10:45] But you're right: you don't just have these cases where you have two two-dimensional tensors; you can have an arbitrary number of dimensions, and as long as the dimensions match up based on those semantics, then you can multiply them. Alternatively, you can do what I do, which is just multiply them anyway, and then if it throws an error, print out the shapes and work from there; that tends to be faster, in my opinion, a lot of the time. But yeah, that's a good point. All right, so let's keep going through some of the other different functionalities here. So we can define another tensor, and one of the key things that we always want to look at is the shape. So in this case we just have a 1D tensor of length three, so the size, torch.Size, just gives us three. In general, this is one of the key debugging steps, and something that I'll try and emphasize a lot throughout this session: printing the shapes of all of your tensors is probably your best resource when it comes to debugging.
[00:11:48] It's kind of one of the hardest things to intuit exactly what's going on once you start stacking a lot of different operations together, so printing out the shapes at each point and seeing whether they match what you expect is important. And it's better to rely on that than just on the error message that PyTorch gives you, because under the hood PyTorch might implement certain optimizations and actually reshape the underlying tensor you have, so you may not see the numbers you expect. So it's always great to print out the shape. And so, yeah, again, we can always print out the shape, and we can have a more complex, in this case three-dimensional, tensor, which is three by two by four, and we can print out the shape and see all the dimensions here.
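The shape-printing habit being recommended, on the two tensors from the example (the values in the 3x2x4 tensor are placeholders):

```python
import torch

t1 = torch.tensor([1., 2., 3.])  # 1D tensor of length three
t3 = torch.zeros(3, 2, 4)        # the 3x2x4 tensor from the example

print(t1.shape)
print(t3.shape)
```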
[00:12:40] And so now you're like, okay, great, we have tensors and we can look at their shapes, but what do we actually do with them? So now let's get into what operations we can apply to these tensors. And one of them is that it's very easy to reshape tensors. So in this case we're creating this 15-element tensor that's the numbers 1 to 15, and now we're reshaping it, so it's a five by three tensor here. And so you might wonder, well, what's the point of that? It's because a lot of times, when we are doing machine learning, we actually want to learn in batches, and so we might take our data and reshape it, so that instead of being a long, flat list of things, we actually have a set of batches, or in some cases a set of batches of a set of sentences, or sequences, of a particular length, where each of the elements in that sequence has an embedding of a particular dimension.
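The reshape example above as code:

```python
import torch

t = torch.arange(1, 16)  # the numbers 1 to 15
m = t.reshape(5, 3)      # now a 5x3 tensor

print(m)
```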
[00:13:46] And so, based on the types of operations that you're trying to do, you'll sometimes need to reshape those tensors, and sometimes you'll want to transpose dimensions, if you want to, for instance, reorganize your data. So that's another operation to keep in mind. I believe the difference is that view will create a view of the underlying tensor, and so I think the underlying tensor will still have the same shape; reshape will actually modify the tensor. All right, and then finally, like I said at the beginning, your intuition about PyTorch tensors can simply be that they're kind of a nice, easy way to work with numpy arrays, but they have all these great properties: now we can essentially use them with GPUs, it's very optimized, and we can also compute gradients quickly.
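For reference, the documented distinction is slightly different from the hedge above: view returns a tensor that shares storage with the original (and requires contiguous memory), reshape returns a view when possible and otherwise a copy, and neither modifies the original in place. A small sketch, with made-up shapes:

```python
import torch

t = torch.arange(12)

v = t.view(3, 4)     # shares storage with t; t itself keeps shape (12,)
r = t.reshape(3, 4)  # same result here; would copy if t were non-contiguous

v[0, 0] = 99  # writing through the view changes the underlying storage
print(t[0])   # t sees the change

print(v.transpose(0, 1).shape)  # transpose swaps two dimensions: (4, 3)
```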
[00:14:46] And to kind of just emphasize this point: if you have some numpy code and you have a bunch of numpy arrays, you can directly convert them into PyTorch tensors by simply casting them, and you can also take those tensors and convert them back to numpy arrays. All right, and so one of the things you might be asking is: why do we care about tensors, and what makes them good? One of the great things about them is that they support vectorized operations very easily; essentially, we can parallelize a lot of different computations and do them, for instance, across a batch of data all at once. And one of those operations you might want to do, for instance, is a sum. So you can take, in this case, a tensor which is shape five by seven, and... it looks like that's not working.
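The numpy round trip mentioned above, sketched (the array values are made up):

```python
import numpy as np
import torch

arr = np.array([[1., 2.], [3., 4.]])

t = torch.from_numpy(arr)  # numpy -> tensor (shares memory with arr)
t2 = torch.tensor(arr)     # numpy -> tensor (makes a copy)
back = t.numpy()           # tensor -> numpy

print(type(back), back.shape)
```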
[00:15:42] You can take a tensor that's shaped five by seven, and now you can compute different operations on it that essentially collapse the dimensionality. So the first one is sum: you can take it and sum across both the rows as well as the columns. And one way I like to think about this, to keep them straight, is that the dimension that you specify in the sum is the dimension you're collapsing. So in this case, if you take the data and sum over dimension zero, because you know the shape of the underlying tensor is five by seven, you've collapsed the zeroth dimension, so you should be left with something that's just shape seven. And if you look at the actual tensor you got, 75, 80, 85, 90: you get this tensor, which is shape seven.
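The sums being described, sketched with torch.arange(35).reshape(5, 7) (values 0 to 34, which reproduces the 75, 80, 85, 90 seen in the column sums):

```python
import torch

data = torch.arange(35).reshape(5, 7)  # values 0..34, shape (5, 7)

print(data.sum(dim=0))  # collapses dim 0 -> shape (7,): tensor([70, 75, 80, 85, 90, 95, 100])
print(data.sum(dim=1))  # collapses dim 1 -> shape (5,)
print(data.sum())       # no dim given: sums the whole tensor
```

The same pattern applies to other reductions, e.g. `data.float().std(dim=0)` or `data.float().mean(dim=0)` (std and mean need a floating point dtype).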
is shape seven. [00:16:30] Alternatively, you can think about [00:16:32] whether or not you're kind of summing across the rows or across the [00:16:35] columns. [00:16:36] But it's not just sum, it applies to [00:16:39] other operations as well: you can compute standard deviations, you can normalize [00:16:43] your data, you can do other operations which essentially batch across the [00:16:47] entire set of data. [00:16:49] And not only do these apply over one dimension, but here you can see that if [00:16:54] you don't specify any dimensions, then by default the operation actually applies [00:16:58] to the entire tensor, so here we end up [00:17:01] just taking the sum of the entire thing. [00:17:03] So if you think about it, the zeroth dimension is the number of rows; there [00:17:07] are five rows and there are seven columns, so if we sum out the rows, then [00:17:14] we're actually summing across the [00:17:16] columns, [00:17:17] and so now we only have seven values. [00:17:20] But I like to think about it more just in [00:17:22] terms of the
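The sums just discussed, sketched in code (the five-by-seven tensor of 1 through 35 is inferred from the printed sums 75, 80, 85, 90):

```python
import torch

# The five-by-seven tensor implied by the printed sums (values 1 through 35)
data = torch.arange(1, 36).reshape(5, 7)

col_sums = data.sum(dim=0)  # collapse dim 0 -> one sum per column, shape (7,)
row_sums = data.sum(dim=1)  # collapse dim 1 -> one sum per row, shape (5,)
total = data.sum()          # no dim given -> sum over the whole tensor

print(col_sums)  # starts 75, 80, 85, 90, matching the values read out above
```

The rule of thumb from the lecture holds: the `dim` you pass is the dimension that disappears from the result's shape.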
dimensions, to keep it [00:17:23] straight, rather than rows or columns, [00:17:26] because it can get confusing. If you're summing out dimension zero, then [00:17:28] effectively you've taken something which has some shape that's dimension zero by [00:17:32] dimension one, to just whatever is the dimension-one shape, [00:17:36] and then from there you can kind of figure out, okay, which way did I actually [00:17:40] sum, to check if you were right. [00:17:43] numpy implements a lot of this vectorization, [00:17:45] and I believe in the homework that [00:17:48] you have right now, I think part of your job is to vectorize a lot of these things. [00:17:53] So the big advantage with PyTorch is that essentially it's optimized to be [00:17:58] able to take advantage of your GPU. When [00:18:00] we actually start building out neural networks that are bigger, that involve [00:18:04] more computation, we're going to be doing a lot of these matrix multiplication [00:18:07] operations, and it's going to be a lot [00:18:10] better for our processor
if we can make [00:18:12] use of the GPU, and so that's where [00:18:15] PyTorch really comes in handy, [00:18:17] in addition to also defining a lot of [00:18:20] those neural network modules for you, as we'll see later, so that now you don't [00:18:25] need to worry about, for instance, [00:18:26] implementing a basic linear layer and backpropagation from scratch, and also [00:18:31] your optimizer. All of those things will [00:18:34] be built in, and you can just call the respective APIs to make use of them, [00:18:37] whereas in Python and numpy you might [00:18:40] have to do a lot of that coding yourself. [00:18:47] All right, so [00:18:50] we'll keep going. [00:18:54] So this is a quiz, except I think it tells you the answer, so it's not much of [00:18:59] a quiz, [00:19:00] but [00:19:02] pretty much, you know, what would you do [00:19:03] if now I told you, instead of, you know, [00:19:06] summing over this tensor, I want you to [00:19:09] compute the average? [00:19:11] And so there's two different [00:19:12] ways you could compute the average: you [00:19:14] could compute the average across the [00:19:16] rows or across
the columns. [00:19:18] And so essentially [00:19:21] now we kind of get back to this question of, well, which dimension am I actually [00:19:25] going to reduce over? And so here, if we [00:19:27] want to preserve the rows, then we need to actually average over the second [00:19:31] dimension, [00:19:33] um, [00:19:34] really they're numbered zeroth and [00:19:37] first, so the first dimension is what we have to average over, because we want to [00:19:41] preserve the zeroth dimension, [00:19:44] and so that's why for row average you [00:19:46] see the dim equals one, [00:19:48] and for column average the same reasoning is [00:19:50] why you see the dim equals zero. [00:19:53] And so if we run this code, we'll see [00:19:57] kind of what are the shapes that we [00:19:59] expect: if we're taking the average over [00:20:01] rows, then an object that's two by three [00:20:04] should just become an object that's two, [00:20:06] it's just one-dimensional, [00:20:09] almost a vector, you can think of. [00:20:11] And if we are averaging across the [00:20:14] columns, there's three columns, so now our [00:20:16] average should
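The row and column averages being described might be sketched like this (the two-by-three values are made up here; the transcript doesn't show the notebook's actual numbers):

```python
import torch

# A 2x3 tensor; these particular values are assumed for illustration
data = torch.arange(1., 7.).reshape(2, 3)  # [[1., 2., 3.], [4., 5., 6.]]

row_avg = data.mean(dim=1)  # collapse dim 1, preserve rows -> shape (2,)
col_avg = data.mean(dim=0)  # collapse dim 0, preserve columns -> shape (3,)
```

Same collapsing rule as with `sum`: `dim=1` removes the second dimension, leaving one average per row.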
have three values, and so [00:20:19] now we're left with a three, a [00:20:21] one-dimensional tensor of length three. [00:20:25] So yeah, does that kind of make sense, I [00:20:27] guess, this general intuition about [00:20:28] how we deal with shapes and how some of [00:20:31] these operations manipulate shapes? [00:20:33] So now we'll get into indexing. [00:20:35] This can get a little bit tricky, [00:20:38] but [00:20:40] I think you'll find that the semantics [00:20:42] are very similar to numpy. [00:20:44] So one of the things that you can do in [00:20:48] numpy is that you can take these numpy arrays and you can slice across them in [00:20:52] many different ways, you can create [00:20:54] copies of them, [00:20:55] and you can index across particular [00:20:58] dimensions to select out different [00:21:00] elements, different rows, or different [00:21:02] columns. [00:21:03] And so in this case, let's take this [00:21:05] example tensor which is [00:21:07] three by two by two, [00:21:10] and [00:21:12] the first thing you'll always want to do [00:21:13] when you have a new tensor is print out its [00:21:15] shape, understand what
you're working with. [00:21:18] And so, [00:21:20] I guess, uh, [00:21:23] I may have shown this already, but what [00:21:25] will x[0] print out, what [00:21:28] happens if we index into just the first [00:21:29] element, [00:21:31] what's the shape of this? [00:21:35] Yeah, two by two, right, because if you [00:21:38] think about it, our tensor is really just [00:21:41] a list of three things, and each of those [00:21:43] things happens to also be a two by two [00:21:45] tensor, so we get a two by two object, in [00:21:48] this case the first thing, one two three [00:21:50] four. [00:21:52] And so, just like numpy, if you provide a [00:21:55] colon in a particular dimension, it means [00:21:57] essentially keep everything along that dimension, [00:22:00] so if we do x[0], implicitly [00:22:03] we're essentially putting a colon for [00:22:05] all the other dimensions, so it's [00:22:07] essentially saying grab the first thing [00:22:10] along the zeroth dimension and then grab [00:22:13] everything along the other two [00:22:14] dimensions. [00:22:15] If we now [00:22:18] take, uh, just the zeroth [00:22:21] element along the first dimension, [00:22:24] um, [00:22:24] what are
we going to get? Well, [00:22:28] ultimately we're going to get, now if you [00:22:31] look, uh, the kind of first dimension was [00:22:34] these three things, the second dimension [00:22:36] is now each of these two rows within [00:22:39] those things, so like one two and three [00:22:41] four, five six and seven eight, [00:22:43] nine ten and eleven twelve. So if we index into the [00:22:48] first dimension [00:22:50] and get the zeroth element, [00:22:52] then we're going to end up with one two, [00:22:54] five six, and nine ten. [00:22:58] And [00:22:59] even if that's a little bit tricky, you [00:23:01] can kind of go back to the trick I [00:23:03] mentioned before, where we're slicing [00:23:06] across the first dimension. So if we look [00:23:08] at the shape of our tensor, it's three by [00:23:10] two by two; [00:23:12] if we collapse the first dimension, that [00:23:15] two in the middle, we're left with [00:23:16] something that's three by two. [00:23:18] So it might seem a little bit trivial, [00:23:21] kind of going through this in a lot of [00:23:23] detail, but I think it's important, [00:23:25] because it can get tricky when your
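A quick sketch of the three-by-two-by-two indexing just walked through, with the same 1 through 12 values:

```python
import torch

# The 3x2x2 example tensor from the walkthrough: three 2x2 blocks
x = torch.arange(1, 13).reshape(3, 2, 2)

first_block = x[0]     # [[1, 2], [3, 4]], shape (2, 2)
first_rows = x[:, 0]   # first row of each block: [[1, 2], [5, 6], [9, 10]]
```

`x[0]` is shorthand for `x[0, :, :]`, and `x[:, 0]` collapses the middle dimension, leaving shape (3, 2).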
[00:23:26] tensor shapes get more complicated, how [00:23:29] to actually reason about this. [00:23:31] And so I won't go through every example [00:23:33] here, since a lot of them kind of [00:23:35] reinforce the same thing, but I'll just [00:23:38] highlight a few things. Just like numpy, [00:23:40] you can choose to get a range of [00:23:43] elements, [00:23:44] in this case [00:23:47] where we're taking this new tensor, which [00:23:50] is [00:23:51] one through fifteen rearranged, [00:23:54] that's a five by three tensor, and if we [00:23:57] take the zero through third row, [00:24:00] um, exclusive, we'll get the first three [00:24:02] rows, [00:24:04] and we can do the same thing but now [00:24:06] with slicing across multiple dimensions. [00:24:10] And I think the final point I want to [00:24:13] talk about here is list indexing. [00:24:16] List indexing is also present in numpy, [00:24:19] and it's a very clever shorthand for [00:24:21] being able to essentially select out [00:24:23] multiple elements at once. [00:24:26] So in this case, what you can do is, [00:24:29] if you want to get the zeroth, the second, [00:24:32] and the fourth element
[00:24:34] of our matrix, you can just, instead of [00:24:37] indexing with a particular number or set [00:24:39] of numbers, index with a list of indices. [00:24:43] So in this case, if we go up to our [00:24:45] tensor, [00:24:48] if we take out the zeroth, the second, and [00:24:51] the fourth, we should see those three [00:24:53] rows, [00:24:55] and that's what we end up getting. [00:25:00] Yeah, again, these are kind of a lot of [00:25:03] examples to just reiterate the same [00:25:05] point, which is that you can slice across [00:25:08] your data in multiple ways, and at [00:25:10] different points you're going to need to [00:25:11] do that, [00:25:12] so being familiar with the shapes, so that [00:25:15] you understand what's the underlying [00:25:17] output that you expect, is important. [00:25:19] In this case, for instance, we're slicing [00:25:22] across the first and the second [00:25:23] dimension, and we're keeping the, uh, [00:25:27] the zeroth, and so [00:25:29] we're going to end up getting [00:25:30] essentially kind of the top left [00:25:32] element of each of those three things in [00:25:35] our tensor. If we scroll all
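The range and list indexing being described can be sketched as follows, using the five-by-three tensor of 1 through 15 and the three-by-two-by-two tensor from before:

```python
import torch

t = torch.arange(1, 16).reshape(5, 3)  # 1..15 as a 5x3 tensor

first_three = t[0:3]   # rows 0 through 2 (the end index is exclusive)
picked = t[[0, 2, 4]]  # list indexing: rows 0, 2 and 4 selected at once

x = torch.arange(1, 13).reshape(3, 2, 2)
corners = x[:, 0, 0]   # top-left entry of each 2x2 block
```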
the way up here, we'll get this one, [00:25:40] this five, and this nine, [00:25:42] because we go across all of the [00:25:45] zeroth dimension, and then across the [00:25:47] first and the second, we only take the, [00:25:49] uh, [00:25:51] the zeroth element in both of those [00:25:53] positions, and so that's why we get 1, 5, [00:25:56] 9. [00:26:01] And also, of course, you can, you know, [00:26:03] apply all of the colons to get back the [00:26:05] original tensor. [00:26:11] Okay, and then I think the last thing [00:26:14] when it comes to indexing is conversions. [00:26:17] So typically, when we're writing code [00:26:20] with neural networks, ultimately we're [00:26:22] going to, [00:26:24] you know, process some data through a [00:26:26] network and we're going to get a loss, [00:26:27] and that loss needs to be a scalar, and [00:26:30] then we're going to compute gradients [00:26:31] with respect to that loss. So one thing [00:26:34] to keep in mind is that sometimes you [00:26:37] might have an operation and it fails [00:26:38] because it was actually expecting a [00:26:40] scalar value
rather than a tensor, and so [00:26:43] you can extract out the scalar from this [00:26:45] one-by-one tensor by just calling .item(). [00:26:51] So in this case, you know, if you have a [00:26:53] tensor which is just literally one, then [00:26:56] you can actually get the Python scalar [00:26:57] that corresponds to it by calling .item(). [00:27:00] So now we can get into the more [00:27:01] interesting stuff: [00:27:03] one of the really cool things with [00:27:04] pytorch is autograd. [00:27:07] And what autograd is, is PyTorch [00:27:10] essentially provides an automatic [00:27:13] differentiation package, where when you [00:27:17] define your neural network, you're [00:27:19] essentially defining many nodes that [00:27:22] compute some function, [00:27:24] and in the forward pass you're kind of [00:27:26] running your data through those nodes, [00:27:28] but what pytorch is doing on the back [00:27:30] end is that at each of those points it's [00:27:33] going to actually store the gradients [00:27:34] and accumulate them, so that every time [00:27:37] you do your backwards pass, [00:27:39] you apply the chain rule to be able
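The .item() extraction just mentioned, as a tiny sketch (the 12.0 here is an arbitrary stand-in for a loss value, not from the notebook):

```python
import torch

loss = torch.tensor([[12.0]])  # a one-by-one tensor, e.g. a loss value
value = loss.item()            # the plain Python float inside it
```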
to calculate all these different gradients, [00:27:43] and pytorch caches those gradients, [00:27:46] and then you will have access to all of [00:27:49] those gradients to be able to actually [00:27:51] then run your favorite optimizer and [00:27:54] optimize, you know, with SGD or with Adam, [00:27:57] or whichever optimizer you choose. [00:27:59] And so that's kind of one of the great [00:28:02] features: you don't have to worry about [00:28:04] actually writing the code that computes [00:28:06] all these gradients, and actually caches [00:28:09] all of them properly, applies the chain [00:28:10] rule, does all these steps. You have [00:28:13] abstracted all of that away with just one [00:28:15] call to .backward(). [00:28:17] And so in this case, we'll run through a [00:28:20] little bit of an example where we'll see [00:28:22] the gradients getting computed [00:28:24] automatically. [00:28:26] So [00:28:29] in this case we're going to initialize a [00:28:31] tensor, [00:28:32] and requires_grad is true; it [00:28:36] just means that for a given [00:28:38] tensor, [00:28:40] pytorch
will store the gradient [00:28:42] associated with it. And you might wonder, [00:28:45] well, you know, why do we have [00:28:49] this, do we always want to [00:28:51] store the gradient, and the answer is: at [00:28:54] train time you need the gradients in [00:28:55] order to actually train your network, but [00:28:58] at inference time you'd actually want to [00:28:59] disable your gradients, and you can [00:29:01] actually do that, because it's a lot of [00:29:03] extra computation that's not needed, [00:29:05] since you're not making any updates to [00:29:07] your network anymore. [00:29:09] And so let's create this right now. [00:29:13] Uh, we don't have any gradients being [00:29:16] computed, because we haven't actually [00:29:18] called backward to actually compute, [00:29:22] um, [00:29:22] some quantity with respect to this [00:29:25] particular tensor, we haven't actually [00:29:27] computed [00:29:28] those gradients yet, so right now the [00:29:31] .grad attribute, which will actually [00:29:33] store the gradient associated with that [00:29:35] tensor, is None. [00:29:37] And so now let's just define a really [00:29:39]
simple function: we have x, we're going to [00:29:42] define the function y equals 3x squared, [00:29:46] and so now we're going to call y.backward(). [00:29:50] And so now what happens is, when we [00:29:52] actually print out x.grad, what we [00:29:55] should expect to see is the number 12, and [00:29:59] the reason is that [00:30:01] our function y is 3x squared; if we [00:30:04] compute the gradient of that function, [00:30:05] we're going to get 6x, [00:30:08] and our actual value was 2, so the [00:30:12] actual gradient is going to be 12, [00:30:16] and we see that when we print out x.grad, [00:30:17] that's what we get. [00:30:21] And now we'll just run it again: let's [00:30:24] set z equal to 3x squared, we call z.backward(), [00:30:27] and we print out x.grad [00:30:29] again, and now we see that... [00:30:32] I may not have run this in the right order. [00:30:36] Okay, so [00:30:39] here in the second one that I re-ran, we [00:30:41] see that it says 24.
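The example just run can be reconstructed like this, mirroring the y = 3x squared walkthrough with x = 2:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
print(x.grad)   # None: nothing has called backward yet

y = 3 * x ** 2
y.backward()    # d(3x^2)/dx = 6x, which is 12 at x = 2
print(x.grad)   # tensor(12.)

z = 3 * x ** 2
z.backward()    # gradients accumulate rather than overwrite
print(x.grad)   # tensor(24.)
```

Calling `x.grad.zero_()` (or `optimizer.zero_grad()` in a training loop) resets the accumulated value, which is the point the lecture makes next.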
And so you might be [00:30:44] wondering, well, I just did the same thing [00:30:45] twice, shouldn't I see 12 again? [00:30:48] And the answer is that by default [00:30:50] pytorch will accumulate the gradients, so [00:30:54] it won't actually rewrite the gradient [00:30:56] each time you compute it, it will sum it, [00:30:59] and the reason is because when you [00:31:01] actually have backpropagation for your [00:31:03] network, you want to accumulate the [00:31:05] gradients, you know, across all of your [00:31:07] examples, and then actually apply your [00:31:09] update, you don't want to overwrite the [00:31:10] gradient. But this also means that every [00:31:13] time you have a training iteration for [00:31:16] your network, you need to zero out the [00:31:17] gradient, because you don't want the [00:31:19] previous gradients from the last epoch, [00:31:22] where you iterated through all of your [00:31:23] training data, to mess with the current [00:31:26] update that you're doing. [00:31:28] So that's kind of one thing to note, [00:31:30] which is [00:31:32] essentially why we will see, when [00:31:35] we
actually write the training loop, you [00:31:37] have to run zero_grad in order to zero [00:31:39] out the gradient. [00:31:40] Yes, so I accidentally ran the cells in [00:31:44] the wrong order; maybe to make it more [00:31:46] clear, let me put this one first. [00:31:52] So this is actually what it should look [00:31:54] like, which is that we ran it once, and I [00:31:56] ran this cell first, and it has 12, and [00:32:00] then we ran it a second time and we get [00:32:02] 24. [00:32:03] Yes, so if you have all of your tensors [00:32:06] defined, [00:32:07] then when you actually call [00:32:09] backward, if it's a function of multiple [00:32:11] variables, it's going to compute all of [00:32:13] those partials, all of those gradients. [00:32:15] Yeah, so what's happening here is that [00:32:17] the way pytorch works is that it's [00:32:20] storing the accumulated [00:32:23] gradient at x, and so we've essentially [00:32:26] made two different [00:32:28] backwards passes: we've called it once on [00:32:31] this function y, which is a [00:32:34] function of x, and we've called it once [00:32:35] on z, which is also a function of x, and [00:32:38] so you're right, we
can't actually [00:32:39] disambiguate which came from what, we [00:32:41] just see the accumulated gradient, but [00:32:44] typically that's actually exactly what [00:32:46] we want, because what we want is to be [00:32:48] able to run our network and accumulate [00:32:51] the gradient across all of the training [00:32:54] examples that define our loss, [00:32:55] and then perform our optimizer step. So, [00:32:58] yeah, even with respect to one thing it [00:33:00] doesn't matter, because in practice each [00:33:02] of those things is really a different [00:33:04] example in our set of training examples, [00:33:06] and so we're not interested in, you know, [00:33:08] the gradient from one example, we're [00:33:10] actually interested in the overall [00:33:11] gradient. [00:33:12] So going back to this example, [00:33:15] what's happening here is that in the [00:33:17] backwards pass, what it's doing is, [00:33:21] you can imagine there's the x tensor, and [00:33:23] then there's the .grad attribute, [00:33:25] which is another separate tensor, it's [00:33:26] going to be the same shape as x, [00:33:28] and what that is storing is it's
storing the accumulated gradient from every single time that you've called .backward on a quantity that has some dependency on x that will have a non-zero gradient. [00:33:44] And so the first time we call it, the gradient will be 12, because 6x: 6 times 2 is 12. The second time we do it, with z, it's also still 12. But the point is that .grad doesn't actually overwrite the gradient each time you call .backward; it simply adds them, it accumulates them. [00:34:02] And the intuition there is that ultimately you're going to want to compute the gradient with respect to the loss, and that loss is going to be made up of many different examples, and so you need to accumulate the gradient from all of those in order to make a single update. [00:34:18] And then of course you'll have to zero that out.
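The 12-then-24 behavior just described takes only a few lines to reproduce. This is a minimal sketch, not the notebook's exact code: the function 3x² is chosen so the gradient 6x at x = 2 is 12, matching the numbers in the lecture.

```python
import torch

# A leaf tensor tracked by autograd.
x = torch.tensor(2.0, requires_grad=True)

# y = 3x^2, so dy/dx = 6x, which is 12 at x = 2.
y = 3 * x ** 2
y.backward()
print(x.grad)  # tensor(12.)

# z is also 3x^2; calling backward again ADDS to x.grad.
z = 3 * x ** 2
z.backward()
print(x.grad)  # tensor(24.) -- accumulated, not overwritten

# Zero it before the next pass (an optimizer's zero_grad()
# does this for every parameter it manages).
x.grad.zero_()
print(x.grad)  # tensor(0.)
```

The key point is the second print: `.grad` holds 24, the sum of the two backward passes, until it is explicitly zeroed.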
Every time you make one pass through all of your data, you don't want that next batch of data to also be double counting the previous batch's update; you want to keep those separate, and we'll see that in a second. [00:34:38] All right, so now we're going to move on to one of the final pieces of the puzzle, which is neural networks: how do we actually use them in PyTorch? And once we have that, and we have our optimization, we'll finally be able to figure out how we actually train a neural network, what that looks like, and why it's so clean and efficient when you do it in PyTorch. [00:35:02] So the first thing we're going to do is define neural networks in terms of existing building blocks, in terms of existing APIs, which implement for instance the linear layers or the different activation functions that we need. So we're going to import torch.nn, because that is the neural network package that we're going to make use of, and so let's
start with the linear layer. [00:35:27] The way the linear layer works in PyTorch is that it takes in two arguments: the input dimension and then the output dimension. Pretty much what it does is take in some input which has some arbitrary number of dimensions, with the input dimension last, and output that same set of dimensions except with the output dimension in the very last place. [00:35:57] And you can think of the linear layer as essentially just performing a simple Ax + b; by default it's going to apply a bias, but you can also disable that if you don't want a bias term. [00:36:13] And so let's look at a small example. Here we have our input, and we're going to create a linear layer, in this case with an input size of four and an output size of two. And all we're going to do, once we define it by instantiating it with nn.Linear,
whatever the name of our layer is (in this case we called it linear), is just apply it with parentheses, as if it were a function, to whatever input, and that actually does the forward pass through this linear layer to get our output. [00:37:01] And so you can see that the original shape was two by three by four; then we pass it through this linear layer, which has an output dimension of size two, and so ultimately our output is two by three by two, which is good: that's what we expect, there's no shape error. [00:37:18] But something common that you'll see is that maybe you get a little confused and you do, let's say, two by two; you match the wrong dimension, and so here we're going to get a shape error. And you'll see that the error message isn't as helpful, because it's actually changed the shape of what we were working with: we said this was two by three by
four; [00:37:45] under the hood, PyTorch has changed it to a six by four. In this case it's obvious, because we instantiated the input with that shape, but if we didn't know the shape, then one simple thing we could do is actually just print out the shape, and we'd see that this last dimension is size four, so I actually need to change the input dimension in my linear layer to be size four. [00:38:14] And you'll also notice on this output we have this grad_fn; that's because PyTorch is tracking the operations on this tensor so that it can compute and store gradients for it. [00:38:32] Yeah, so typically we think of the first dimension as the batch dimension. In this case it said N; you can think of it as, if you had a batch of images, it would be the number of images; if you had a training corpus of text, it would be essentially the number of sentences or sequences. Pretty much, that is usually
considered the batch dimension. The star indicates that there can be an arbitrary number of dimensions; so for instance, if we had images, this could be a four-dimensional tensor: the batch size by the number of channels by the height by the width. [00:39:07] But in general there's no fixed number of dimensions; your input tensor can be any number of dimensions. The key is just that that last dimension needs to match up with the input dimension of your linear layer. [00:39:20] The two is the output size, so essentially we're saying that we're going to map this last dimension, which is four-dimensional, to now two-dimensional. So in general, if we're stacking a neural network, you can think of this as the input dimension size and this as the hidden dimension size.
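The shape behavior just described can be sketched in a few lines. The 2 x 3 x 4 input and the 4-to-2 layer follow the lecture's example; the variable names and the commented error line are illustrative, not the notebook's exact code.

```python
import torch
import torch.nn as nn

# Input with arbitrary leading (batch/extra) dimensions; only the
# last dimension (4) must match the layer's input size.
x = torch.randn(2, 3, 4)

# Map the last dimension from size 4 to size 2 (computes x @ W.T + b).
linear = nn.Linear(4, 2)

output = linear(x)
print(output.shape)  # torch.Size([2, 3, 2])

# Mismatching the last dimension raises a shape error, e.g.:
# nn.Linear(2, 2)(x)  # RuntimeError (shapes cannot be multiplied)

# The layer's parameters: a (2, 4) weight matrix and a (2,) bias vector.
for name, p in linear.named_parameters():
    print(name, tuple(p.shape))
```

Only the last dimension changes; every leading dimension passes through untouched.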
[00:39:44] And so one thing we can do is actually print out the parameters, and we can see what the values of our linear layer are, or in general, for any layer that we define in our neural network, what the actual parameters are. [00:39:58] In this case we see that there are two sets of parameters, because we have a bias as well as the weight of the linear layer itself, and both of them store gradients. These are what the current values of those parameters are, and they'll change as we train the network. [00:40:24] Okay, so now let's go through some of the other module layers. So in general, nn.Linear is one of the layers you have access to; you have a couple of other different layers that are pretty common: you have 2D convolutions, you have transpose convolutions, you have batch norm layers for when you need to do normalization in your network; you can do upsampling, you can do max
pooling, you can do lots of different operators. But the main key here is that all of them are built-in building blocks that you can just call, just like we did with nn.Linear. [00:41:01] And so, I guess I'm running out of time, but let's try to go through these last few layers, and then I'll wrap up by showing an example that puts it all together. [00:41:11] So in this case we can define an activation function, which is typical with our networks: we need to introduce non-linearities, and in this case we use the sigmoid function. And so now we can define our network as this very simple thing, which has one linear layer and then an activation. [00:41:29] And in general, when we compose these layers together, we don't need to actually write every single line, applying the next layer line by line; we can actually stack all of them together, in this case using nn.Sequential, and list all of the layers.
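A minimal sketch of that stacking pattern; the sizes here are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn

# Compose a linear layer and a sigmoid activation into one callable block.
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid(),
)

x = torch.randn(5, 4)
out = block(x)  # one call runs the input through every layer in order
print(out.shape)  # torch.Size([5, 2])
```

Because sigmoid is the last layer, every value of `out` lies strictly between 0 and 1.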
[00:41:44] So here we have our linear layer followed by our sigmoid, and now we're just essentially passing the input through this whole set of layers all at once: we take our input, we call block on the input, and we get the output. [00:42:02] And so let's just see, putting it all together, what it looks like to define a network and what it looks like when we train one. [00:42:08] So here we're going to actually define a multi-layer perceptron, and the way it works is: to define a neural network, you extend the nn.Module class. The key here is that there are really two main things you have to define when you create your own network. One is the initialization: in the __init__ function you actually initialize all the parameters you need; in this case we initialize an input size and a hidden size, and we actually define the model itself, in this case a simple
model which consists of a linear layer, followed by an activation, followed by another linear layer, followed by a final activation. [00:42:45] And the second function we have to define is forward, which actually does the forward pass of the network. And so here our forward function takes in our input x; in general it could take in some arbitrary number of inputs. But essentially it needs to specify how you actually compute the output, and in this case it's very simple: we just pass x into the network that we just defined and return the output. [00:43:14] And again, you could do this more explicitly, by doing what we did earlier: actually writing out all of the layers individually, instead of wrapping them into one object, and doing a line-by-line operation for each one of these layers.
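Putting the two methods together, the class just described might look roughly like this. The layer sizes, the ReLU activation, and the class name are illustrative assumptions, not the notebook's exact code.

```python
import torch
import torch.nn as nn

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Initialization: store the sizes and define the model itself:
        # linear -> activation -> linear -> final activation.
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, input_size),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # The forward pass: just run the input through the stack.
        return self.model(x)

# Instantiate and do a forward pass.
model = MultilayerPerceptron(input_size=5, hidden_size=3)
x = torch.randn(2, 5)
print(model(x).shape)  # torch.Size([2, 5])
```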
[00:43:34] And so finally, once we define our class, it's very simple to use: we can now just instantiate some input, instantiate our model by calling MultilayerPerceptron with our parameters, and then just pass the input through our model. [00:43:48] So that's great, but this is all just a forward pass; how do we actually train the network, how do we actually make it better? [00:43:55] And so this is the final step, which is that we have optimization built into PyTorch. We have this backward function, which goes and computes all these gradients in the backward pass, and now the only step left is to actually update the parameters using those gradients. And so here we'll import the torch.optim package, which contains all the optimizers that you need. [00:44:18] Essentially this part is just creating some random data so that we have something to fit. But this is really the key here, which is: we'll instantiate our model that we defined, we'll define the Adam optimizer
[00:44:35] and we'll define it with a particular learning rate. We'll define a loss function, which is again another built-in module; in this case we're using the cross-entropy loss. [00:44:44] And finally, to calculate our predictions, all we do is simply call the model on our actual input, and to calculate our loss, we just call our loss function on our predictions and our true labels, and we extract the scalar here. [00:45:00] And now, when we put it all together, this is what the training loop looks like. We have some number of epochs that we want to train our network for. For each of these epochs, the first thing we do is take our optimizer and zero out the gradient, and the reason we do that is because, as many of you noted, we actually are accumulating the gradient; we're not resetting it every time we call .backward. So we zero out the gradient, we get our model predictions
by doing a forward pass. [00:45:28] We then compute the loss between the predictions and the true values. Finally we call loss.backward(); this is what actually computes all the gradients in the backward pass from our loss. And the final step is that we call .step() on our optimizer, in this case Adam, and this will take a step on our loss function. [00:45:52] And so if we run this code, we end up seeing that we start with some training loss which is relatively high, and in 10 epochs we're able to essentially completely fit our data. [00:46:02] And if we print out our model parameters, having printed them out from the start as well, we'd see that they've changed as we've actually done this optimization.
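The loop just described can be sketched end to end. This is an illustrative version rather than the notebook's code: the random data, the sizes, and the MSE loss (standing in for the lecture's cross-entropy, to keep this toy regression self-contained) are my assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy random data to fit (stand-in for the notebook's generated data).
torch.manual_seed(0)
x = torch.randn(10, 5)
y = torch.rand(10, 1)

# Model, optimizer with a particular learning rate, and loss function.
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
optimizer = optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # the lecture used cross-entropy; MSE keeps this toy self-contained

n_epochs = 10
losses = []
for epoch in range(n_epochs):
    optimizer.zero_grad()      # zero out the accumulated gradients
    preds = model(x)           # forward pass
    loss = loss_fn(preds, y)   # loss between predictions and true values
    loss.backward()            # backward pass: compute all the gradients
    optimizer.step()           # update the parameters with those gradients
    losses.append(loss.item())
    print(epoch, loss.item())
```

Running this, the printed loss starts relatively high and drops over the epochs, mirroring the behavior described in the lecture.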
[00:46:12] So I'll wrap it up here, but I think the key takeaway is that a lot of the things that you're doing at the beginning of this class are really about understanding the basics of how neural networks work: how you actually implement them, how you implement the backward pass. The great thing about PyTorch is that once you get to the very next assignment, you'll see that, now that you have a good underlying understanding of those things, you can abstract away a lot of the complexity (how you do backprop, how you store all these gradients, how you compute them, how you actually run the optimizer) and let PyTorch handle all of that for you, and you can use all of these building blocks, all these different neural network layers, to define your own networks that you can use to solve whatever problems you need.

================================================================================ LECTURE 023 ================================================================================

Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel
Source: https://www.youtube.com/watch?v=b80by3Xk_A8

--- Transcript

[00:00:04] Hi everyone, welcome to the 224N Hugging Face Transformers tutorial. So this tutorial is just going
to be about using the Hugging Face library. [00:00:17] It's a really useful and super effective way of being able to use some off-the-shelf NLP models, specifically models that are Transformer-based, and being able to use those for either your final project, your custom final project, or just using it in the future. So it's a really helpful package to learn, and it interfaces really well with PyTorch in particular, too. [00:00:45] Okay, so first things first: in case there's anything else that you're missing from this tutorial, the Hugging Face documentation is really good; they also have lots of tutorials and walkthroughs, as well as other notebooks that you can play around with. So if you're ever wondering about something else, that's a really good place to look.
[00:01:10] Okay, so in the Colab, the first thing we're going to do (which I already did, but can maybe run again) is just installing the transformers Python package and then the datasets Python package. So these correspond to Hugging Face Transformers and Datasets, and those are really helpful: transformers is where we'll get a lot of these pre-trained models from, and datasets will give us some helpful datasets that we can potentially use for various tasks, in this case sentiment analysis. [00:01:41] Okay, and so we'll use a bit of a helper function to help us understand what encodings are actually happening, so we'll run this just to kick things off and import a few more things. [00:01:59] Okay, so first: this is generally the step-by-step for how to use something off of Hugging Face. So first, what we'll do is we'll
find some model from the Hugging Face Hub here. [00:02:12] And note that there are a ton of different models that you're able to use: there's BERT, there's GPT-2, there's T5-small, which is another language model from Google. So there are a bunch of these different models that are pre-trained, and all of their weights are up here on Hugging Face, freely available for you to download. So if there's a particular model you're interested in, you can probably find a version of it here. [00:02:40] You can also see different types of models on the side for a specific task; so if we wanted to do something like zero-shot classification, there are a couple of models that are specifically good at doing that particular task. [00:02:55] Okay, so based off of what task you're looking for, there's probably a Hugging Face model for it that's available online for
[00:03:02] okay, so that's what we'll do first: we'll go ahead and find a model on the Hugging Face Hub for whatever you want to do, in this case sentiment analysis. And then there are two things that we need next: the first is a tokenizer, for actually splitting your input text into tokens that your model can use, and the second is the actual model itself. The tokenizer converts the text to vocabulary IDs, these discrete IDs that your model can actually take in, and the model will produce some prediction based off of that. [00:03:41] okay, so first what we can do is import this AutoTokenizer and this AutoModelForSequenceClassification. What this will do initially is download some of the key things that we need so that we can actually initialize these.
[00:04:00] So what does each of these do? First the tokenizer: this AutoTokenizer is from some pre-trained tokenizer that has already been used, and in general there's a corresponding tokenizer for every model that you want to try and use; in this case it's SiEBERT, so something around sentiment and RoBERTa. And second, you can import this model for sequence classification as well, from something pre-trained on the Model Hub again, so this corresponds to sentiment-roberta-large-english, and if we want we can even find this over here; I think, yeah, large English. So again this is something we can easily find: you just copy this string up here and then you can import that. [00:04:52] okay, we've downloaded all of the things that we need, some binary files as well, and now we can go ahead and actually use it.
[00:05:02] So this gives you some input, right, this input string "I'm excited to learn about Hugging Face Transformers"; we'll get some tokenized inputs here after we pass it through the tokenizer, and then lastly we'll get some notion of the model output, so this is some logits here over whatever classification we have, in this case good or bad, and then some corresponding prediction. [00:05:35] okay, and we'll walk through what this looks like in just a second in a little more depth, but this is broadly how we can actually use these together: we'll tokenize some input and then we'll pass those inputs to the model. [00:05:47] So we'll talk about tokenizers first. Tokenizers are used for basically just pre-processing the inputs that you get for any model.
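The overall flow just described, tokenize the input, run the model, read off logits and a prediction, can be sketched with toy stand-ins. Everything below (the vocabulary, the IDs, the "model") is invented for illustration; in the actual Colab these roles are played by the Hugging Face tokenizer and the pre-trained sentiment model.

```python
# Toy sketch of the tokenize -> model -> prediction flow described above.
# The vocabulary and "model" are made up; a real Hugging Face tokenizer and
# model do the same jobs with learned vocabularies and weights.

TOY_VOCAB = {"[CLS]": 101, "[SEP]": 102, "good": 2204, "movie": 3185, "bad": 2919}

def toy_tokenize(text):
    """Split on whitespace and map each token to its vocabulary ID."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [TOY_VOCAB[t] for t in tokens]

def toy_model(input_ids):
    """Return fake logits over two classes: [negative, positive]."""
    score = sum(1 for i in input_ids if i == 2204) - sum(1 for i in input_ids if i == 2919)
    return [-float(score), float(score)]

def predict(text):
    logits = toy_model(toy_tokenize(text))
    return "positive" if logits[1] > logits[0] else "negative"

print(predict("good movie"))  # -> positive
```

The shape is the same as in the notebook: a string goes in, discrete IDs come out of the tokenizer, the model turns those into logits, and the prediction is whichever class has the larger logit.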
[00:05:58] A tokenizer takes some raw string and essentially maps it to some number or ID that the model can take in and actually understand. Tokenizers are either specific to the model that you want to use, or you can use the AutoTokenizer, which will conveniently import whatever corresponding tokenizer you need for that model type. That's the helpfulness of the AutoTokenizer: it'll make that selection for you and make sure that you get the correct tokenizer for whatever model you're using. [00:06:34] So the question is, does it make sure that everything is mapped to the correct index that the model was trained on? The answer is yes, and that's why the AutoTokenizer is helpful. [00:06:44] So there are two types of tokenizers: there's a Python tokenizer, and there's also a tokenizer fast, which is written in Rust.
[00:06:58] In general, if you use the AutoTokenizer it'll just default to the fast one. There's not really a huge difference here; it's just about the time it takes to get the outputs. [00:07:08] Yeah, so the question is whether the tokenizer creates dictionaries of the model inputs. I think the way to think about a tokenizer is like that dictionary, almost, right: you want to translate, or have this mapping from, the tokens that you get from the string into some inputs that the model will actually use. We'll see an example of that in just a second. [00:07:39] So for example, we can call the tokenizer the way we would call a typical PyTorch model, but we're just going to call it on a string.
[00:07:48] So here our input string is "Hugging Face Transformers is great"; we pass that into the tokenizer almost like it's a function, and then we'll get out some tokenization. This gives us a set of input IDs, so to answer the earlier question, these are basically the numbers that each of these tokens represents, so that the model can actually use them, and then a corresponding attention mask for the particular Transformer. [00:08:21] okay, so there are a couple of ways of accessing the actual tokenized input IDs: you can treat the output like a dictionary, hence thinking about it almost in that dictionary form, or it's also just a property of the output that you get, so there are two ways of accessing this in a pretty Pythonic way. [00:08:43] So what we can see as well is that we can look at the actual tokenization process, and this can maybe give some insight into what happens at each step.
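Those per-step internals can be sketched with a toy vocabulary. The vocabulary and the special-token IDs below are invented; a real tokenizer does the same three things (split into tokens, convert tokens to IDs, add special tokens) with its own learned vocabulary.

```python
# Illustrative sketch of the three tokenization steps discussed in the
# lecture. VOCAB and the [CLS]/[SEP] IDs are made up for the example.

VOCAB = {"hugging": 5, "face": 6, "transformers": 7, "is": 8, "great": 9}
CLS_ID, SEP_ID = 101, 102  # invented special-token IDs

def tokenize(text):                    # step 1: split the string into tokens
    return text.lower().split()

def convert_tokens_to_ids(tokens):     # step 2: look each token up in the vocab
    return [VOCAB[t] for t in tokens]

def add_special_tokens(ids):           # step 3: wrap with [CLS] ... [SEP]
    return [CLS_ID] + ids + [SEP_ID]

tokens = tokenize("Hugging Face Transformers is great")
ids = convert_tokens_to_ids(tokens)
print(add_special_tokens(ids))  # -> [101, 5, 6, 7, 8, 9, 102]
```

A real fast tokenizer fuses these steps when you call it on a string, but conceptually this is the pipeline being walked through.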
[00:08:57] So our initial input string is going to be "Hugging Face Transformers is great". The next step is that we actually want to tokenize these individual words that are passed in, so here this is the output of this tokenization step: we get these individual split tokens, we'll convert them to IDs here, and then we'll add any special tokens that our model might need for actually performing inference on this. [00:09:33] So there are a couple of steps that happen underneath when you use a tokenizer, a few things at a time. [00:09:44] One thing to note is that for fast tokenizers there are other options that you're able to get to as well: you have this input string, you have the number of tokens that you get, and you might have some notion of the special token mask as well.
[00:10:03] So using char_to_word is going to give you the word a particular character in the input belongs to; this is just giving you additional options that you can use with the fast tokenizer for understanding how the tokens are derived from the input string. [00:10:25] okay, so there are different ways of using the outputs of these tokenizers too. One is that if you indicate that you want it to return a tensor, it can also return a PyTorch tensor, which is great in case you need a PyTorch tensor, which you probably generally want. [00:10:49] You can also pass multiple strings into the tokenizer and then pad them however you need; so here, for example, we can see the pad token being this [PAD] bracket, and its token ID is going to correspond to zero.
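The padding being introduced here can be sketched in plain Python. The pad ID of 0 matches the lecture's example; the sequences are invented, and a real tokenizer does this internally when you pass padding=True.

```python
# Minimal sketch of what padding does: pad every sequence in a batch to the
# length of the longest one, and mark real tokens vs. padding in the
# attention mask (1 = attend, 0 = ignore). Sequences here are invented.

PAD_ID = 0  # pad token ID, as in the lecture's example

def pad_batch(batch):
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)            # add pad tokens
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # zeros over padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(out["input_ids"])       # -> [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(out["attention_mask"])  # -> [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

This is exactly the pairing the lecturer points out: padding tokens in the input IDs, matched by zeros in the attention mask.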
[00:11:09] So it's just going to add padding to whatever input you give: if you need your outputs to be the same length for a particular type of model, this will add those padding tokens and then correspondingly gives you the zeros in the attention mask where you actually need them. [00:11:28] okay, and the way to do that here is you basically set padding to be true; you can also set truncation to be true as well. And if there are any other features of the tokenizer that you're interested in, again you can check the Hugging Face documentation, which is pretty thorough about what each of these things does. [00:11:52] Yeah, so the question is about the '##' pieces, and whether that means that we should have a space before or not.
[00:12:09] So in this case we probably don't want the space before, right, just because "hugging" is all one word. Generally, for the tokenizers, the output that they give is still pretty consistent in terms of how the tokenization process works; there might be instances where it's contrary to what you might expect for how something is tokenized, but in general the tokenization works fine, so in most cases the direct output that you get from the Hugging Face tokenizer is sufficient. [00:12:57] okay, awesome. So one last thing, past adding additional padding, is that you can also decode an entire batch at one given time.
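A batch decode like the one about to be shown can be sketched with a toy ID-to-token table. The table and IDs below are invented; the real tokenizer.batch_decode, with skip_special_tokens=True, behaves analogously over its own vocabulary.

```python
# Toy sketch of batch decoding with the option to skip special tokens.
# ID_TO_TOKEN and SPECIAL_IDS are made up for illustration.

ID_TO_TOKEN = {101: "[CLS]", 102: "[SEP]", 0: "[PAD]", 7: "transformers", 9: "great"}
SPECIAL_IDS = {101, 102, 0}  # [CLS], [SEP], [PAD]

def batch_decode(batch_ids, skip_special_tokens=False):
    decoded = []
    for ids in batch_ids:
        if skip_special_tokens:
            ids = [i for i in ids if i not in SPECIAL_IDS]  # drop specials
        decoded.append(" ".join(ID_TO_TOKEN[i] for i in ids))
    return decoded

batch = [[101, 7, 102, 0], [101, 9, 102, 0]]
print(batch_decode(batch))                            # specials kept in the text
print(batch_decode(batch, skip_special_tokens=True))  # -> ['transformers', 'great']
```

The first call shows the padding and special tokens leaking into the decoded strings, which is why the skip option exists.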
[00:13:14] So if we look again, our tokenizer will have this method called batch_decode. If we have the model inputs that we get up here, the output of passing these sentences or strings into the tokenizer, we can go ahead and just pass the input IDs that correspond to that into batch_decode, and it'll give us the decoding that corresponds to all the padding we added and each of the particular words and strings. And if you want to ignore the presence of these padding tokens or anything like that, you can also pass in skip_special_tokens. [00:13:59] Gotcha. So this is a pretty high-level overview of how you would want to use tokenizers when using Hugging Face. [00:14:10] So now we can talk about how to use the Hugging Face models themselves.
[00:14:16] So again, this is pretty similar to what we saw for initially using a tokenizer: you just choose the specific model type for your model and use that, or the specific AutoModel class, where again this AutoModel takes care of the initialization process for you in a pretty easy way without too much overhead. [00:14:46] Additionally, the pre-trained Transformers that we have generally share the same underlying architecture, but you'll have different heads associated with each Transformer, heads you might have to train if you're doing some sequence classification or just some other task; Hugging Face will do this for you, and I will walk through an example of how to do this for sentiment analysis.
[00:15:16] So if there's a specific context like sequence classification we want to use, we can use the very specific class Hugging Face provides, so DistilBertForSequenceClassification. Alternatively, if we were using DistilBERT in a masked language model setting, we'd use DistilBertForMaskedLM, and lastly, if we're using it purely for the representations that we get out of DistilBERT, we just use the baseline model. So the key takeaway here is that there are task-specific classes that we can use from Hugging Face to initialize. [00:15:56] So AutoModel again is similar to the AutoTokenizer: it's just going to load that specific model by default, and in this case that's going to be just the basic weights that you need.
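The "same backbone, different heads" idea behind those classes can be sketched in a few lines of toy code. Everything here is invented (a fake one-number-per-token "encoder" and threshold "head"); in Hugging Face the analogous pair would be the bare DistilBertModel versus DistilBertForSequenceClassification.

```python
# Sketch of one shared encoder backbone with two different "heads":
# returning raw representations vs. a classification decision.
# All numbers and logic are invented for illustration.

def encoder(input_ids):
    """Shared backbone: pretend each token maps to a 1-D 'hidden state'."""
    return [float(i) / 10.0 for i in input_ids]

def base_model(input_ids):
    """Bare model: just return the representations (like DistilBertModel)."""
    return encoder(input_ids)

def sequence_classifier(input_ids):
    """Classification head on top: pool the hidden states, then threshold."""
    hidden = encoder(input_ids)
    pooled = sum(hidden) / len(hidden)
    return 1 if pooled > 0.5 else 0  # two-class prediction

print(base_model([3, 9]))           # raw per-token representations
print(sequence_classifier([3, 9]))  # a class label from the head
```

The point is that both functions share `encoder`; only the small layer on top differs, which is why the task-specific classes can all reuse the same pre-trained weights.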
[00:16:19] So here we'll have basically three different types of models that we can look at: one is an encoder-type model, which is BERT; a decoder-type model like GPT-2, which is doing things like generating some text, potentially; and encoder-decoder models, so BART or T5 in this case. So again, if you go back to the Hugging Face Hub, there's a whole sort of different types of models that you could potentially use, and if we look in the documentation as well, we can understand some notion of the different types of classes that we might want to use, right: there's the AutoTokenizer, and different AutoModels for different types of tasks. So here again, if you have any specific use cases that you're looking for, you can check the documentation.
[00:17:18] Again, if you use an AutoModel from_pretrained, you'll just create a model that's an instance of the underlying model class, in this case a BertModel for the BERT-based case. [00:17:31] okay, before we go ahead and start, one last thing to note is that the particular choice of your model matches up with the type of architecture that you have to use, right: these different types of models can perform specific tasks, so you're not going to be able to load or use BERT, for instance, or DistilBERT, as a sequence-to-sequence model, which requires the encoder and decoder, because DistilBERT only consists of an encoder. So there's a bit of a limitation on how exactly you can use these, but it's basically based on the model architecture itself. [00:18:16] okay, awesome, so let's go ahead and get started here.
[00:18:21] So similarly, here we can import AutoModelForSequenceClassification; again, we're going to perform some classification task, and we'll import this AutoModel here so that we don't have to reference something like DistilBertForSequenceClassification: we'll be able to load it automatically and it'll be all set. Alternatively, we can do DistilBertForSequenceClassification here, and that specifically will require DistilBERT to be the input there. okay, so these are two different ways of basically getting the same model, one using the AutoModel and one using explicitly DistilBERT. [00:19:02] Cool. And here, because it's classification, we need to specify the number of labels, or the number of classes that we're actually going to classify for each of the input sentences. [00:19:13] okay, so here we'll get a warning if you are following along and you print this out.
[00:19:22] That's because some of the sequence classification parameters aren't trained yet, and so we'll go ahead and take care of that. [00:19:30] So here, similarly, we'll walk through how to actually train some of these models. The first question is: how do you actually pass any of the inputs that you get from a tokenizer into the model? Well, we get some model inputs from the tokenizer up here, and we pass these into the model by specifying that the input IDs are the input IDs from the model inputs, and likewise we can specifically pass in that the attention mask is going to correspond to the attention mask that we got from these outputs of the tokenizer. [00:20:14] okay, so this is option one, where you specifically identify which property goes to what.
Pythonic hack, almost, where you can directly pass in the model inputs. This will basically unpack the keys of the model inputs: the input_ids key corresponds to the input_ids argument, and the attention_mask key corresponds to the attention_mask argument. So when we use this star-star syntax, it unpacks our dictionary and maps each entry to the argument with the same name. This is an alternative way of passing it into the model; both are going to be the same. [00:21:05] Okay, so now what we can do is actually print out what the model outputs look like. Again, these are the inputs, the token IDs and the attention mask, and then second we'll get the actual model outputs. Notice that the outputs are given by these logits here; there's two of them. We passed in one example, and
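The two calling styles can be demonstrated without any model at all; `fake_model` below is a hypothetical stand-in whose keyword parameters mirror the names a Hugging Face model expects:

```python
def fake_model(input_ids=None, attention_mask=None):
    # Stand-in for model(...); real models take these same keyword names.
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# What a tokenizer hands back: a dict keyed by parameter name.
model_inputs = {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]}

# Option 1: name each argument explicitly.
out1 = fake_model(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
)

# Option 2: ** unpacks the dict, mapping each key to the parameter
# of the same name.
out2 = fake_model(**model_inputs)

assert out1 == out2  # both calls are equivalent
```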
there's kind of two potential classes that we're trying to classify. Okay, and then lastly we have, of course, the corresponding distribution over the labels here, since this is going to be binary classification. Yes, it's a little bit weird that you have two classes for a binary classification task, and you could basically just choose to classify one class or not, but we do this just because of how Hugging Face models are set up. [00:21:57] Additionally, these models that we load in from Hugging Face are basically just PyTorch modules. These are the actual models, and we can use them in the same way that we've been using models before. That means things like loss.backward() will actually do the backpropagation step corresponding to the loss of the inputs that you
pass in. So it's really easy to train these guys: as long as you have a label for your data, you can calculate your loss using the PyTorch cross-entropy function, you get some loss back, and then you can go ahead and backpropagate it. You can even get the parameters in the model that would get updated from this; this is just some big tensor of the actual embedding weights that you have. [00:22:57] Okay, we also have a pretty easy way for Hugging Face itself to calculate the loss that we get. So again, if we tokenize some input string, we get our model inputs; we have two labels, positive and negative; we give some corresponding label that we assign to the model inputs, and we pass this in. We can see here that the actual model outputs given by Hugging
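Because these are plain PyTorch modules, the usual training idiom applies. A sketch with a tiny `nn.Linear` standing in for the Hugging Face model (the stand-in is my simplification; in practice you'd use the real model's logits):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)     # stand-in producing 2 logits, like binary classification
inputs = torch.randn(1, 4)  # one encoded example
labels = torch.tensor([1])  # gold label for that example

logits = model(inputs)
loss = nn.functional.cross_entropy(logits, labels)  # PyTorch cross-entropy
loss.backward()             # ordinary backpropagation

# gradients are now populated on the model's parameters
assert model.weight.grad is not None
```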
Face include this loss here, right? It'll include the loss corresponding to that input. Anyway, it's a really easy way of calculating the loss natively in Hugging Face, without having to call anything additional from the PyTorch library. [00:23:44] And lastly, if we have these two labels here, again positive or negative, what we can do is just take the model outputs, look at the logits, and see which one is the biggest. We pass that to the argmax, which gives the index that's largest, and that's the output label the model is actually predicting. So again, it's a really easy way of doing this sort of classification, getting the loss, getting what the actual labels are, just from within Hugging Face. [00:24:22] Okay, awesome. So the last thing as well is that we can
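The argmax step over the two logits looks like this (the logit values here are made up for illustration):

```python
import torch

id2label = {0: "negative", 1: "positive"}

logits = torch.tensor([[-0.3, 1.2]])        # one example, two classes
probs = torch.softmax(logits, dim=-1)       # distribution over the labels
pred = torch.argmax(logits, dim=-1).item()  # index of the largest logit
print(id2label[pred])                       # -> positive
```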
also look inside the model in a pretty cool way, and see what attention weights the model actually has. This is helpful if you're trying to understand what's going on inside some NLP model. So here we can again import our model from some pretrained model weights in the Hugging Face Hub; we want to set output_attentions to true and output_hidden_states to true. These are going to be the key arguments we use when we're actually investigating what's going on inside the model at each point in time. Again, we'll set the model to be in eval mode, and lastly we'll go ahead and tokenize our input string again. We don't really care about any of the gradients here;
again, we don't actually want to backpropagate anything here, and finally we pass in the model inputs. So now, when we print out the model hidden states (this is a new property in the output dictionary that we get), we can look at what these actually look like. And sorry, this is a massive output. [00:25:54] So you can actually look at the hidden-state size per layer, which gives a notion of what we're going to be looking at, what the shape is at each given layer in our model, as well as the attention-head size per layer, so this gives you the shape of what you're looking at. And then if we look at the model output itself, we'll get all of these different hidden states; we have tons and tons of these different hidden states, and we'll have the last hidden state here.
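Put together, the inspection setup might look like the following sketch; the `distilbert-base-uncased` checkpoint and the input string are my stand-ins for whatever the notebook uses. For DistilBERT you get 6 attention tensors (one per layer) of shape (batch, num_heads, seq_len, seq_len) and 7 hidden-state tensors (the embeddings plus 6 layers) of shape (batch, seq_len, 768):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,     # return per-layer attention weights
    output_hidden_states=True,  # return per-layer hidden states
)
model.eval()                    # inference mode: disables dropout etc.

inputs = tokenizer("Hugging Face Transformers is great", return_tensors="pt")
with torch.no_grad():           # no gradients needed for inspection
    out = model(**inputs)

print(len(out.hidden_states), out.hidden_states[0].shape)
print(len(out.attentions), out.attentions[0].shape)
```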
So the model output is pretty robust for showing you what the hidden state looks like, as well as what the attention weights actually look like. In case you're trying to analyze a particular model, this is a really helpful way of doing that. [00:26:45] So, the question is: what does model.eval() do? What it does (and this is true for any PyTorch module or model) is set it into quote-unquote eval mode. Again, here we're not trying to calculate any gradients or anything like that corresponding to data that we pass in, or trying to update our model in any way; we just care about evaluating it on that particular data point. So for that it's helpful to set the model into eval mode, essentially to make sure that it disables some of
that stuff that you'd use during training time, so it just makes it a little more efficient. [00:27:37] Yeah, the question was: it's already pretrained, so can you go ahead and evaluate it? Yeah, you can; this is just the raw pretrained model with no fine-tuning. [00:27:47] So the question is how to interpret these shapes, for the attention-head size and the hidden-state size. The key thing here is that you'll want to look at the shape given on the side; it corresponds to the layer that you're actually looking at. So here, when we looked at the shape, we were specifically looking at the first one in this list, so this gives us the first hidden layer. The second gives us a notion of the batch that we're looking at, and
then the last is some tensor, a 768-dimensional representation that corresponds there. And for the attention-head size, it corresponds to the actual query word and the key word for these last two here. [00:28:48] For this, we would expect the initial index here, the one, to be bigger if we printed out all of the layers, but we're just looking at the first one here. [00:29:01] So we can also get some notion of how this actually looks and plot out these axes as well. So again, if we take this same kind of model input (again, this Hugging Face Transformers library is great), we're actually trying to see what these representations look like on a per-layer basis. So what we can do here
[00:29:30] is basically, for each layer that we have in our model (and again, this is purely from the model output attentions, the actual outputs of the model), and then for each head, we can analyze essentially what these representations look like, and in particular what the attention weights are across each of the tokens that we have. This is a good way of understanding what your model is actually attending to within each layer. On the side, if we look here (maybe zoom in a bit), we can see that this corresponds to the different layers, and the top corresponds to the different attention heads. Okay, this will just give you some notion of what the weights are. [00:30:21] So again, just to clarify, if we maybe look at the labels, sorry,
[00:30:25] it's a little cut off and zoomed out, but this y-axis here, these different rows, corresponds to the different layers within the model. On the x-axis we have the different attention heads that are present in the model. And so for each head, at each layer, we can basically get a sense of how the attention distribution is being distributed: what's being attended to, corresponding to each of the tokens that you actually get here. So if we look up again here as well, we're just trying to look at basically the model attentions that we get for each corresponding layer. [00:31:17] The question is: what's the color key? Yellow is higher magnitude, a higher value, and darker is closer to zero, so probably very
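The layer-by-head grid can be sketched with matplotlib; the random tensors below are stand-ins with DistilBERT's shapes (6 layers, 12 heads), in place of the real `attentions` output. Matplotlib's default colormap matches the description here: yellow for large weights, dark navy near zero:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import torch

num_layers, num_heads, seq_len = 6, 12, 7  # DistilBERT-sized stand-in
attentions = [torch.rand(1, num_heads, seq_len, seq_len) for _ in range(num_layers)]

fig, axes = plt.subplots(num_layers, num_heads, figsize=(12, 6))
for layer, att in enumerate(attentions):  # rows: layers
    for head in range(num_heads):         # columns: attention heads
        ax = axes[layer, head]
        ax.imshow(att[0, head])           # default viridis: yellow high, navy low
        ax.set_xticks([])
        ax.set_yticks([])
fig.savefig("attention_grid.png")
```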
navy is zero. [00:31:33] So what we can do now is maybe walk through what a fine-tuning task looks like. First, in a project you're probably going to want to fine-tune a model, and that's fine; we'll go ahead and walk through an example of what that looks like here. [00:31:53] Okay, so what we can do as well is use some of the datasets that we can get from Hugging Face. It doesn't just have models; it has really nice datasets, and you're able to load those in as well. So here what we're going to be looking at is the IMDB dataset, again for sentiment analysis. We'll just look at only the first 50 tokens or so; generally, this is a helper function that we'll use for truncating the output that we get. And
then lastly, for actually making this dataset, we can use the DatasetDict class from Hugging Face. Again, that will basically give us this smaller dataset, specifying what we want for the train dataset as well as for validation. So here, what we're going to do for our mini dataset, for the purpose of this demonstration, is make train and val both from the IMDB train dataset. We'll shuffle it a bit, and then we're just going to select 128 examples and then 32 for validation. So it'll shuffle it around, take the first 128, and then take the next 32.
Um, and then we'll truncate those particular inputs that we get, again just to make sure we're efficient and can actually run this on a CPU. [00:33:38] Okay, so next what we can do is just see what this looks like. Again, this is kind of just like a dictionary (it's a wrapper class, almost) giving you your train dataset and then your validation dataset. And in particular, you can even look at what the first 10 of these look like. So first, the output: we specify train, we want to look at the first 10 entries in our train dataset, and the output of this is going to be a dictionary as well, which is pretty cool. So we have the first 10 text examples, which give the actual movie reviews here (this is given in a list), and then the second key that you get is the labels corresponding to each of
these, so whether it's positive or negative. Here, one is going to be a positive review and zero is negative, so it makes it really easy to use this for something like sentiment analysis. [00:34:43] Okay, so what we can do is go ahead and prepare the dataset and put it into batches of 16. What does this look like? We can call the map function that this small dataset dictionary has. So we call map and pass in a lambda function of what we want to actually do. Here, the lambda function is: for each example that we have, we want to tokenize the text. So this is basically saying how we want to preprocess this. And so here we're extracting the tokens' input IDs that we'll pass to the model, we're adding padding and truncation as well, we're going to do this in a batch, and the batch size will be 16.
Hopefully this makes sense. [00:35:36] Okay, so next we're basically just going to do a little more modification on what the dataset actually looks like. We're going to remove the column that corresponds to text, and then we're going to rename the column 'label' to 'labels'. So again, if we see this, it was called 'label'; we're just going to call it 'labels'. And we're going to remove the text column because we don't really need it anymore; we've already preprocessed our data into the input IDs that we need. Okay, and lastly, we're going to set the format to torch, so we can go ahead and just pass this into our model, our PyTorch model. [00:36:19] The question is: what is 'labels'? So 'label' here corresponds to, again in the context of sentiment analysis, just positive or negative, and here we're just renaming the
column. [00:36:33] Okay, so now we'll just go ahead and see what this looks like. Again, we're going to look at the train set and only these first two things. So here now we have the two labels that correspond to each of the reviews, and the input IDs that we get corresponding to each of the reviews as well. Lastly, we also get the attention mask. So it's basically just taking what you get out of the tokenizer and adding it back into the dataset, so it's really easy to pass in. [00:37:03] The question is: we truncated, which makes things easy, but how do you apply padding evenly? So first, you could manually set some high truncation limit, like we did; the second is that you can just go ahead and set padding to true, and then the padding is basically added based on
[00:37:39] Yeah, so the question is, I guess, doing it for all of them, all the text lists, evenly. So again, it just depends on the size of the data set you're loading in, right? If you're looking at particular batches at a time, you can just pad within that particular batch; you don't need to load the whole data set into memory and pad the entire data set the same way. So it's fine to do it within just batches.
[00:38:08] Yeah, the question was how were the input IDs added, and the answer is yes, it's basically done automatically. We had to manually remove the text column here, and this first line here does it. If you recall, the outputs of the tokenizer are basically just the input IDs and the attention mask.
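The two padding options from that answer can be sketched in plain Python; with a Hugging Face tokenizer they correspond roughly to `tokenizer(texts, truncation=True, max_length=...)` versus `tokenizer(texts, padding=True)`, which pads to the longest sequence in the batch. The pad ID and token IDs below are made up:

```python
PAD_ID = 0  # hypothetical pad token id

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest sequence in this batch only,
    mirroring per-batch (dynamic) padding."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)
print([len(seq) for seq in padded])  # [5, 5]: both padded to the batch maximum
```

Padding per batch, as the answer suggests, avoids padding the whole corpus to one global length.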
So it's smart enough to basically aggregate those together.
[00:38:39] Okay, the last thing we're going to do is basically just put these together. So we have this data set now that looks great; we're just going to import a PyTorch DataLoader, a typical, normal data loader, and then go ahead and load each of these data sets that we just had, specifying the batch size to be 16.
[00:39:02] Okay, so that's fine and great, and now, for training the model, it's basically exactly the same as what we would do in typical PyTorch. So again, you still want to compute the loss, you can backpropagate the loss, and everything. Yeah, so it's really up to your own design how you do the training. There are only a few asterisks here. One is that you can import specific optimizer types from the Transformers package.
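The DataLoader step can be sketched as follows; random tensors stand in for the tokenized reviews so the snippet runs on its own, and the batch size of 16 matches the walkthrough:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the tokenized train split: 32 fake examples of length 8.
input_ids = torch.randint(0, 1000, (32, 8))
labels = torch.randint(0, 2, (32,))
train_dataset = TensorDataset(input_ids, labels)

# A plain PyTorch DataLoader, as in the walkthrough.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

batch_ids, batch_labels = next(iter(train_loader))
print(batch_ids.shape)  # torch.Size([16, 8])
```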
So you can do Adam with weight decay, and you can get a linear schedule for the learning rate, which will decrease the learning rate over time for each training step. So again, it's basically up to your choice, but if you look at the structure of this code: we load the model for classification, we set a number of epochs and however many training steps we actually want to do, and we initialize our optimizer and get some learning rate schedule. And then from there it's basically the same thing as what we would do for a typical PyTorch model: we set the model to train mode, go ahead and pass in all these batches from the data loader, and then backpropagate, step the optimizer, and everything like that. So it's pretty similar to what we're used to seeing, essentially.
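A minimal sketch of that loop, assuming a toy linear model in place of the downloaded classifier so it runs offline; `get_linear_schedule_with_warmup` is the Transformers helper mentioned above, while AdamW here comes from `torch.optim` (Transformers' own AdamW re-export has been deprecated in recent versions):

```python
import torch
from torch.nn import functional as F
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Toy stand-in for the sequence classifier (no weights to download).
model = torch.nn.Linear(8, 2)

features = torch.randn(32, 8)
labels = torch.randint(0, 2, (32,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16)

num_epochs = 2
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

model.train()  # train mode, as in the walkthrough
for epoch in range(num_epochs):
    for batch_features, batch_labels in train_loader:
        logits = model(batch_features)
        loss = F.cross_entropy(logits, batch_labels)
        loss.backward()       # backpropagate the loss
        optimizer.step()      # step the optimizer
        scheduler.step()      # linearly decay the learning rate
        optimizer.zero_grad()
```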
[00:40:42] Awesome, so that'll go do its thing at some point.
[00:40:47] Okay, so that's one potential option: if you really like PyTorch, you can just go ahead and do that, and it's really nice and easy. The second thing is that Hugging Face actually has a trainer class that you're able to use that can handle most of these things. So again, if we do the same thing here, this will actually run once our model is done training. We can create our data set in the same way as before. Now what we need to use is this import of a training-arguments class: this is going to be basically a dictionary of all the things that we want to use when we actually train our model, and then this additional trainer class, which will handle the training kind of magically for us and wrap around it in that way.
[00:41:47] Okay, anyway. Okay, I think we're missing a directory, but I think, yeah, it's pretty straightforward how you want to train.
[00:41:56] So for here, at least, again there are the two key arguments. The first is training arguments. This has a number of specifications that you can actually pass through to it: where you want to log things; the batch size during training or during evaluation time for each device (in this case we're just using one GPU, but potentially you're using multiple GPUs); how long you want to train it for; how you want to evaluate it, which here is evaluating on an epoch level; what the learning rate is; and so on. So again, if you check the documentation, you can see that here there's a bunch of different arguments that you can give.
There's warm-up steps, warm-up ratio, weight decay; there are so many things. So again, it's basically like a dictionary; feel free to look at the different arguments you can pass in, but there are a couple of key ones here, and this basically mimics the same arguments that we used before in our explicit PyTorch method, here for Hugging Face.
[00:43:09] Similarly, what we do is just pass this into the trainer, and that will take care of basically everything for us, so that whole training loop that we did before is condensed into this one class for actually doing the training. We pass the model, the arguments, the train data set, the eval data set, what tokenizer we want to use, and then some function for computing metrics.
So here we pass in this function, eval, and it takes eval predictions as input. Basically, these predictions are given from the trainer and passed into this function, and we can split them into the actual logits and the labels that are predicted (or sorry, the ground-truth labels that we have), and then from here we can just calculate any sort of additional metrics we want, like accuracy, F1 score, recall, or whatever you want.
[00:44:07] Okay, so this is an alternative way of formulating that training loop.
[00:44:13] Okay, the last thing here as well is that we can have some sort of callback if you want to do things during the training process. So after every epoch or something like that, you might want to evaluate your model on the validation set, or just go ahead and dump some sort of output; that's what you can use a callback for. And so here, this is just a logging callback.
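The metrics function can be sketched like this; the name `compute_metrics` and the fake logits/labels are illustrative, but the shape of the callback (a logits/labels pair in, a dict of metrics out) is what the trainer expects:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Split the evaluation predictions into logits and ground-truth labels,
    then compute whatever metrics we want (here, just accuracy)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Tiny fake evaluation output: 3 examples, 2 classes, 2 of 3 predicted correctly.
fake_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]])
fake_labels = np.array([1, 0, 0])
print(compute_metrics((fake_logits, fake_labels)))
```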
It's just going to log that information about the process itself. Again, not super important, but in case you're looking to do any sort of callback during training, it's an easy way to add it in. The second is if you want to do early stopping as well. Early stopping will basically stop your model early, as it sounds, if it's not learning anything and a bunch of epochs are going by, and you can set that so that you don't waste compute time, or so you can see the results more easily. The question is whether there's a good choice for the patience value. I think it just depends on the model architecture; not really, I guess. It's pretty much up to your discretion.
[00:45:31] Okay, awesome. And so the last thing that we do is just call trainer.train. If you recall, this is just the instantiation of this trainer class.
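Both kinds of callback can be sketched as follows; the `LoggingCallback` class is a hypothetical minimal example, while `EarlyStoppingCallback` is the real Transformers class, with a patience of 3 chosen arbitrarily (as noted, the right value is up to your discretion):

```python
from transformers import EarlyStoppingCallback, TrainerCallback

class LoggingCallback(TrainerCallback):
    """Hypothetical callback that reports progress at the end of each epoch."""
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"finished epoch {state.epoch}")

# Stop training if the monitored metric fails to improve for 3 evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

# Both would be handed to the trainer via:
#   Trainer(..., callbacks=[LoggingCallback(), early_stopping])
```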
We call trainer.train and it'll just kind of go. So now it's training, which is great; it gives us a nice estimate of how long things are taking, what's going on, and what arguments we actually passed in. So that's just going to run, and likewise hopefully it'll train relatively quickly; okay, it'll take two minutes. We can also evaluate the model pretty easily as well: we just call trainer.predict on whatever data set we're interested in, so here it's the tokenized data set corresponding to the validation data set.
[00:46:22] Okay, hopefully we can pop that out soon. And lastly, if we saved anything to our model checkpoints (so hopefully this is saving stuff right now; yeah, this is going to continue saving stuff to the folder that we specified), then in case we ever want to load our model again from the weights that we've actually saved, we just pass in the name of the checkpoint, the relative path here to our checkpoint.
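Reloading from a checkpoint folder can be sketched like this; a tiny randomly initialised BERT stands in for the fine-tuned model so the snippet runs offline, and a temporary directory stands in for the `checkpoint-8` folder the trainer wrote (same on-disk layout, same `from_pretrained` call):

```python
import tempfile

import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random model in place of the fine-tuned checkpoint (no download).
config = BertConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                    intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)

ckpt_dir = tempfile.mkdtemp()    # stand-in for e.g. "results/checkpoint-8"
model.save_pretrained(ckpt_dir)  # writes config + weights, like a trainer checkpoint
reloaded = BertForSequenceClassification.from_pretrained(ckpt_dir)

# The reloaded weights match the saved ones exactly.
same = torch.equal(model.classifier.weight, reloaded.classifier.weight)
print(same)  # True
```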
So notice how we have some checkpoint-8 here, right? We just pass in the path to that folder, we load it back in along with the tokenizer, and it's the same thing as we did before.
[00:47:11] There are a few additional appendices for how to do different tasks as well: there's an appendix on generation, how to define a custom data set, and how to pipeline different tasks together. So this is using a pre-trained model that you can just use through the pipeline interface really easily, on different types of tasks like masked language modeling. But feel free to look through those on your own time, and, yeah, thanks a bunch.

================================================================================ LECTURE INDEX.md
================================================================================

CS224N – NLP with Deep Learning

Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPDoLWjDaRAdP9D
Total Videos: 23
Transcripts Downloaded: 23
Failed/No Captions: 0

---

Lectures

1. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors
   - Video: [https://www.youtube.com/watch?v=DzpHeXVSC5I](https://www.youtube.com/watch?v=DzpHeXVSC5I)
   - Transcript: [001_DzpHeXVSC5I.md](001_DzpHeXVSC5I.md)
2. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 2 - Word Vectors and Language Models
   - Video: [https://www.youtube.com/watch?v=nBor4jfWetQ](https://www.youtube.com/watch?v=nBor4jfWetQ)
   - Transcript: [002_nBor4jfWetQ.md](002_nBor4jfWetQ.md)
3. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 3 - Backpropagation, Neural Network
   - Video: [https://www.youtube.com/watch?v=HnliVHU2g9U](https://www.youtube.com/watch?v=HnliVHU2g9U)
   - Transcript: [003_HnliVHU2g9U.md](003_HnliVHU2g9U.md)
4. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 4 - Dependency Parsing
   - Video: [https://www.youtube.com/watch?v=KVKvde-_MYc](https://www.youtube.com/watch?v=KVKvde-_MYc)
   - Transcript: [004_KVKvde-_MYc.md](004_KVKvde-_MYc.md)
5. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
   - Video: [https://www.youtube.com/watch?v=fyc0Jzr74y4](https://www.youtube.com/watch?v=fyc0Jzr74y4)
   - Transcript: [005_fyc0Jzr74y4.md](005_fyc0Jzr74y4.md)
6. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 6 - Sequence to Sequence Models
   - Video: [https://www.youtube.com/watch?v=Ba6Fn1-Jsfw](https://www.youtube.com/watch?v=Ba6Fn1-Jsfw)
   - Transcript: [006_Ba6Fn1-Jsfw.md](006_Ba6Fn1-Jsfw.md)
7. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 7 - Attention, Final Projects and LLM Intro
   - Video: [https://www.youtube.com/watch?v=J7ruSOIzhrE](https://www.youtube.com/watch?v=J7ruSOIzhrE)
   - Transcript: [007_J7ruSOIzhrE.md](007_J7ruSOIzhrE.md)
8. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 8 - Self-Attention and Transformers
   - Video: [https://www.youtube.com/watch?v=LWMzyfvuehA](https://www.youtube.com/watch?v=LWMzyfvuehA)
   - Transcript: [008_LWMzyfvuehA.md](008_LWMzyfvuehA.md)
9. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 9 - Pretraining
   - Video: [https://www.youtube.com/watch?v=DGfCRXuNA2w](https://www.youtube.com/watch?v=DGfCRXuNA2w)
   - Transcript: [009_DGfCRXuNA2w.md](009_DGfCRXuNA2w.md)
10. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 11 - Natural Language Generation
    - Video: [https://www.youtube.com/watch?v=N9L32bFieEY](https://www.youtube.com/watch?v=N9L32bFieEY)
    - Transcript: [010_N9L32bFieEY.md](010_N9L32bFieEY.md)
11. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 10 - Post-training by Archit Sharma
    - Video: [https://www.youtube.com/watch?v=35X6zlhoCy4](https://www.youtube.com/watch?v=35X6zlhoCy4)
    - Transcript: [011_35X6zlhoCy4.md](011_35X6zlhoCy4.md)
12. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 11 - Benchmarking by Yann Dubois
    - Video: [https://www.youtube.com/watch?v=TO0CqzqiArM](https://www.youtube.com/watch?v=TO0CqzqiArM)
    - Transcript: [012_TO0CqzqiArM.md](012_TO0CqzqiArM.md)
13. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 12 - Efficient Training, Shikhar Murty
    - Video: [https://www.youtube.com/watch?v=UVX7SYGCKkA](https://www.youtube.com/watch?v=UVX7SYGCKkA)
    - Transcript: [013_UVX7SYGCKkA.md](013_UVX7SYGCKkA.md)
14. Stanford CS224N: NLP w/ DL| Spring 2024 | Lecture 13 - Brain-Computer Interfaces, Chaofei Fan
    - Video: [https://www.youtube.com/watch?v=tfVgHsKpRC8](https://www.youtube.com/watch?v=tfVgHsKpRC8)
    - Transcript: [014_tfVgHsKpRC8.md](014_tfVgHsKpRC8.md)
15. Stanford CS224N: NLP w/ DL | Spring 2024 | Lecture 14 - Reasoning and Agents by Shikhar Murty
    - Video: [https://www.youtube.com/watch?v=I0tj4Y7xaOQ](https://www.youtube.com/watch?v=I0tj4Y7xaOQ)
    - Transcript: [015_I0tj4Y7xaOQ.md](015_I0tj4Y7xaOQ.md)
16. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert
    - Video: [https://www.youtube.com/watch?v=dnF463_Ar9I](https://www.youtube.com/watch?v=dnF463_Ar9I)
    - Transcript: [016_dnF463_Ar9I.md](016_dnF463_Ar9I.md)
17. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 16 - ConvNets and TreeRNNs
    - Video: [https://www.youtube.com/watch?v=S8d-7v3f5MQ](https://www.youtube.com/watch?v=S8d-7v3f5MQ)
    - Transcript: [017_S8d-7v3f5MQ.md](017_S8d-7v3f5MQ.md)
18. Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 18 - NLP, Linguistics, Philosophy
    - Video: [https://www.youtube.com/watch?v=NxH0Y78xcF4](https://www.youtube.com/watch?v=NxH0Y78xcF4)
    - Transcript: [018_NxH0Y78xcF4.md](018_NxH0Y78xcF4.md)
19. Stanford CS224N NLP with Deep Learning | 2023 | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
    - Video: [https://www.youtube.com/watch?v=5vfIT5LOkR0](https://www.youtube.com/watch?v=5vfIT5LOkR0)
    - Transcript: [019_5vfIT5LOkR0.md](019_5vfIT5LOkR0.md)
20. Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
    - Video: [https://www.youtube.com/watch?v=cd3pRpEtjLs](https://www.youtube.com/watch?v=cd3pRpEtjLs)
    - Transcript: [020_cd3pRpEtjLs.md](020_cd3pRpEtjLs.md)
21. Stanford CS224N NLP with Deep Learning | 2023 | Python Tutorial, Manasi Sharma
    - Video: [https://www.youtube.com/watch?v=8j4wpU98Q74](https://www.youtube.com/watch?v=8j4wpU98Q74)
    - Transcript: [021_8j4wpU98Q74.md](021_8j4wpU98Q74.md)
Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul - Video: [https://www.youtube.com/watch?v=Uv0AIRr3ptg](https://www.youtube.com/watch?v=Uv0AIRr3ptg) - Transcript: [022_Uv0AIRr3ptg.md](022_Uv0AIRr3ptg.md) 23. Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel - Video: [https://www.youtube.com/watch?v=b80by3Xk_A8](https://www.youtube.com/watch?v=b80by3Xk_A8) - Transcript: [023_b80by3Xk_A8.md](023_b80by3Xk_A8.md)